public inbox for [email protected]
* [PATCH 00/11] remove aux CQE caches
@ 2024-03-15 15:29 Pavel Begunkov
  2024-03-15 15:29 ` [PATCH 01/11] io_uring: fix poll_remove stalled req completion Pavel Begunkov
                   ` (13 more replies)
  0 siblings, 14 replies; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 15:29 UTC (permalink / raw)
  To: io-uring; +Cc: linux-block, Jens Axboe, asml.silence, Kanchan Joshi, Ming Lei

Patch 1 is a fix.

Patches 2-7 are cleanups mainly dealing with issue_flags conversions,
misunderstandings of the flags and of the tw state. It'd be great to have
them even without the rest.

Patches 8-11 mandate ctx locking for task_work and finally remove the
CQE caches; instead we post directly into the CQ. Note that the cache
is used by multishot auxiliary completions.
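
To give an idea of the end result, a multishot auxiliary completion that
currently goes through the io_fill_cqe_req_aux() cache ends up posting
straight into the CQ. At a call site this is roughly (a simplified
sketch based on patches 10-11, not the exact code):

	/* before: CQE stashed in ctx->completion_cqes[], flushed later */
	io_fill_cqe_req_aux(req, issue_flags & IO_URING_F_COMPLETE_DEFER,
			    res, cflags | IORING_CQE_F_MORE);

	/* after: CQE goes directly into the CQ, only the flush is deferred */
	io_req_post_cqe(req, res, cflags | IORING_CQE_F_MORE);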

The nvme cmd change is tested with io_uring_passthrough.c; however,
there doesn't seem to be anything in liburing exercising the ublk paths.
How do we test it? It'd be great to have at least some basic tests
for it.

Pavel Begunkov (11):
  io_uring: fix poll_remove stalled req completion
  io_uring/cmd: kill one issue_flags to tw conversion
  io_uring/cmd: fix tw <-> issue_flags conversion
  io_uring/cmd: introduce io_uring_cmd_complete
  ublk: don't hard code IO_URING_F_UNLOCKED
  nvme/io_uring: don't hard code IO_URING_F_UNLOCKED
  io_uring/rw: avoid punting to io-wq directly
  io_uring: force tw ctx locking
  io_uring: remove struct io_tw_state::locked
  io_uring: refactor io_fill_cqe_req_aux
  io_uring: get rid of intermediate aux cqe caches

 drivers/block/ublk_drv.c       |  18 ++---
 drivers/nvme/host/ioctl.c      |   9 ++-
 include/linux/io_uring/cmd.h   |  24 ++++++
 include/linux/io_uring_types.h |   5 +-
 io_uring/io_uring.c            | 132 +++++++++------------------------
 io_uring/io_uring.h            |   8 +-
 io_uring/net.c                 |   6 +-
 io_uring/poll.c                |   7 +-
 io_uring/rw.c                  |  18 +----
 io_uring/timeout.c             |   8 +-
 io_uring/uring_cmd.c           |  14 ++--
 io_uring/waitid.c              |   2 +-
 12 files changed, 95 insertions(+), 156 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH 01/11] io_uring: fix poll_remove stalled req completion
  2024-03-15 15:29 [PATCH 00/11] remove aux CQE caches Pavel Begunkov
@ 2024-03-15 15:29 ` Pavel Begunkov
  2024-03-15 15:29 ` [PATCH 02/11] io_uring/cmd: kill one issue_flags to tw conversion Pavel Begunkov
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 15:29 UTC (permalink / raw)
  To: io-uring; +Cc: linux-block, Jens Axboe, asml.silence, Kanchan Joshi, Ming Lei

Taking the ctx lock is not enough to use the deferred request completion
infrastructure: the request will get queued into the list, but nobody
expects it there, so it will sit there until the next
io_submit_flush_completions(). It's hard to care about the cancellation
path, so complete it via tw.
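
For illustration, the deferred path only queues the request and relies
on a later flush, which nothing in the io_poll_remove() path issues. A
rough sketch of the helper in question (simplified from io_uring.h, not
verbatim):

	static inline void io_req_complete_defer(struct io_kiocb *req)
	{
		struct io_submit_state *state = &req->ctx->submit_state;

		/* only queues the completion... */
		wq_list_add_tail(&req->comp_list, &state->compl_reqs);
		/* ...someone must call io_submit_flush_completions() later */
	}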

Fixes: ef7dfac51d8ed ("io_uring/poll: serialize poll linked timer start with poll removal")
Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/poll.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/io_uring/poll.c b/io_uring/poll.c
index 5f779139cae1..6db1dcb2c797 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -996,7 +996,6 @@ int io_poll_remove(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_hash_bucket *bucket;
 	struct io_kiocb *preq;
 	int ret2, ret = 0;
-	struct io_tw_state ts = { .locked = true };
 
 	io_ring_submit_lock(ctx, issue_flags);
 	preq = io_poll_find(ctx, true, &cd, &ctx->cancel_table, &bucket);
@@ -1045,7 +1044,8 @@ int io_poll_remove(struct io_kiocb *req, unsigned int issue_flags)
 
 	req_set_fail(preq);
 	io_req_set_res(preq, -ECANCELED, 0);
-	io_req_task_complete(preq, &ts);
+	preq->io_task_work.func = io_req_task_complete;
+	io_req_task_work_add(preq);
 out:
 	io_ring_submit_unlock(ctx, issue_flags);
 	if (ret < 0) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 02/11] io_uring/cmd: kill one issue_flags to tw conversion
  2024-03-15 15:29 [PATCH 00/11] remove aux CQE caches Pavel Begunkov
  2024-03-15 15:29 ` [PATCH 01/11] io_uring: fix poll_remove stalled req completion Pavel Begunkov
@ 2024-03-15 15:29 ` Pavel Begunkov
  2024-03-15 15:29 ` [PATCH 03/11] io_uring/cmd: fix tw <-> issue_flags conversion Pavel Begunkov
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 15:29 UTC (permalink / raw)
  To: io-uring; +Cc: linux-block, Jens Axboe, asml.silence, Kanchan Joshi, Ming Lei

io_uring cmd converts struct io_tw_state to issue_flags and later back
to io_tw_state; it's awfully ill-fated, not to mention that the
intermediate issue_flags state is not correct.

Get rid of the last conversion, drag through tw everything that came
with IO_URING_F_UNLOCKED, and replace io_req_task_complete() with a
direct call to io_req_complete_defer(), at least for the time being.
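
For context, the round trip being removed looks roughly like this before
the patch (a simplified sketch of the current code, see the diff below):

	/* task_work callback: io_tw_state -> issue_flags */
	static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts)
	{
		struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
		unsigned issue_flags = ts->locked ? 0 : IO_URING_F_UNLOCKED;

		ioucmd->task_work_cb(ioucmd, issue_flags);
	}

	/* ...which may end up back in io_uring_cmd_done(): issue_flags -> io_tw_state */
	struct io_tw_state ts = {
		.locked = !(issue_flags & IO_URING_F_UNLOCKED),
	};
	io_req_task_complete(req, &ts);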

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/uring_cmd.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 42f63adfa54a..f197e8c22965 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -100,11 +100,11 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2,
 	if (req->ctx->flags & IORING_SETUP_IOPOLL) {
 		/* order with io_iopoll_req_issued() checking ->iopoll_complete */
 		smp_store_release(&req->iopoll_completed, 1);
+	} else if (!(issue_flags & IO_URING_F_UNLOCKED)) {
+		io_req_complete_defer(req);
 	} else {
-		struct io_tw_state ts = {
-			.locked = !(issue_flags & IO_URING_F_UNLOCKED),
-		};
-		io_req_task_complete(req, &ts);
+		req->io_task_work.func = io_req_task_complete;
+		io_req_task_work_add(req);
 	}
 }
 EXPORT_SYMBOL_GPL(io_uring_cmd_done);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 03/11] io_uring/cmd: fix tw <-> issue_flags conversion
  2024-03-15 15:29 [PATCH 00/11] remove aux CQE caches Pavel Begunkov
  2024-03-15 15:29 ` [PATCH 01/11] io_uring: fix poll_remove stalled req completion Pavel Begunkov
  2024-03-15 15:29 ` [PATCH 02/11] io_uring/cmd: kill one issue_flags to tw conversion Pavel Begunkov
@ 2024-03-15 15:29 ` Pavel Begunkov
  2024-03-15 15:29 ` [PATCH 04/11] io_uring/cmd: introduce io_uring_cmd_complete Pavel Begunkov
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 15:29 UTC (permalink / raw)
  To: io-uring; +Cc: linux-block, Jens Axboe, asml.silence, Kanchan Joshi, Ming Lei

!IO_URING_F_UNLOCKED does not translate to availability of the deferred
completion infra; IO_URING_F_COMPLETE_DEFER does, and that's what we
should pass and look for in order to use io_req_complete_defer() and
other variants.

Luckily, it's not a real problem, as two wrongs actually made it right,
at least as far as io_uring_cmd_work() goes.
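
Spelled out, the pre-patch state (after the previous cleanup) is roughly
the following, with the two wrongs cancelling each other out:

	/* wrong #1: tw passes 0 rather than IO_URING_F_COMPLETE_DEFER */
	unsigned issue_flags = ts->locked ? 0 : IO_URING_F_UNLOCKED;

	/* wrong #2: io_uring_cmd_done() then checks the wrong flag... */
	if (!(issue_flags & IO_URING_F_UNLOCKED))
		/* ...but the locked (issue_flags == 0) case still lands here */
		io_req_complete_defer(req);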

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/uring_cmd.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index f197e8c22965..ec38a8d4836d 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -56,7 +56,11 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_mark_cancelable);
 static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts)
 {
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-	unsigned issue_flags = ts->locked ? 0 : IO_URING_F_UNLOCKED;
+	unsigned issue_flags = IO_URING_F_UNLOCKED;
+
+	/* locked task_work executor checks the deffered list completion */
+	if (ts->locked)
+		issue_flags = IO_URING_F_COMPLETE_DEFER;
 
 	ioucmd->task_work_cb(ioucmd, issue_flags);
 }
@@ -100,7 +104,9 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2,
 	if (req->ctx->flags & IORING_SETUP_IOPOLL) {
 		/* order with io_iopoll_req_issued() checking ->iopoll_complete */
 		smp_store_release(&req->iopoll_completed, 1);
-	} else if (!(issue_flags & IO_URING_F_UNLOCKED)) {
+	} else if (issue_flags & IO_URING_F_COMPLETE_DEFER) {
+		if (WARN_ON_ONCE(issue_flags & IO_URING_F_UNLOCKED))
+			return;
 		io_req_complete_defer(req);
 	} else {
 		req->io_task_work.func = io_req_task_complete;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 04/11] io_uring/cmd: introduce io_uring_cmd_complete
  2024-03-15 15:29 [PATCH 00/11] remove aux CQE caches Pavel Begunkov
                   ` (2 preceding siblings ...)
  2024-03-15 15:29 ` [PATCH 03/11] io_uring/cmd: fix tw <-> issue_flags conversion Pavel Begunkov
@ 2024-03-15 15:29 ` Pavel Begunkov
  2024-03-15 15:29 ` [PATCH 05/11] ublk: don't hard code IO_URING_F_UNLOCKED Pavel Begunkov
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 15:29 UTC (permalink / raw)
  To: io-uring; +Cc: linux-block, Jens Axboe, asml.silence, Kanchan Joshi, Ming Lei

io_uring_cmd_complete() does exactly what io_uring_cmd_done() does, that
is completing the request, but doesn't ask for an issue_flags argument.
We have a couple of users hardcoding some random issue_flags values in
drivers, which they absolutely should not do. This function will be used
to get rid of them. Also, add comments warning users that they're only
allowed to pass issue_flags that were provided to them by io_uring.
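
As a usage sketch (hypothetical driver code, not part of this series): a
callback invoked by io_uring keeps passing through the issue_flags it
was given, while a driver-internal completion path that has no
issue_flags uses the new helper:

	/* invoked by io_uring, issue_flags provided by the core */
	static void my_drv_cmd_cb(struct io_uring_cmd *cmd, unsigned issue_flags)
	{
		io_uring_cmd_done(cmd, 0, 0, issue_flags);
	}

	/* driver-internal path, no issue_flags to pass on */
	static void my_drv_finish(struct io_uring_cmd *cmd, int err)
	{
		io_uring_cmd_complete(cmd, err, 0);
	}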

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring/cmd.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
index e453a997c060..9cbe986eab7d 100644
--- a/include/linux/io_uring/cmd.h
+++ b/include/linux/io_uring/cmd.h
@@ -26,12 +26,25 @@ static inline const void *io_uring_sqe_cmd(const struct io_uring_sqe *sqe)
 #if defined(CONFIG_IO_URING)
 int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 			      struct iov_iter *iter, void *ioucmd);
+
+/*
+ * Completes the request, i.e. posts an io_uring CQE and deallocates @ioucmd
+ * and a corresponding io_uring request.
+ *
+ * Note: the caller must never invent the  @issue_flags mask, it's only allowed
+ * to pass what has been provided by the core io_uring code.
+ */
 void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t res2,
 			unsigned issue_flags);
+
 void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
 			    void (*task_work_cb)(struct io_uring_cmd *, unsigned),
 			    unsigned flags);
 
+/*
+ * The caller must never invent the @issue_flags mask, it's only allowed
+ * to pass what has been provided by the core io_uring code.
+ */
 void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
 		unsigned int issue_flags);
 
@@ -56,6 +69,17 @@ static inline void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
 }
 #endif
 
+/*
+ * Completes the request, i.e. posts an io_uring CQE and deallocates @ioucmd
+ * and a corresponding io_uring request. Similar to io_uring_cmd_done() but
+ * without issue_flags argument.
+ */
+static inline void io_uring_cmd_complete(struct io_uring_cmd *ioucmd,
+					 ssize_t ret, ssize_t res2)
+{
+	io_uring_cmd_done(ioucmd, ret, res2, IO_URING_F_UNLOCKED);
+}
+
 /* users must follow the IOU_F_TWQ_LAZY_WAKE semantics */
 static inline void io_uring_cmd_do_in_task_lazy(struct io_uring_cmd *ioucmd,
 			void (*task_work_cb)(struct io_uring_cmd *, unsigned))
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 05/11] ublk: don't hard code IO_URING_F_UNLOCKED
  2024-03-15 15:29 [PATCH 00/11] remove aux CQE caches Pavel Begunkov
                   ` (3 preceding siblings ...)
  2024-03-15 15:29 ` [PATCH 04/11] io_uring/cmd: introduce io_uring_cmd_complete Pavel Begunkov
@ 2024-03-15 15:29 ` Pavel Begunkov
  2024-03-15 15:29 ` [PATCH 06/11] nvme/io_uring: " Pavel Begunkov
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 15:29 UTC (permalink / raw)
  To: io-uring; +Cc: linux-block, Jens Axboe, asml.silence, Kanchan Joshi, Ming Lei

uring_cmd implementations should not try to guess issue_flags; just use
the newly added io_uring_cmd_complete(). We're losing an optimisation in
the cancellation path in ublk_uring_cmd_cancel_fn(), as the completion
will now always go through an extra task_work round trip, but the
assumption is that we don't care that much about it.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 drivers/block/ublk_drv.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index bea3d5cf8a83..97dceecadab2 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1417,8 +1417,7 @@ static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq)
 	return true;
 }
 
-static void ublk_cancel_cmd(struct ublk_queue *ubq, struct ublk_io *io,
-		unsigned int issue_flags)
+static void ublk_cancel_cmd(struct ublk_queue *ubq, struct ublk_io *io)
 {
 	bool done;
 
@@ -1432,15 +1431,14 @@ static void ublk_cancel_cmd(struct ublk_queue *ubq, struct ublk_io *io,
 	spin_unlock(&ubq->cancel_lock);
 
 	if (!done)
-		io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, 0, issue_flags);
+		io_uring_cmd_complete(io->cmd, UBLK_IO_RES_ABORT, 0);
 }
 
 /*
  * The ublk char device won't be closed when calling cancel fn, so both
  * ublk device and queue are guaranteed to be live
  */
-static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
-		unsigned int issue_flags)
+static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd)
 {
 	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
 	struct ublk_queue *ubq = pdu->ubq;
@@ -1464,7 +1462,7 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
 
 	io = &ubq->ios[pdu->tag];
 	WARN_ON_ONCE(io->cmd != cmd);
-	ublk_cancel_cmd(ubq, io, issue_flags);
+	ublk_cancel_cmd(ubq, io);
 
 	if (need_schedule) {
 		if (ublk_can_use_recovery(ub))
@@ -1484,7 +1482,7 @@ static void ublk_cancel_queue(struct ublk_queue *ubq)
 	int i;
 
 	for (i = 0; i < ubq->q_depth; i++)
-		ublk_cancel_cmd(ubq, &ubq->ios[i], IO_URING_F_UNLOCKED);
+		ublk_cancel_cmd(ubq, &ubq->ios[i]);
 }
 
 /* Cancel all pending commands, must be called after del_gendisk() returns */
@@ -1777,7 +1775,7 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 	return -EIOCBQUEUED;
 
  out:
-	io_uring_cmd_done(cmd, ret, 0, issue_flags);
+	io_uring_cmd_complete(cmd, ret, 0);
 	pr_devel("%s: complete: cmd op %d, tag %d ret %x io_flags %x\n",
 			__func__, cmd_op, tag, ret, io->flags);
 	return -EIOCBQUEUED;
@@ -1842,7 +1840,7 @@ static void ublk_ch_uring_cmd_cb(struct io_uring_cmd *cmd,
 static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 {
 	if (unlikely(issue_flags & IO_URING_F_CANCEL)) {
-		ublk_uring_cmd_cancel_fn(cmd, issue_flags);
+		ublk_uring_cmd_cancel_fn(cmd);
 		return 0;
 	}
 
@@ -2930,7 +2928,7 @@ static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
 	if (ub)
 		ublk_put_device(ub);
  out:
-	io_uring_cmd_done(cmd, ret, 0, issue_flags);
+	io_uring_cmd_complete(cmd, ret, 0);
 	pr_devel("%s: cmd done ret %d cmd_op %x, dev id %d qid %d\n",
 			__func__, ret, cmd->cmd_op, header->dev_id, header->queue_id);
 	return -EIOCBQUEUED;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 06/11] nvme/io_uring: don't hard code IO_URING_F_UNLOCKED
  2024-03-15 15:29 [PATCH 00/11] remove aux CQE caches Pavel Begunkov
                   ` (4 preceding siblings ...)
  2024-03-15 15:29 ` [PATCH 05/11] ublk: don't hard code IO_URING_F_UNLOCKED Pavel Begunkov
@ 2024-03-15 15:29 ` Pavel Begunkov
  2024-03-15 15:29 ` [PATCH 07/11] io_uring/rw: avoid punting to io-wq directly Pavel Begunkov
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 15:29 UTC (permalink / raw)
  To: io-uring; +Cc: linux-block, Jens Axboe, asml.silence, Kanchan Joshi, Ming Lei

uring_cmd implementations should not try to guess issue_flags; use the
freshly added helper io_uring_cmd_complete() instead. Note that the
polled path now has to unmap the bio itself, as it no longer goes
through nvme_uring_task_cb(), which used to do that.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 drivers/nvme/host/ioctl.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 3dfd5ae99ae0..1a7b5af42dbc 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -426,10 +426,13 @@ static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
 	 * For iopoll, complete it directly.
 	 * Otherwise, move the completion to task work.
 	 */
-	if (blk_rq_is_poll(req))
-		nvme_uring_task_cb(ioucmd, IO_URING_F_UNLOCKED);
-	else
+	if (blk_rq_is_poll(req)) {
+		if (pdu->bio)
+			blk_rq_unmap_user(pdu->bio);
+		io_uring_cmd_complete(ioucmd, pdu->status, pdu->result);
+	} else {
 		io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
+	}
 
 	return RQ_END_IO_FREE;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 07/11] io_uring/rw: avoid punting to io-wq directly
  2024-03-15 15:29 [PATCH 00/11] remove aux CQE caches Pavel Begunkov
                   ` (5 preceding siblings ...)
  2024-03-15 15:29 ` [PATCH 06/11] nvme/io_uring: " Pavel Begunkov
@ 2024-03-15 15:29 ` Pavel Begunkov
  2024-03-15 15:29 ` [PATCH 08/11] io_uring: force tw ctx locking Pavel Begunkov
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 15:29 UTC (permalink / raw)
  To: io-uring; +Cc: linux-block, Jens Axboe, asml.silence, Kanchan Joshi, Ming Lei

kiocb_done() shouldn't be in the business of specifically redirecting
requests to io-wq. Remove the hop to tw that then queues io-wq work;
return -EAGAIN instead and let the core io_uring code handle the
offloading.
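
For reference, the returned -EAGAIN is then picked up by the generic
issue path, which arms poll or punts to io-wq as appropriate. Roughly
(simplified from io_uring.c, see the io_queue_async() hunk below):

	static void io_queue_async(struct io_kiocb *req, int ret)
	{
		if (ret != -EAGAIN || (req->flags & REQ_F_NOWAIT)) {
			io_req_defer_failed(req, ret);
			return;
		}

		switch (io_arm_poll_handler(req, 0)) {
		case IO_APOLL_READY:
			io_req_task_queue(req);
			break;
		case IO_APOLL_ABORTED:
			io_kbuf_recycle(req, 0);
			io_queue_iowq(req);	/* offload to io-wq */
			break;
		case IO_APOLL_OK:
			break;
		}
	}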

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 8 ++++----
 io_uring/io_uring.h | 1 -
 io_uring/rw.c       | 8 +-------
 3 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3ae4bb988906..4ad85460ed2a 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -498,7 +498,7 @@ static void io_prep_async_link(struct io_kiocb *req)
 	}
 }
 
-void io_queue_iowq(struct io_kiocb *req, struct io_tw_state *ts_dont_use)
+static void io_queue_iowq(struct io_kiocb *req)
 {
 	struct io_kiocb *link = io_prep_linked_timeout(req);
 	struct io_uring_task *tctx = req->task->io_uring;
@@ -1505,7 +1505,7 @@ void io_req_task_submit(struct io_kiocb *req, struct io_tw_state *ts)
 	if (unlikely(req->task->flags & PF_EXITING))
 		io_req_defer_failed(req, -EFAULT);
 	else if (req->flags & REQ_F_FORCE_ASYNC)
-		io_queue_iowq(req, ts);
+		io_queue_iowq(req);
 	else
 		io_queue_sqe(req);
 }
@@ -2088,7 +2088,7 @@ static void io_queue_async(struct io_kiocb *req, int ret)
 		break;
 	case IO_APOLL_ABORTED:
 		io_kbuf_recycle(req, 0);
-		io_queue_iowq(req, NULL);
+		io_queue_iowq(req);
 		break;
 	case IO_APOLL_OK:
 		break;
@@ -2135,7 +2135,7 @@ static void io_queue_sqe_fallback(struct io_kiocb *req)
 		if (unlikely(req->ctx->drain_active))
 			io_drain_req(req);
 		else
-			io_queue_iowq(req, NULL);
+			io_queue_iowq(req);
 	}
 }
 
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 6426ee382276..472ba5692ba8 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -79,7 +79,6 @@ struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
 void __io_req_task_work_add(struct io_kiocb *req, unsigned flags);
 bool io_alloc_async_data(struct io_kiocb *req);
 void io_req_task_queue(struct io_kiocb *req);
-void io_queue_iowq(struct io_kiocb *req, struct io_tw_state *ts_dont_use);
 void io_req_task_complete(struct io_kiocb *req, struct io_tw_state *ts);
 void io_req_task_queue_fail(struct io_kiocb *req, int ret);
 void io_req_task_submit(struct io_kiocb *req, struct io_tw_state *ts);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 0585ebcc9773..576934dbf833 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -187,12 +187,6 @@ static inline loff_t *io_kiocb_update_pos(struct io_kiocb *req)
 	return NULL;
 }
 
-static void io_req_task_queue_reissue(struct io_kiocb *req)
-{
-	req->io_task_work.func = io_queue_iowq;
-	io_req_task_work_add(req);
-}
-
 #ifdef CONFIG_BLOCK
 static bool io_resubmit_prep(struct io_kiocb *req)
 {
@@ -405,7 +399,7 @@ static int kiocb_done(struct io_kiocb *req, ssize_t ret,
 	if (req->flags & REQ_F_REISSUE) {
 		req->flags &= ~REQ_F_REISSUE;
 		if (io_resubmit_prep(req))
-			io_req_task_queue_reissue(req);
+			return -EAGAIN;
 		else
 			io_req_task_queue_fail(req, final_ret);
 	}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 08/11] io_uring: force tw ctx locking
  2024-03-15 15:29 [PATCH 00/11] remove aux CQE caches Pavel Begunkov
                   ` (6 preceding siblings ...)
  2024-03-15 15:29 ` [PATCH 07/11] io_uring/rw: avoid punting to io-wq directly Pavel Begunkov
@ 2024-03-15 15:29 ` Pavel Begunkov
  2024-03-15 15:40   ` Jens Axboe
  2024-03-15 15:29 ` [PATCH 09/11] io_uring: remove struct io_tw_state::locked Pavel Begunkov
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 15:29 UTC (permalink / raw)
  To: io-uring; +Cc: linux-block, Jens Axboe, asml.silence, Kanchan Joshi, Ming Lei

We can run normal task_work without locking the ctx, however we try to
lock anyway and most handlers prefer or require it locked. It might have
been interesting for a multi-submitter ring with high contention
completing async read/write requests via task_work, however that would
still need to go through io_req_complete_post() and potentially take the
lock for rsrc node putting or some other case.

In other words, it's hard to care about it, so always force the locking.
The case described would also end up taking the lock anyway because of
various io_uring caches.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4ad85460ed2a..0cef5c4ddc98 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1191,8 +1191,9 @@ struct llist_node *io_handle_tw_list(struct llist_node *node,
 		if (req->ctx != ctx) {
 			ctx_flush_and_put(ctx, &ts);
 			ctx = req->ctx;
-			/* if not contended, grab and improve batching */
-			ts.locked = mutex_trylock(&ctx->uring_lock);
+
+			ts.locked = true;
+			mutex_lock(&ctx->uring_lock);
 			percpu_ref_get(&ctx->refs);
 		}
 		INDIRECT_CALL_2(req->io_task_work.func,
@@ -1453,11 +1454,9 @@ static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
 
 	if (io_run_local_work_continue(ctx, ret, min_events))
 		goto again;
-	if (ts->locked) {
-		io_submit_flush_completions(ctx);
-		if (io_run_local_work_continue(ctx, ret, min_events))
-			goto again;
-	}
+	io_submit_flush_completions(ctx);
+	if (io_run_local_work_continue(ctx, ret, min_events))
+		goto again;
 
 	trace_io_uring_local_work_run(ctx, ret, loops);
 	return ret;
@@ -1481,14 +1480,12 @@ static inline int io_run_local_work_locked(struct io_ring_ctx *ctx,
 
 static int io_run_local_work(struct io_ring_ctx *ctx, int min_events)
 {
-	struct io_tw_state ts = {};
+	struct io_tw_state ts = { .locked = true };
 	int ret;
 
-	ts.locked = mutex_trylock(&ctx->uring_lock);
+	mutex_lock(&ctx->uring_lock);
 	ret = __io_run_local_work(ctx, &ts, min_events);
-	if (ts.locked)
-		mutex_unlock(&ctx->uring_lock);
-
+	mutex_unlock(&ctx->uring_lock);
 	return ret;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 09/11] io_uring: remove struct io_tw_state::locked
  2024-03-15 15:29 [PATCH 00/11] remove aux CQE caches Pavel Begunkov
                   ` (7 preceding siblings ...)
  2024-03-15 15:29 ` [PATCH 08/11] io_uring: force tw ctx locking Pavel Begunkov
@ 2024-03-15 15:29 ` Pavel Begunkov
  2024-03-15 15:30 ` [PATCH 10/11] io_uring: refactor io_fill_cqe_req_aux Pavel Begunkov
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 15:29 UTC (permalink / raw)
  To: io-uring; +Cc: linux-block, Jens Axboe, asml.silence, Kanchan Joshi, Ming Lei

ctx is always locked for task_work now, so get rid of struct
io_tw_state::locked. Note I'm stopping one step before removing
io_tw_state altogether, even though it is now empty, because it still
serves the purpose of indicating which functions are tw callbacks and
forcing users not to invoke them carelessly out of the wrong context.
The removal can always be done later.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h |  2 --
 io_uring/io_uring.c            | 31 ++++++++-----------------------
 io_uring/io_uring.h            |  5 +----
 io_uring/poll.c                |  2 +-
 io_uring/rw.c                  |  6 ++----
 io_uring/timeout.c             |  8 ++------
 io_uring/uring_cmd.c           |  8 ++------
 io_uring/waitid.c              |  2 +-
 8 files changed, 17 insertions(+), 47 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index e24893625085..5a2afbc93887 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -439,8 +439,6 @@ struct io_ring_ctx {
 };
 
 struct io_tw_state {
-	/* ->uring_lock is taken, callbacks can use io_tw_lock to lock it */
-	bool locked;
 };
 
 enum {
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0cef5c4ddc98..1c4bbfc411d1 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -252,14 +252,12 @@ static __cold void io_fallback_req_func(struct work_struct *work)
 						fallback_work.work);
 	struct llist_node *node = llist_del_all(&ctx->fallback_llist);
 	struct io_kiocb *req, *tmp;
-	struct io_tw_state ts = { .locked = true, };
+	struct io_tw_state ts = {};
 
 	percpu_ref_get(&ctx->refs);
 	mutex_lock(&ctx->uring_lock);
 	llist_for_each_entry_safe(req, tmp, node, io_task_work.node)
 		req->io_task_work.func(req, &ts);
-	if (WARN_ON_ONCE(!ts.locked))
-		return;
 	io_submit_flush_completions(ctx);
 	mutex_unlock(&ctx->uring_lock);
 	percpu_ref_put(&ctx->refs);
@@ -1163,11 +1161,9 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, struct io_tw_state *ts)
 		return;
 	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
 		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
-	if (ts->locked) {
-		io_submit_flush_completions(ctx);
-		mutex_unlock(&ctx->uring_lock);
-		ts->locked = false;
-	}
+
+	io_submit_flush_completions(ctx);
+	mutex_unlock(&ctx->uring_lock);
 	percpu_ref_put(&ctx->refs);
 }
 
@@ -1191,8 +1187,6 @@ struct llist_node *io_handle_tw_list(struct llist_node *node,
 		if (req->ctx != ctx) {
 			ctx_flush_and_put(ctx, &ts);
 			ctx = req->ctx;
-
-			ts.locked = true;
 			mutex_lock(&ctx->uring_lock);
 			percpu_ref_get(&ctx->refs);
 		}
@@ -1465,22 +1459,16 @@ static int __io_run_local_work(struct io_ring_ctx *ctx, struct io_tw_state *ts,
 static inline int io_run_local_work_locked(struct io_ring_ctx *ctx,
 					   int min_events)
 {
-	struct io_tw_state ts = { .locked = true, };
-	int ret;
+	struct io_tw_state ts = {};
 
 	if (llist_empty(&ctx->work_llist))
 		return 0;
-
-	ret = __io_run_local_work(ctx, &ts, min_events);
-	/* shouldn't happen! */
-	if (WARN_ON_ONCE(!ts.locked))
-		mutex_lock(&ctx->uring_lock);
-	return ret;
+	return __io_run_local_work(ctx, &ts, min_events);
 }
 
 static int io_run_local_work(struct io_ring_ctx *ctx, int min_events)
 {
-	struct io_tw_state ts = { .locked = true };
+	struct io_tw_state ts = {};
 	int ret;
 
 	mutex_lock(&ctx->uring_lock);
@@ -1708,10 +1696,7 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, long min)
 
 void io_req_task_complete(struct io_kiocb *req, struct io_tw_state *ts)
 {
-	if (ts->locked)
-		io_req_complete_defer(req);
-	else
-		io_req_complete_post(req, IO_URING_F_UNLOCKED);
+	io_req_complete_defer(req);
 }
 
 /*
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 472ba5692ba8..6cad3ef3408b 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -344,10 +344,7 @@ static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
 
 static inline void io_tw_lock(struct io_ring_ctx *ctx, struct io_tw_state *ts)
 {
-	if (!ts->locked) {
-		mutex_lock(&ctx->uring_lock);
-		ts->locked = true;
-	}
+	lockdep_assert_held(&ctx->uring_lock);
 }
 
 /*
diff --git a/io_uring/poll.c b/io_uring/poll.c
index 6db1dcb2c797..8901dd118e50 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -322,7 +322,7 @@ static int io_poll_check_events(struct io_kiocb *req, struct io_tw_state *ts)
 			__poll_t mask = mangle_poll(req->cqe.res &
 						    req->apoll_events);
 
-			if (!io_fill_cqe_req_aux(req, ts->locked, mask,
+			if (!io_fill_cqe_req_aux(req, true, mask,
 						 IORING_CQE_F_MORE)) {
 				io_req_set_res(req, mask, 0);
 				return IOU_POLL_REMOVE_POLL_USE_RES;
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 576934dbf833..c7f9246ff508 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -305,11 +305,9 @@ void io_req_rw_complete(struct io_kiocb *req, struct io_tw_state *ts)
 
 	io_req_io_end(req);
 
-	if (req->flags & (REQ_F_BUFFER_SELECTED|REQ_F_BUFFER_RING)) {
-		unsigned issue_flags = ts->locked ? 0 : IO_URING_F_UNLOCKED;
+	if (req->flags & (REQ_F_BUFFER_SELECTED|REQ_F_BUFFER_RING))
+		req->cqe.flags |= io_put_kbuf(req, 0);
 
-		req->cqe.flags |= io_put_kbuf(req, issue_flags);
-	}
 	io_req_task_complete(req, ts);
 }
 
diff --git a/io_uring/timeout.c b/io_uring/timeout.c
index 7fd7dbb211d6..0a48e6acd0b2 100644
--- a/io_uring/timeout.c
+++ b/io_uring/timeout.c
@@ -72,10 +72,7 @@ static void io_timeout_complete(struct io_kiocb *req, struct io_tw_state *ts)
 	struct io_ring_ctx *ctx = req->ctx;
 
 	if (!io_timeout_finish(timeout, data)) {
-		bool filled;
-		filled = io_fill_cqe_req_aux(req, ts->locked, -ETIME,
-					     IORING_CQE_F_MORE);
-		if (filled) {
+		if (io_fill_cqe_req_aux(req, true, -ETIME, IORING_CQE_F_MORE)) {
 			/* re-arm timer */
 			spin_lock_irq(&ctx->timeout_lock);
 			list_add(&timeout->list, ctx->timeout_list.prev);
@@ -301,7 +298,6 @@ int io_timeout_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd)
 
 static void io_req_task_link_timeout(struct io_kiocb *req, struct io_tw_state *ts)
 {
-	unsigned issue_flags = ts->locked ? 0 : IO_URING_F_UNLOCKED;
 	struct io_timeout *timeout = io_kiocb_to_cmd(req, struct io_timeout);
 	struct io_kiocb *prev = timeout->prev;
 	int ret = -ENOENT;
@@ -313,7 +309,7 @@ static void io_req_task_link_timeout(struct io_kiocb *req, struct io_tw_state *t
 				.data		= prev->cqe.user_data,
 			};
 
-			ret = io_try_cancel(req->task->io_uring, &cd, issue_flags);
+			ret = io_try_cancel(req->task->io_uring, &cd, 0);
 		}
 		io_req_set_res(req, ret ?: -ETIME, 0);
 		io_req_task_complete(req, ts);
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index ec38a8d4836d..e45d4cd5ef82 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -56,13 +56,9 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_mark_cancelable);
 static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts)
 {
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-	unsigned issue_flags = IO_URING_F_UNLOCKED;
 
-	/* locked task_work executor checks the deffered list completion */
-	if (ts->locked)
-		issue_flags = IO_URING_F_COMPLETE_DEFER;
-
-	ioucmd->task_work_cb(ioucmd, issue_flags);
+	/* task_work executor checks the deffered list completion */
+	ioucmd->task_work_cb(ioucmd, IO_URING_F_COMPLETE_DEFER);
 }
 
 void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
diff --git a/io_uring/waitid.c b/io_uring/waitid.c
index 6f851978606d..791ea752b0b1 100644
--- a/io_uring/waitid.c
+++ b/io_uring/waitid.c
@@ -118,7 +118,7 @@ static int io_waitid_finish(struct io_kiocb *req, int ret)
 static void io_waitid_complete(struct io_kiocb *req, int ret)
 {
 	struct io_waitid *iw = io_kiocb_to_cmd(req, struct io_waitid);
-	struct io_tw_state ts = { .locked = true };
+	struct io_tw_state ts = {};
 
 	/* anyone completing better be holding a reference */
 	WARN_ON_ONCE(!(atomic_read(&iw->refs) & IO_WAITID_REF_MASK));
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 10/11] io_uring: refactor io_fill_cqe_req_aux
  2024-03-15 15:29 [PATCH 00/11] remove aux CQE caches Pavel Begunkov
                   ` (8 preceding siblings ...)
  2024-03-15 15:29 ` [PATCH 09/11] io_uring: remove struct io_tw_state::locked Pavel Begunkov
@ 2024-03-15 15:30 ` Pavel Begunkov
  2024-03-15 15:30 ` [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches Pavel Begunkov
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 15:30 UTC (permalink / raw)
  To: io-uring; +Cc: linux-block, Jens Axboe, asml.silence, Kanchan Joshi, Ming Lei

The restriction on the multishot execution context disallowing io-wq is
driven by the rules of io_fill_cqe_req_aux(): it should only be called
in the master task context, either from the syscall path or in
task_work. Since task_work now always takes the ctx lock, implying
IO_URING_F_COMPLETE_DEFER, we can just assume that the function is
always called with its defer argument set to true.

Kill the argument. Also rename the function for more consistency, as
"fill" in CQE-related functions usually refers to raw interfaces that
only copy data into the CQ, without the locking, user wakeups and other
accounting that "post" functions take care of.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 16 +++-------------
 io_uring/io_uring.h |  2 +-
 io_uring/net.c      |  6 ++----
 io_uring/poll.c     |  3 +--
 io_uring/rw.c       |  4 +---
 io_uring/timeout.c  |  2 +-
 6 files changed, 9 insertions(+), 24 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 1c4bbfc411d1..167a3429a056 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -913,40 +913,30 @@ static void __io_flush_post_cqes(struct io_ring_ctx *ctx)
 	state->cqes_count = 0;
 }
 
-static bool __io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags,
-			      bool allow_overflow)
+bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags)
 {
 	bool filled;
 
 	io_cq_lock(ctx);
 	filled = io_fill_cqe_aux(ctx, user_data, res, cflags);
-	if (!filled && allow_overflow)
+	if (!filled)
 		filled = io_cqring_event_overflow(ctx, user_data, res, cflags, 0, 0);
 
 	io_cq_unlock_post(ctx);
 	return filled;
 }
 
-bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags)
-{
-	return __io_post_aux_cqe(ctx, user_data, res, cflags, true);
-}
-
 /*
  * A helper for multishot requests posting additional CQEs.
  * Should only be used from a task_work including IO_URING_F_MULTISHOT.
  */
-bool io_fill_cqe_req_aux(struct io_kiocb *req, bool defer, s32 res, u32 cflags)
+bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	u64 user_data = req->cqe.user_data;
 	struct io_uring_cqe *cqe;
 
 	lockdep_assert(!io_wq_current_is_worker());
-
-	if (!defer)
-		return __io_post_aux_cqe(ctx, user_data, res, cflags, false);
-
 	lockdep_assert_held(&ctx->uring_lock);
 
 	if (ctx->submit_state.cqes_count == ARRAY_SIZE(ctx->completion_cqes)) {
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 6cad3ef3408b..4bc96470e591 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -67,7 +67,7 @@ int io_run_task_work_sig(struct io_ring_ctx *ctx);
 void io_req_defer_failed(struct io_kiocb *req, s32 res);
 void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags);
 bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
-bool io_fill_cqe_req_aux(struct io_kiocb *req, bool defer, s32 res, u32 cflags);
+bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags);
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
 
 struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages);
diff --git a/io_uring/net.c b/io_uring/net.c
index 19451f0dbf81..b2890eeea6a8 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -699,8 +699,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
 	 * receive from this socket.
 	 */
 	if ((req->flags & REQ_F_APOLL_MULTISHOT) && !mshot_finished &&
-	    io_fill_cqe_req_aux(req, issue_flags & IO_URING_F_COMPLETE_DEFER,
-				*ret, cflags | IORING_CQE_F_MORE)) {
+	    io_req_post_cqe(req, *ret, cflags | IORING_CQE_F_MORE)) {
 		struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
 		int mshot_retry_ret = IOU_ISSUE_SKIP_COMPLETE;
 
@@ -1421,8 +1420,7 @@ int io_accept(struct io_kiocb *req, unsigned int issue_flags)
 
 	if (ret < 0)
 		return ret;
-	if (io_fill_cqe_req_aux(req, issue_flags & IO_URING_F_COMPLETE_DEFER,
-				ret, IORING_CQE_F_MORE))
+	if (io_req_post_cqe(req, ret, IORING_CQE_F_MORE))
 		goto retry;
 
 	io_req_set_res(req, ret, 0);
diff --git a/io_uring/poll.c b/io_uring/poll.c
index 8901dd118e50..5d55bbf1de15 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -322,8 +322,7 @@ static int io_poll_check_events(struct io_kiocb *req, struct io_tw_state *ts)
 			__poll_t mask = mangle_poll(req->cqe.res &
 						    req->apoll_events);
 
-			if (!io_fill_cqe_req_aux(req, true, mask,
-						 IORING_CQE_F_MORE)) {
+			if (!io_req_post_cqe(req, mask, IORING_CQE_F_MORE)) {
 				io_req_set_res(req, mask, 0);
 				return IOU_POLL_REMOVE_POLL_USE_RES;
 			}
diff --git a/io_uring/rw.c b/io_uring/rw.c
index c7f9246ff508..35216e8adc29 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -955,9 +955,7 @@ int io_read_mshot(struct io_kiocb *req, unsigned int issue_flags)
 		cflags = io_put_kbuf(req, issue_flags);
 		rw->len = 0; /* similarly to above, reset len to 0 */
 
-		if (io_fill_cqe_req_aux(req,
-					issue_flags & IO_URING_F_COMPLETE_DEFER,
-					ret, cflags | IORING_CQE_F_MORE)) {
+		if (io_req_post_cqe(req, ret, cflags | IORING_CQE_F_MORE)) {
 			if (issue_flags & IO_URING_F_MULTISHOT) {
 				/*
 				 * Force retry, as we might have more data to
diff --git a/io_uring/timeout.c b/io_uring/timeout.c
index 0a48e6acd0b2..3458ca550b83 100644
--- a/io_uring/timeout.c
+++ b/io_uring/timeout.c
@@ -72,7 +72,7 @@ static void io_timeout_complete(struct io_kiocb *req, struct io_tw_state *ts)
 	struct io_ring_ctx *ctx = req->ctx;
 
 	if (!io_timeout_finish(timeout, data)) {
-		if (io_fill_cqe_req_aux(req, true, -ETIME, IORING_CQE_F_MORE)) {
+		if (io_req_post_cqe(req, -ETIME, IORING_CQE_F_MORE)) {
 			/* re-arm timer */
 			spin_lock_irq(&ctx->timeout_lock);
 			list_add(&timeout->list, ctx->timeout_list.prev);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches
  2024-03-15 15:29 [PATCH 00/11] remove aux CQE caches Pavel Begunkov
                   ` (9 preceding siblings ...)
  2024-03-15 15:30 ` [PATCH 10/11] io_uring: refactor io_fill_cqe_req_aux Pavel Begunkov
@ 2024-03-15 15:30 ` Pavel Begunkov
  2024-03-15 16:20   ` Jens Axboe
  2024-03-15 15:42 ` [PATCH 00/11] remove aux CQE caches Jens Axboe
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 15:30 UTC (permalink / raw)
  To: io-uring; +Cc: linux-block, Jens Axboe, asml.silence, Kanchan Joshi, Ming Lei

io_post_aux_cqe(), which is used for multishot requests, delays
completions by putting CQEs into a temporary array for the purpose of
completion lock/flush batching.

DEFER_TASKRUN doesn't need any locking, so for it we can put completions
directly into the CQ and defer the post-completion handling with a flag.
That leaves !DEFER_TASKRUN, which is not that interesting / hot for
multishot requests, so have conditional locking with a deferred flush
for them.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h |  3 +-
 io_uring/io_uring.c            | 64 ++++++++--------------------------
 2 files changed, 15 insertions(+), 52 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 5a2afbc93887..ea7e5488b3be 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -205,6 +205,7 @@ struct io_submit_state {
 
 	bool			plug_started;
 	bool			need_plug;
+	bool			cq_flush;
 	unsigned short		submit_nr;
 	unsigned int		cqes_count;
 	struct blk_plug		plug;
@@ -342,8 +343,6 @@ struct io_ring_ctx {
 		unsigned		cq_last_tm_flush;
 	} ____cacheline_aligned_in_smp;
 
-	struct io_uring_cqe	completion_cqes[16];
-
 	spinlock_t		completion_lock;
 
 	/* IRQ completion list, under ->completion_lock */
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 167a3429a056..023fcf5d52c1 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -176,7 +176,7 @@ static struct ctl_table kernel_io_uring_disabled_table[] = {
 static inline void io_submit_flush_completions(struct io_ring_ctx *ctx)
 {
 	if (!wq_list_empty(&ctx->submit_state.compl_reqs) ||
-	    ctx->submit_state.cqes_count)
+	    ctx->submit_state.cq_flush)
 		__io_submit_flush_completions(ctx);
 }
 
@@ -636,6 +636,12 @@ static inline void __io_cq_lock(struct io_ring_ctx *ctx)
 		spin_lock(&ctx->completion_lock);
 }
 
+static inline void __io_cq_unlock(struct io_ring_ctx *ctx)
+{
+	if (!ctx->lockless_cq)
+		spin_unlock(&ctx->completion_lock);
+}
+
 static inline void io_cq_lock(struct io_ring_ctx *ctx)
 	__acquires(ctx->completion_lock)
 {
@@ -888,31 +894,6 @@ static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res,
 	return false;
 }
 
-static void __io_flush_post_cqes(struct io_ring_ctx *ctx)
-	__must_hold(&ctx->uring_lock)
-{
-	struct io_submit_state *state = &ctx->submit_state;
-	unsigned int i;
-
-	lockdep_assert_held(&ctx->uring_lock);
-	for (i = 0; i < state->cqes_count; i++) {
-		struct io_uring_cqe *cqe = &ctx->completion_cqes[i];
-
-		if (!io_fill_cqe_aux(ctx, cqe->user_data, cqe->res, cqe->flags)) {
-			if (ctx->lockless_cq) {
-				spin_lock(&ctx->completion_lock);
-				io_cqring_event_overflow(ctx, cqe->user_data,
-							cqe->res, cqe->flags, 0, 0);
-				spin_unlock(&ctx->completion_lock);
-			} else {
-				io_cqring_event_overflow(ctx, cqe->user_data,
-							cqe->res, cqe->flags, 0, 0);
-			}
-		}
-	}
-	state->cqes_count = 0;
-}
-
 bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags)
 {
 	bool filled;
@@ -933,31 +914,16 @@ bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags
 bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
-	u64 user_data = req->cqe.user_data;
-	struct io_uring_cqe *cqe;
+	bool posted;
 
 	lockdep_assert(!io_wq_current_is_worker());
 	lockdep_assert_held(&ctx->uring_lock);
 
-	if (ctx->submit_state.cqes_count == ARRAY_SIZE(ctx->completion_cqes)) {
-		__io_cq_lock(ctx);
-		__io_flush_post_cqes(ctx);
-		/* no need to flush - flush is deferred */
-		__io_cq_unlock_post(ctx);
-	}
-
-	/* For defered completions this is not as strict as it is otherwise,
-	 * however it's main job is to prevent unbounded posted completions,
-	 * and in that it works just as well.
-	 */
-	if (test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq))
-		return false;
-
-	cqe = &ctx->completion_cqes[ctx->submit_state.cqes_count++];
-	cqe->user_data = user_data;
-	cqe->res = res;
-	cqe->flags = cflags;
-	return true;
+	__io_cq_lock(ctx);
+	posted = io_fill_cqe_aux(ctx, req->cqe.user_data, res, cflags);
+	ctx->submit_state.cq_flush = true;
+	__io_cq_unlock_post(ctx);
+	return posted;
 }
 
 static void __io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
@@ -1551,9 +1517,6 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 	struct io_wq_work_node *node;
 
 	__io_cq_lock(ctx);
-	/* must come first to preserve CQE ordering in failure cases */
-	if (state->cqes_count)
-		__io_flush_post_cqes(ctx);
 	__wq_list_for_each(node, &state->compl_reqs) {
 		struct io_kiocb *req = container_of(node, struct io_kiocb,
 					    comp_list);
@@ -1575,6 +1538,7 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 		io_free_batch_list(ctx, state->compl_reqs.first);
 		INIT_WQ_LIST(&state->compl_reqs);
 	}
+	ctx->submit_state.cq_flush = false;
 }
 
 static unsigned io_cqring_events(struct io_ring_ctx *ctx)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 08/11] io_uring: force tw ctx locking
  2024-03-15 15:29 ` [PATCH 08/11] io_uring: force tw ctx locking Pavel Begunkov
@ 2024-03-15 15:40   ` Jens Axboe
  2024-03-15 16:14     ` Pavel Begunkov
  0 siblings, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2024-03-15 15:40 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 9:29 AM, Pavel Begunkov wrote:
> We can run normal task_work without locking the ctx, however we try to
> lock anyway and most handlers prefer or require it locked. It might have
> been interesting to multi-submitter ring with high contention completing
> async read/write requests via task_work, however that will still need to
> go through io_req_complete_post() and potentially take the lock for
> rsrc node putting or some other case.
> 
> In other words, it's hard to care about it, so always force the locking.
> The case described would also end up taking the lock anyway because of
> various io_uring caches.

This is a good idea, I've had that thought myself too. The conditional
aspect of it is annoying, and by far the most interesting use cases will
do the locking anyway.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 00/11] remove aux CQE caches
  2024-03-15 15:29 [PATCH 00/11] remove aux CQE caches Pavel Begunkov
                   ` (10 preceding siblings ...)
  2024-03-15 15:30 ` [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches Pavel Begunkov
@ 2024-03-15 15:42 ` Jens Axboe
  2024-03-15 16:00 ` Jens Axboe
  2024-03-15 22:53 ` (subset) " Jens Axboe
  13 siblings, 0 replies; 54+ messages in thread
From: Jens Axboe @ 2024-03-15 15:42 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 9:29 AM, Pavel Begunkov wrote:
> Patch 1 is a fix.
> 
> Patches 2-7 are cleanups mainly dealing with issue_flags conversions,
> misundertsandings of the flags and of the tw state. It'd be great to have
> even without even w/o the rest.
> 
> 8-11 mandate ctx locking for task_work and finally removes the CQE
> caches, instead we post directly into the CQ. Note that the cache is
> used by multishot auxiliary completions.

I love this series! I'll push patch 1 for 6.9, and then run some testing
with the rest for 6.10.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 00/11] remove aux CQE caches
  2024-03-15 15:29 [PATCH 00/11] remove aux CQE caches Pavel Begunkov
                   ` (11 preceding siblings ...)
  2024-03-15 15:42 ` [PATCH 00/11] remove aux CQE caches Jens Axboe
@ 2024-03-15 16:00 ` Jens Axboe
  2024-03-15 22:53 ` (subset) " Jens Axboe
  13 siblings, 0 replies; 54+ messages in thread
From: Jens Axboe @ 2024-03-15 16:00 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 9:29 AM, Pavel Begunkov wrote:
> The nvme cmd change is tested with io_uring_passthrough.c, however
> it doesn't seem there is anything in liburing exercising ublk paths.
> How do we test it? It'd be great to have at least some basic tests
> for it.

Forgot to comment on this... Last time I tested it, I just pulled the
ublk repo, set up a device, and did some IO. It would indeed be nice to
have some basic ublk tests in liburing, however.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 08/11] io_uring: force tw ctx locking
  2024-03-15 15:40   ` Jens Axboe
@ 2024-03-15 16:14     ` Pavel Begunkov
  0 siblings, 0 replies; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 16:14 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 15:40, Jens Axboe wrote:
> On 3/15/24 9:29 AM, Pavel Begunkov wrote:
>> We can run normal task_work without locking the ctx, however we try to
>> lock anyway and most handlers prefer or require it locked. It might have
>> been interesting to multi-submitter ring with high contention completing
>> async read/write requests via task_work, however that will still need to
>> go through io_req_complete_post() and potentially take the lock for
>> rsrc node putting or some other case.
>>
>> In other words, it's hard to care about it, so always force the locking.
>> The case described would also end up taking the lock anyway because of
>> various io_uring caches.
> 
> This is a good idea, I've had that thought myself too. The conditional
> aspect of it is annoying, and by far the most interesting use cases will
> do the locking anyway.

It floated up around a year ago, and even before that in my head,
but these days it's just completely losing relevance. And the
rules would be simpler: req->task context (syscall & tw) means
it's locked, unlocked for io-wq.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches
  2024-03-15 15:30 ` [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches Pavel Begunkov
@ 2024-03-15 16:20   ` Jens Axboe
  2024-03-15 16:23     ` Pavel Begunkov
  0 siblings, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2024-03-15 16:20 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 9:30 AM, Pavel Begunkov wrote:
> io_post_aux_cqe(), which is used for multishot requests, delays
> completions by putting CQEs into a temporary array for the purpose
> completion lock/flush batching.
> 
> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
> directly into the CQ and defer post completion handling with a flag.
> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
> multishot requests, so have conditional locking with deferred flush
> for them.

This breaks the read-mshot test case, looking into what is going on
there.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches
  2024-03-15 16:20   ` Jens Axboe
@ 2024-03-15 16:23     ` Pavel Begunkov
  2024-03-15 16:25       ` Jens Axboe
  0 siblings, 1 reply; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 16:23 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 16:20, Jens Axboe wrote:
> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>> io_post_aux_cqe(), which is used for multishot requests, delays
>> completions by putting CQEs into a temporary array for the purpose
>> completion lock/flush batching.
>>
>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>> directly into the CQ and defer post completion handling with a flag.
>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>> multishot requests, so have conditional locking with deferred flush
>> for them.
> 
> This breaks the read-mshot test case, looking into what is going on
> there.

I forgot to mention: yes, it does. The test makes odd assumptions about
overflows; IIRC it expects that the kernel allows one and only one aux
CQE to be overflown. Let me double check

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches
  2024-03-15 16:23     ` Pavel Begunkov
@ 2024-03-15 16:25       ` Jens Axboe
  2024-03-15 16:27         ` Jens Axboe
  2024-03-15 16:29         ` Pavel Begunkov
  0 siblings, 2 replies; 54+ messages in thread
From: Jens Axboe @ 2024-03-15 16:25 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 10:23 AM, Pavel Begunkov wrote:
> On 3/15/24 16:20, Jens Axboe wrote:
>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>> completions by putting CQEs into a temporary array for the purpose
>>> completion lock/flush batching.
>>>
>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>> directly into the CQ and defer post completion handling with a flag.
>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>> multishot requests, so have conditional locking with deferred flush
>>> for them.
>>
>> This breaks the read-mshot test case, looking into what is going on
>> there.
> 
> I forgot to mention, yes it does, the test makes odd assumptions about
> overflows, IIRC it expects that the kernel allows one and only one aux
> CQE to be overflown. Let me double check

Yeah this is very possible, the overflow checking could be broken in
there. I'll poke at it and report back.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches
  2024-03-15 16:25       ` Jens Axboe
@ 2024-03-15 16:27         ` Jens Axboe
  2024-03-15 16:44           ` Pavel Begunkov
  2024-03-15 16:29         ` Pavel Begunkov
  1 sibling, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2024-03-15 16:27 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 10:25 AM, Jens Axboe wrote:
> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>> On 3/15/24 16:20, Jens Axboe wrote:
>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>> completions by putting CQEs into a temporary array for the purpose
>>>> completion lock/flush batching.
>>>>
>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>> directly into the CQ and defer post completion handling with a flag.
>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>> multishot requests, so have conditional locking with deferred flush
>>>> for them.
>>>
>>> This breaks the read-mshot test case, looking into what is going on
>>> there.
>>
>> I forgot to mention, yes it does, the test makes odd assumptions about
>> overflows, IIRC it expects that the kernel allows one and only one aux
>> CQE to be overflown. Let me double check
> 
> Yeah this is very possible, the overflow checking could be broken in
> there. I'll poke at it and report back.

It does, this should fix it:


diff --git a/test/read-mshot.c b/test/read-mshot.c
index 8fcb79857bf0..501ca69a98dc 100644
--- a/test/read-mshot.c
+++ b/test/read-mshot.c
@@ -236,7 +236,7 @@ static int test(int first_good, int async, int overflow)
 		}
 		if (!(cqe->flags & IORING_CQE_F_MORE)) {
 			/* we expect this on overflow */
-			if (overflow && (i - 1 == NR_OVERFLOW))
+			if (overflow && i >= NR_OVERFLOW)
 				break;
 			fprintf(stderr, "no more cqes\n");
 			return 1;

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches
  2024-03-15 16:25       ` Jens Axboe
  2024-03-15 16:27         ` Jens Axboe
@ 2024-03-15 16:29         ` Pavel Begunkov
  2024-03-15 16:33           ` Jens Axboe
  1 sibling, 1 reply; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 16:29 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 16:25, Jens Axboe wrote:
> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>> On 3/15/24 16:20, Jens Axboe wrote:
>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>> completions by putting CQEs into a temporary array for the purpose
>>>> completion lock/flush batching.
>>>>
>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>> directly into the CQ and defer post completion handling with a flag.
>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>> multishot requests, so have conditional locking with deferred flush
>>>> for them.
>>>
>>> This breaks the read-mshot test case, looking into what is going on
>>> there.
>>
>> I forgot to mention, yes it does, the test makes odd assumptions about
>> overflows, IIRC it expects that the kernel allows one and only one aux
>> CQE to be overflown. Let me double check
> 
> Yeah this is very possible, the overflow checking could be broken in
> there. I'll poke at it and report back.

test() {
	if (!(cqe->flags & IORING_CQE_F_MORE)) {
		/* we expect this on overflow */
		if (overflow && (i - 1 == NR_OVERFLOW))
			break;
		fprintf(stderr, "no more cqes\n");
		return 1;
	}
	...
}

It's this chunk. I think I silenced it with

s/i - 1 == NR_OVERFLOW/i == NR_OVERFLOW/

but it should probably be i >= NR_OVERFLOW or so

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches
  2024-03-15 16:29         ` Pavel Begunkov
@ 2024-03-15 16:33           ` Jens Axboe
  0 siblings, 0 replies; 54+ messages in thread
From: Jens Axboe @ 2024-03-15 16:33 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 10:29 AM, Pavel Begunkov wrote:
> On 3/15/24 16:25, Jens Axboe wrote:
>> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>>> On 3/15/24 16:20, Jens Axboe wrote:
>>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>>> completions by putting CQEs into a temporary array for the purpose
>>>>> completion lock/flush batching.
>>>>>
>>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>>> directly into the CQ and defer post completion handling with a flag.
>>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>>> multishot requests, so have conditional locking with deferred flush
>>>>> for them.
>>>>
>>>> This breaks the read-mshot test case, looking into what is going on
>>>> there.
>>>
>>> I forgot to mention, yes it does, the test makes odd assumptions about
>>> overflows, IIRC it expects that the kernel allows one and only one aux
>>> CQE to be overflown. Let me double check
>>
>> Yeah this is very possible, the overflow checking could be broken in
>> there. I'll poke at it and report back.
> 
> test() {
>     if (!(cqe->flags & IORING_CQE_F_MORE)) {
>         /* we expect this on overflow */
>         if (overflow && (i - 1 == NR_OVERFLOW))
>             break;
>         fprintf(stderr, "no more cqes\n");
>         return 1;
>     }
>     ...
> }
> 
> It's this chunk. I think I silenced it with
> 
> s/i - 1 == NR_OVERFLOW/i == NR_OVERFLOW/
> 
> but it should probably be i >= NR_OVERFLOW or so

Yeah see other email, I did the latter. It's pushed out.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches
  2024-03-15 16:27         ` Jens Axboe
@ 2024-03-15 16:44           ` Pavel Begunkov
  2024-03-15 16:49             ` Jens Axboe
  0 siblings, 1 reply; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 16:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 16:27, Jens Axboe wrote:
> On 3/15/24 10:25 AM, Jens Axboe wrote:
>> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>>> On 3/15/24 16:20, Jens Axboe wrote:
>>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>>> completions by putting CQEs into a temporary array for the purpose
>>>>> completion lock/flush batching.
>>>>>
>>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>>> directly into the CQ and defer post completion handling with a flag.
>>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>>> multishot requests, so have conditional locking with deferred flush
>>>>> for them.
>>>>
>>>> This breaks the read-mshot test case, looking into what is going on
>>>> there.
>>>
>>> I forgot to mention, yes it does, the test makes odd assumptions about
>>> overflows, IIRC it expects that the kernel allows one and only one aux
>>> CQE to be overflown. Let me double check
>>
>> Yeah this is very possible, the overflow checking could be broken in
>> there. I'll poke at it and report back.
> 
> It does, this should fix it:
> 
> 
> diff --git a/test/read-mshot.c b/test/read-mshot.c
> index 8fcb79857bf0..501ca69a98dc 100644
> --- a/test/read-mshot.c
> +++ b/test/read-mshot.c
> @@ -236,7 +236,7 @@ static int test(int first_good, int async, int overflow)
>   		}
>   		if (!(cqe->flags & IORING_CQE_F_MORE)) {
>   			/* we expect this on overflow */
> -			if (overflow && (i - 1 == NR_OVERFLOW))
> +			if (overflow && i >= NR_OVERFLOW)

Which is not ideal either, e.g. I wouldn't mind if the kernel stops
one entry before the CQ is full, so that the request can complete w/o
overflowing. Not proposing the change because it's a marginal
case, but we shouldn't limit ourselves.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches
  2024-03-15 16:44           ` Pavel Begunkov
@ 2024-03-15 16:49             ` Jens Axboe
  2024-03-15 17:26               ` Pavel Begunkov
  0 siblings, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2024-03-15 16:49 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 10:44 AM, Pavel Begunkov wrote:
> On 3/15/24 16:27, Jens Axboe wrote:
>> On 3/15/24 10:25 AM, Jens Axboe wrote:
>>> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>>>> On 3/15/24 16:20, Jens Axboe wrote:
>>>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>>>> completions by putting CQEs into a temporary array for the purpose
>>>>>> completion lock/flush batching.
>>>>>>
>>>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>>>> directly into the CQ and defer post completion handling with a flag.
>>>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>>>> multishot requests, so have conditional locking with deferred flush
>>>>>> for them.
>>>>>
>>>>> This breaks the read-mshot test case, looking into what is going on
>>>>> there.
>>>>
>>>> I forgot to mention, yes it does, the test makes odd assumptions about
>>>> overflows, IIRC it expects that the kernel allows one and only one aux
>>>> CQE to be overflown. Let me double check
>>>
>>> Yeah this is very possible, the overflow checking could be broken in
>>> there. I'll poke at it and report back.
>>
>> It does, this should fix it:
>>
>>
>> diff --git a/test/read-mshot.c b/test/read-mshot.c
>> index 8fcb79857bf0..501ca69a98dc 100644
>> --- a/test/read-mshot.c
>> +++ b/test/read-mshot.c
>> @@ -236,7 +236,7 @@ static int test(int first_good, int async, int overflow)
>>           }
>>           if (!(cqe->flags & IORING_CQE_F_MORE)) {
>>               /* we expect this on overflow */
>> -            if (overflow && (i - 1 == NR_OVERFLOW))
>> +            if (overflow && i >= NR_OVERFLOW)
> 
> Which is not ideal either, e.g. I wouldn't mind if the kernel stops
> one entry before CQ is full, so that the request can complete w/o
> overflowing. Not supposing the change because it's a marginal
> case, but we shouldn't limit ourselves.

But if the event keeps triggering we have to keep posting CQEs,
otherwise we could get stuck. As far as I'm concerned, the behavior with
the patch looks correct. The last CQE is overflown, and that terminates
it, and it doesn't have MORE set. The one before that has MORE set, but
it has to, unless you aborted it early. But that seems impossible,
because what if that was indeed the last current CQE, and we reap CQEs
before the next one is posted.

So unless I'm missing something, I don't think we can be doing any
better.
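
For completeness, the application side of this is just the usual
CQE_F_MORE handling, something along these lines (rough liburing
sketch; handle_data() and rearm_multishot() are made-up stand-ins for
the app's processing and re-issue code, not liburing calls):

#include <liburing.h>

/* placeholders for the application specific bits */
void handle_data(struct io_uring_cqe *cqe);
void rearm_multishot(struct io_uring *ring);

static void reap_cqes(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;

	while (io_uring_peek_cqe(ring, &cqe) == 0) {
		handle_data(cqe);
		if (!(cqe->flags & IORING_CQE_F_MORE)) {
			/* multishot terminated (e.g. on overflow), re-issue it */
			rearm_multishot(ring);
		}
		io_uring_cqe_seen(ring, cqe);
	}
}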
-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches
  2024-03-15 16:49             ` Jens Axboe
@ 2024-03-15 17:26               ` Pavel Begunkov
  2024-03-15 18:26                 ` Jens Axboe
  0 siblings, 1 reply; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 17:26 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 16:49, Jens Axboe wrote:
> On 3/15/24 10:44 AM, Pavel Begunkov wrote:
>> On 3/15/24 16:27, Jens Axboe wrote:
>>> On 3/15/24 10:25 AM, Jens Axboe wrote:
>>>> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>>>>> On 3/15/24 16:20, Jens Axboe wrote:
>>>>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>>>>> completions by putting CQEs into a temporary array for the purpose
>>>>>>> completion lock/flush batching.
>>>>>>>
>>>>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>>>>> directly into the CQ and defer post completion handling with a flag.
>>>>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>>>>> multishot requests, so have conditional locking with deferred flush
>>>>>>> for them.
>>>>>>
>>>>>> This breaks the read-mshot test case, looking into what is going on
>>>>>> there.
>>>>>
>>>>> I forgot to mention, yes it does, the test makes odd assumptions about
>>>>> overflows, IIRC it expects that the kernel allows one and only one aux
>>>>> CQE to be overflown. Let me double check
>>>>
>>>> Yeah this is very possible, the overflow checking could be broken in
>>>> there. I'll poke at it and report back.
>>>
>>> It does, this should fix it:
>>>
>>>
>>> diff --git a/test/read-mshot.c b/test/read-mshot.c
>>> index 8fcb79857bf0..501ca69a98dc 100644
>>> --- a/test/read-mshot.c
>>> +++ b/test/read-mshot.c
>>> @@ -236,7 +236,7 @@ static int test(int first_good, int async, int overflow)
>>>            }
>>>            if (!(cqe->flags & IORING_CQE_F_MORE)) {
>>>                /* we expect this on overflow */
>>> -            if (overflow && (i - 1 == NR_OVERFLOW))
>>> +            if (overflow && i >= NR_OVERFLOW)
>>
>> Which is not ideal either, e.g. I wouldn't mind if the kernel stops
>> one entry before CQ is full, so that the request can complete w/o
>> overflowing. Not supposing the change because it's a marginal
>> case, but we shouldn't limit ourselves.
> 
> But if the event keeps triggering we have to keep posting CQEs,
> otherwise we could get stuck. 

Or we can complete the request, then the user consumes CQEs
and restarts as usual

> As far as I'm concerned, the behavior with
> the patch looks correct. The last CQE is overflown, and that terminates
> it, and it doesn't have MORE set. The one before that has MORE set, but
> it has to, unless you aborted it early. But that seems impossible,
> because what if that was indeed the last current CQE, and we reap CQEs
> before the next one is posted.
> 
> So unless I'm missing something, I don't think we can be doing any
> better.

You can opportunistically try to avoid overflows, unreliably

bool io_post_cqe() {
	// Not enough space in the CQ left, so if there is a next
	// completion pending we'd have to overflow. Avoid that by
	// terminating it now.
	//
	// If there are no more CQEs after this one, we might
	// terminate a bit earlier, but that's better because
	// overflows are so expensive and unwieldy and so on.
	if (cq_space_left() <= 1)
		return false;
	fill_cqe();
	return true;
}

some_multishot_function(req) {
	if (!io_post_cqe(res))
		complete_req(req, res);
}

Again, not suggesting the change for all the obvious reasons, but
I think semantically we should be able to do it.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches
  2024-03-15 17:26               ` Pavel Begunkov
@ 2024-03-15 18:26                 ` Jens Axboe
  2024-03-15 18:51                   ` Pavel Begunkov
  0 siblings, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2024-03-15 18:26 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 11:26 AM, Pavel Begunkov wrote:
> On 3/15/24 16:49, Jens Axboe wrote:
>> On 3/15/24 10:44 AM, Pavel Begunkov wrote:
>>> On 3/15/24 16:27, Jens Axboe wrote:
>>>> On 3/15/24 10:25 AM, Jens Axboe wrote:
>>>>> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>>>>>> On 3/15/24 16:20, Jens Axboe wrote:
>>>>>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>>>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>>>>>> completions by putting CQEs into a temporary array for the purpose
>>>>>>>> completion lock/flush batching.
>>>>>>>>
>>>>>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>>>>>> directly into the CQ and defer post completion handling with a flag.
>>>>>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>>>>>> multishot requests, so have conditional locking with deferred flush
>>>>>>>> for them.
>>>>>>>
>>>>>>> This breaks the read-mshot test case, looking into what is going on
>>>>>>> there.
>>>>>>
>>>>>> I forgot to mention, yes it does, the test makes odd assumptions about
>>>>>> overflows, IIRC it expects that the kernel allows one and only one aux
>>>>>> CQE to be overflown. Let me double check
>>>>>
>>>>> Yeah this is very possible, the overflow checking could be broken in
>>>>> there. I'll poke at it and report back.
>>>>
>>>> It does, this should fix it:
>>>>
>>>>
>>>> diff --git a/test/read-mshot.c b/test/read-mshot.c
>>>> index 8fcb79857bf0..501ca69a98dc 100644
>>>> --- a/test/read-mshot.c
>>>> +++ b/test/read-mshot.c
>>>> @@ -236,7 +236,7 @@ static int test(int first_good, int async, int overflow)
>>>>            }
>>>>            if (!(cqe->flags & IORING_CQE_F_MORE)) {
>>>>                /* we expect this on overflow */
>>>> -            if (overflow && (i - 1 == NR_OVERFLOW))
>>>> +            if (overflow && i >= NR_OVERFLOW)
>>>
>>> Which is not ideal either, e.g. I wouldn't mind if the kernel stops
>>> one entry before CQ is full, so that the request can complete w/o
>>> overflowing. Not supposing the change because it's a marginal
>>> case, but we shouldn't limit ourselves.
>>
>> But if the event keeps triggering we have to keep posting CQEs,
>> otherwise we could get stuck. 
> 
> Or we can complete the request, then the user consumes CQEs
> and restarts as usual

So you'd want to track if we'd overflow, wait for overflow to clear, and
then restart that request? I think that sounds a bit involved, no?
Particularly for a case like overflow, which generally should not occur.
If it does, just terminate it, and have the user re-issue it. That seems
like the simpler and better solution to me.

>> As far as I'm concerned, the behavior with
>> the patch looks correct. The last CQE is overflown, and that terminates
>> it, and it doesn't have MORE set. The one before that has MORE set, but
>> it has to, unless you aborted it early. But that seems impossible,
>> because what if that was indeed the last current CQE, and we reap CQEs
>> before the next one is posted.
>>
>> So unless I'm missing something, I don't think we can be doing any
>> better.
> 
> You can opportunistically try to avoid overflows, unreliably
> 
> bool io_post_cqe() {
>     // Not enough space in the CQ left, so if there is a next
>     // completion pending we'd have to overflow. Avoid that by
>     // terminating it now.
>     //
>     // If there are no more CQEs after this one, we might
>     // terminate a bit earlier, but that better because
>     // overflows are so expensive and unhandy and so on.
>     if (cq_space_left() <= 1)
>         return false;
>     fill_cqe();
>     return true;
> }
> 
> some_multishot_function(req) {
>     if (!io_post_cqe(res))
>         complete_req(req, res);
> }
> 
> Again, not suggesting the change for all the obvious reasons, but
> I think semantically we should be able to do it.

Yeah not convinced this is worth looking at. If it was the case that the
hot path would often see overflows and it'd help to avoid it, then
probably it'd make sense. But I don't think that's the case.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches
  2024-03-15 18:26                 ` Jens Axboe
@ 2024-03-15 18:51                   ` Pavel Begunkov
  2024-03-15 19:02                     ` Jens Axboe
  0 siblings, 1 reply; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-15 18:51 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 18:26, Jens Axboe wrote:
> On 3/15/24 11:26 AM, Pavel Begunkov wrote:
>> On 3/15/24 16:49, Jens Axboe wrote:
>>> On 3/15/24 10:44 AM, Pavel Begunkov wrote:
>>>> On 3/15/24 16:27, Jens Axboe wrote:
>>>>> On 3/15/24 10:25 AM, Jens Axboe wrote:
>>>>>> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>>>>>>> On 3/15/24 16:20, Jens Axboe wrote:
>>>>>>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>>>>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>>>>>>> completions by putting CQEs into a temporary array for the purpose
>>>>>>>>> completion lock/flush batching.
>>>>>>>>>
>>>>>>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>>>>>>> directly into the CQ and defer post completion handling with a flag.
>>>>>>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>>>>>>> multishot requests, so have conditional locking with deferred flush
>>>>>>>>> for them.
>>>>>>>>
>>>>>>>> This breaks the read-mshot test case, looking into what is going on
>>>>>>>> there.
>>>>>>>
>>>>>>> I forgot to mention, yes it does, the test makes odd assumptions about
>>>>>>> overflows, IIRC it expects that the kernel allows one and only one aux
>>>>>>> CQE to be overflown. Let me double check
>>>>>>
>>>>>> Yeah this is very possible, the overflow checking could be broken in
>>>>>> there. I'll poke at it and report back.
>>>>>
>>>>> It does, this should fix it:
>>>>>
>>>>>
>>>>> diff --git a/test/read-mshot.c b/test/read-mshot.c
>>>>> index 8fcb79857bf0..501ca69a98dc 100644
>>>>> --- a/test/read-mshot.c
>>>>> +++ b/test/read-mshot.c
>>>>> @@ -236,7 +236,7 @@ static int test(int first_good, int async, int overflow)
>>>>>             }
>>>>>             if (!(cqe->flags & IORING_CQE_F_MORE)) {
>>>>>                 /* we expect this on overflow */
>>>>> -            if (overflow && (i - 1 == NR_OVERFLOW))
>>>>> +            if (overflow && i >= NR_OVERFLOW)
>>>>
>>>> Which is not ideal either, e.g. I wouldn't mind if the kernel stops
>>>> one entry before CQ is full, so that the request can complete w/o
>>>> overflowing. Not supposing the change because it's a marginal
>>>> case, but we shouldn't limit ourselves.
>>>
>>> But if the event keeps triggering we have to keep posting CQEs,
>>> otherwise we could get stuck.
>>
>> Or we can complete the request, then the user consumes CQEs
>> and restarts as usual
> 
> So you'd want to track if we'd overflow, wait for overflow to clear, and
> then restart that request?

No, the 2 line change in io_post_cqe() from the last email's
snippet is the only thing you'd need.

I probably don't understand why and what tracking you mean, but
fwiw we currently do track and account for overflows.


/* For defered completions this is not as strict as it is otherwise,
  * however it's main job is to prevent unbounded posted completions,
  * and in that it works just as well.
  */
if (test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq))
	return false;


which is being killed in the series.

> I think that sounds a bit involved, no?
> Particularly for a case like overflow, which generally should not occur.
> If it does, just terminate it, and have the user re-issue it. That seems
> like the simpler and better solution to me.
> 
>>> As far as I'm concerned, the behavior with
>>> the patch looks correct. The last CQE is overflown, and that terminates
>>> it, and it doesn't have MORE set. The one before that has MORE set, but
>>> it has to, unless you aborted it early. But that seems impossible,
>>> because what if that was indeed the last current CQE, and we reap CQEs
>>> before the next one is posted.
>>>
>>> So unless I'm missing something, I don't think we can be doing any
>>> better.
>>
>> You can opportunistically try to avoid overflows, unreliably
>>
>> bool io_post_cqe() {
>>      // Not enough space in the CQ left, so if there is a next
>>      // completion pending we'd have to overflow. Avoid that by
>>      // terminating it now.
>>      //
>>      // If there are no more CQEs after this one, we might
>>      // terminate a bit earlier, but that better because
>>      // overflows are so expensive and unhandy and so on.
>>      if (cq_space_left() <= 1)
>>          return false;
>>      fill_cqe();
>>      return true;
>> }
>>
>> some_multishot_function(req) {
>>      if (!io_post_cqe(res))
>>          complete_req(req, res);
>> }
>>
>> Again, not suggesting the change for all the obvious reasons, but
>> I think semantically we should be able to do it.
> 
> Yeah not convinced this is worth looking at. If it was the case that the
> hot path would often see overflows and it'd help to avoid it, then
> probably it'd make sense. But I don't think that's the case.

We're talking about different things. Seems you're discussing a
particular implementation, its constraints and performance. I care
purely about the semantics, the implicit uapi. And I define it as
"multishot requests may decide to terminate at any point, the user
should expect it and reissue when appropriate", not restricting it
to "can only (normally) terminate when CQ is full".

We're changing tests from time to time, but there is the problem that
"behaviour defines semantics", especially when it wasn't clear
in advance and breaks someone's app, and people might be taking the
assumptions in the tests as the universal truth.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches
  2024-03-15 18:51                   ` Pavel Begunkov
@ 2024-03-15 19:02                     ` Jens Axboe
  0 siblings, 0 replies; 54+ messages in thread
From: Jens Axboe @ 2024-03-15 19:02 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-block, Kanchan Joshi, Ming Lei

On 3/15/24 12:51 PM, Pavel Begunkov wrote:
> On 3/15/24 18:26, Jens Axboe wrote:
>> On 3/15/24 11:26 AM, Pavel Begunkov wrote:
>>> On 3/15/24 16:49, Jens Axboe wrote:
>>>> On 3/15/24 10:44 AM, Pavel Begunkov wrote:
>>>>> On 3/15/24 16:27, Jens Axboe wrote:
>>>>>> On 3/15/24 10:25 AM, Jens Axboe wrote:
>>>>>>> On 3/15/24 10:23 AM, Pavel Begunkov wrote:
>>>>>>>> On 3/15/24 16:20, Jens Axboe wrote:
>>>>>>>>> On 3/15/24 9:30 AM, Pavel Begunkov wrote:
>>>>>>>>>> io_post_aux_cqe(), which is used for multishot requests, delays
>>>>>>>>>> completions by putting CQEs into a temporary array for the purpose
>>>>>>>>>> completion lock/flush batching.
>>>>>>>>>>
>>>>>>>>>> DEFER_TASKRUN doesn't need any locking, so for it we can put completions
>>>>>>>>>> directly into the CQ and defer post completion handling with a flag.
>>>>>>>>>> That leaves !DEFER_TASKRUN, which is not that interesting / hot for
>>>>>>>>>> multishot requests, so have conditional locking with deferred flush
>>>>>>>>>> for them.
>>>>>>>>>
>>>>>>>>> This breaks the read-mshot test case, looking into what is going on
>>>>>>>>> there.
>>>>>>>>
>>>>>>>> I forgot to mention, yes it does, the test makes odd assumptions about
>>>>>>>> overflows, IIRC it expects that the kernel allows one and only one aux
>>>>>>>> CQE to be overflown. Let me double check
>>>>>>>
>>>>>>> Yeah this is very possible, the overflow checking could be broken in
>>>>>>> there. I'll poke at it and report back.
>>>>>>
>>>>>> It does, this should fix it:
>>>>>>
>>>>>>
>>>>>> diff --git a/test/read-mshot.c b/test/read-mshot.c
>>>>>> index 8fcb79857bf0..501ca69a98dc 100644
>>>>>> --- a/test/read-mshot.c
>>>>>> +++ b/test/read-mshot.c
>>>>>> @@ -236,7 +236,7 @@ static int test(int first_good, int async, int overflow)
>>>>>>             }
>>>>>>             if (!(cqe->flags & IORING_CQE_F_MORE)) {
>>>>>>                 /* we expect this on overflow */
>>>>>> -            if (overflow && (i - 1 == NR_OVERFLOW))
>>>>>> +            if (overflow && i >= NR_OVERFLOW)
>>>>>
>>>>> Which is not ideal either, e.g. I wouldn't mind if the kernel stops
>>>>> one entry before CQ is full, so that the request can complete w/o
>>>>> overflowing. Not supposing the change because it's a marginal
>>>>> case, but we shouldn't limit ourselves.
>>>>
>>>> But if the event keeps triggering we have to keep posting CQEs,
>>>> otherwise we could get stuck.
>>>
>>> Or we can complete the request, then the user consumes CQEs
>>> and restarts as usual
>>
>> So you'd want to track if we'd overflow, wait for overflow to clear, and
>> then restart that request?
> 
> No, the 2 line change in io_post_cqe() from the last email's
> snippet is the only thing you'd need.

Ah now I follow, so you're still terminating it, just one before
overflow rather than letting it overflow. Yes I agree that makes more
sense! It could still terminate early though, if the application reaps
CQEs before another one is posted. So I think the opportunistic early
termination is probably best ignored.

Or the app could always be using the full size of the CQ ring, and never
go above it. With the change, it'd still terminate after every CQ-ring-size
worth of requests, even though it need not.

IOW, I think we just leave it as-is, no? Neither of these should cause
an app issue, as CQE_F_MORE is the one driving terminations. But you
could have more unneeded terminations, even if that may be far fetched.
Though you never know, and opportunistically terminating with a known
racy check seems like something we should not do.

> I probably don't understand why and what tracking you mean, but
> fwiw we currently do track and account for overflows.

Misunderstanding, I thought you'd still post with CQE_F_MORE and need to
restart it. But that wasn't the case.

>> I think that sounds a bit involved, no?
>> Particularly for a case like overflow, which generally should not occur.
>> If it does, just terminate it, and have the user re-issue it. That seems
>> like the simpler and better solution to me.
>>
>>>> As far as I'm concerned, the behavior with
>>>> the patch looks correct. The last CQE is overflown, and that terminates
>>>> it, and it doesn't have MORE set. The one before that has MORE set, but
>>>> it has to, unless you aborted it early. But that seems impossible,
>>>> because what if that was indeed the last current CQE, and we reap CQEs
>>>> before the next one is posted.
>>>>
>>>> So unless I'm missing something, I don't think we can be doing any
>>>> better.
>>>
>>> You can opportunistically try to avoid overflows, unreliably
>>>
>>> bool io_post_cqe() {
>>>      // Not enough space in the CQ left, so if there is a next
>>>      // completion pending we'd have to overflow. Avoid that by
>>>      // terminating it now.
>>>      //
>>>      // If there are no more CQEs after this one, we might
>>>      // terminate a bit earlier, but that better because
>>>      // overflows are so expensive and unhandy and so on.
>>>      if (cq_space_left() <= 1)
>>>          return false;
>>>      fill_cqe();
>>>      return true;
>>> }
>>>
>>> some_multishot_function(req) {
>>>      if (!io_post_cqe(res))
>>>          complete_req(req, res);
>>> }
>>>
>>> Again, not suggesting the change for all the obvious reasons, but
>>> I think semantically we should be able to do it.
>>
>> Yeah not convinced this is worth looking at. If it was the case that the
>> hot path would often see overflows and it'd help to avoid it, then
>> probably it'd make sense. But I don't think that's the case.
> 
> We're talking about different things. Seems you're discussing a
> particular implementation, its constraints and performance. I care
> purely about the semantics, the implicit uapi. And I define it as
> "multishot requests may decide to terminate at any point, the user
> should expect it and reissue when appropriate", not restricting it
> to "can only (normally) terminate when CQ is full".

Yep, fully agree, I think it was largely us talking past each other on
exactly what it'd do.

> We're changing tests from time to time, but the there is that
> "behaviour defines semantics", especially when it wasn't clear
> in advance and breaks someone's app, and people might be using
> assumptions in tests as the universal truth.

Agree, any time a test needs changing, it should be cause for extra
thinking in terms of whether this will have application impacts as well.
In general, the tests are overly anal, and sometimes they do end up
testing implementation details. The API is pretty clear in this regard:
if you see CQE_F_MORE, then you will get more completions; if you don't,
the request has terminated. The change doesn't really impact that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-15 15:29 [PATCH 00/11] remove aux CQE caches Pavel Begunkov
                   ` (12 preceding siblings ...)
  2024-03-15 16:00 ` Jens Axboe
@ 2024-03-15 22:53 ` Jens Axboe
  2024-03-16  2:03   ` Ming Lei
  2024-03-16 11:52   ` Ming Lei
  13 siblings, 2 replies; 54+ messages in thread
From: Jens Axboe @ 2024-03-15 22:53 UTC (permalink / raw)
  To: io-uring, Pavel Begunkov; +Cc: linux-block, Kanchan Joshi, Ming Lei


On Fri, 15 Mar 2024 15:29:50 +0000, Pavel Begunkov wrote:
> Patch 1 is a fix.
> 
> Patches 2-7 are cleanups mainly dealing with issue_flags conversions,
> misundertsandings of the flags and of the tw state. It'd be great to have
> even without even w/o the rest.
> 
> 8-11 mandate ctx locking for task_work and finally removes the CQE
> caches, instead we post directly into the CQ. Note that the cache is
> used by multishot auxiliary completions.
> 
> [...]

Applied, thanks!

[02/11] io_uring/cmd: kill one issue_flags to tw conversion
        commit: 31ab0342cf6434e1e2879d12f0526830ce97365d
[03/11] io_uring/cmd: fix tw <-> issue_flags conversion
        commit: b48f3e29b89055894b3f50c657658c325b5b49fd
[04/11] io_uring/cmd: introduce io_uring_cmd_complete
        commit: c5b4c92ca69215c0af17e4e9d8c84c8942f3257d
[05/11] ublk: don't hard code IO_URING_F_UNLOCKED
        commit: c54cfb81fe1774231fca952eff928389bfc3b2e3
[06/11] nvme/io_uring: don't hard code IO_URING_F_UNLOCKED
        commit: 800a90681f3c3383660a8e3e2d279e0f056afaee
[07/11] io_uring/rw: avoid punting to io-wq directly
        commit: 56d565d54373c17b7620fc605c899c41968e48d0
[08/11] io_uring: force tw ctx locking
        commit: f087cdd065af0418ffc8a9ed39eadc93347efdd5
[09/11] io_uring: remove struct io_tw_state::locked
        commit: 339f8d66e996ec52b47221448ff4b3534cc9a58d
[10/11] io_uring: refactor io_fill_cqe_req_aux
        commit: 7b31c3964b769a6a16c4e414baa8094b441e498e
[11/11] io_uring: get rid of intermediate aux cqe caches
        commit: 5a475a1f47412a44ed184aac04b9ff0aeaa31d65

Best regards,
-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-15 22:53 ` (subset) " Jens Axboe
@ 2024-03-16  2:03   ` Ming Lei
  2024-03-16  2:24     ` Ming Lei
  2024-03-16 11:52   ` Ming Lei
  1 sibling, 1 reply; 54+ messages in thread
From: Ming Lei @ 2024-03-16  2:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, linux-block, Kanchan Joshi

On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
> 
> On Fri, 15 Mar 2024 15:29:50 +0000, Pavel Begunkov wrote:
> > Patch 1 is a fix.
> > 
> > Patches 2-7 are cleanups mainly dealing with issue_flags conversions,
> > misundertsandings of the flags and of the tw state. It'd be great to have
> > even without even w/o the rest.
> > 
> > 8-11 mandate ctx locking for task_work and finally removes the CQE
> > caches, instead we post directly into the CQ. Note that the cache is
> > used by multishot auxiliary completions.
> > 
> > [...]
> 
> Applied, thanks!

Hi Jens and Pavel,

Looks like this patch causes a hang when running './check ublk/002' in blktests.

Steps:

1) cargo install rublk

2) cd blktests

3) ./check ublk/002



Thanks,
Ming


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-16  2:03   ` Ming Lei
@ 2024-03-16  2:24     ` Ming Lei
  2024-03-16  2:54       ` Pavel Begunkov
  0 siblings, 1 reply; 54+ messages in thread
From: Ming Lei @ 2024-03-16  2:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, linux-block, Kanchan Joshi

On Sat, Mar 16, 2024 at 10:04 AM Ming Lei <[email protected]> wrote:
>
> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
> >
> > On Fri, 15 Mar 2024 15:29:50 +0000, Pavel Begunkov wrote:
> > > Patch 1 is a fix.
> > >
> > > Patches 2-7 are cleanups mainly dealing with issue_flags conversions,
> > > misundertsandings of the flags and of the tw state. It'd be great to have
> > > even without even w/o the rest.
> > >
> > > 8-11 mandate ctx locking for task_work and finally removes the CQE
> > > caches, instead we post directly into the CQ. Note that the cache is
> > > used by multishot auxiliary completions.
> > >
> > > [...]
> >
> > Applied, thanks!
>
> Hi Jens and Pavel,
>
> Looks this patch causes hang when running './check ublk/002' in blktests.

I haven't taken a close look, but I guess it hangs in

io_uring_cmd_del_cancelable() -> io_ring_submit_lock

[root@ktest-36 ~]# cat /proc/1420/stack
[<0>] io_uring_cmd_done+0x161/0x1c0
[<0>] ublk_stop_dev+0x10e/0x1b0 [ublk_drv]
[<0>] ublk_ctrl_uring_cmd+0xbc9/0x11e0 [ublk_drv]
[<0>] io_uring_cmd+0x9e/0x130
[<0>] io_issue_sqe+0x2d3/0x730
[<0>] io_wq_submit_work+0xd2/0x350
[<0>] io_worker_handle_work+0x12a/0x4b0
[<0>] io_wq_worker+0x101/0x390
[<0>] ret_from_fork+0x31/0x50
[<0>] ret_from_fork_asm+0x1a/0x30

(gdb) l *(io_uring_cmd_done+0x161)
0xffffffff817ed241 is in io_uring_cmd_done (./include/linux/list.h:985).
980		return !READ_ONCE(h->first);
981	}
982
983	static inline void __hlist_del(struct hlist_node *n)
984	{
985		struct hlist_node *next = n->next;
986		struct hlist_node **pprev = n->pprev;
987
988		WRITE_ONCE(*pprev, next);
989		if (next)


Thanks,


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-16  2:24     ` Ming Lei
@ 2024-03-16  2:54       ` Pavel Begunkov
  2024-03-16  3:54         ` Ming Lei
  0 siblings, 1 reply; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-16  2:54 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: io-uring, linux-block, Kanchan Joshi

On 3/16/24 02:24, Ming Lei wrote:
> On Sat, Mar 16, 2024 at 10:04 AM Ming Lei <[email protected]> wrote:
>>
>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>>
>>> On Fri, 15 Mar 2024 15:29:50 +0000, Pavel Begunkov wrote:
>>>> Patch 1 is a fix.
>>>>
>>>> Patches 2-7 are cleanups mainly dealing with issue_flags conversions,
>>>> misundertsandings of the flags and of the tw state. It'd be great to have
>>>> even without even w/o the rest.
>>>>
>>>> 8-11 mandate ctx locking for task_work and finally removes the CQE
>>>> caches, instead we post directly into the CQ. Note that the cache is
>>>> used by multishot auxiliary completions.
>>>>
>>>> [...]
>>>
>>> Applied, thanks!
>>
>> Hi Jens and Pavel,
>>
>> Looks this patch causes hang when running './check ublk/002' in blktests.
> 
> Not take close look, and  I guess it hangs in
> 
> io_uring_cmd_del_cancelable() -> io_ring_submit_lock

Thanks, the trace doesn't completely explain it, but my blind spot
was io_uring_cmd_done() potentially grabbing the mutex. They're
supposed to be irq-safe, mimicking io_req_task_work_add(); that's how
nvme passthrough uses it as well (but at least it doesn't need the
cancellation bits).

One option is to replace it with a spinlock, the other is to delay
the io_uring_cmd_del_cancelable() call to the task_work callback.
The latter would be cleaner and more preferable, but I'm lacking
context to tell if that would be correct. Ming, what do you think?


-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-16  2:54       ` Pavel Begunkov
@ 2024-03-16  3:54         ` Ming Lei
  2024-03-16  4:13           ` Pavel Begunkov
  0 siblings, 1 reply; 54+ messages in thread
From: Ming Lei @ 2024-03-16  3:54 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-block, Kanchan Joshi, ming.lei

On Sat, Mar 16, 2024 at 02:54:19AM +0000, Pavel Begunkov wrote:
> On 3/16/24 02:24, Ming Lei wrote:
> > On Sat, Mar 16, 2024 at 10:04 AM Ming Lei <[email protected]> wrote:
> > > 
> > > On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
> > > > 
> > > > On Fri, 15 Mar 2024 15:29:50 +0000, Pavel Begunkov wrote:
> > > > > Patch 1 is a fix.
> > > > > 
> > > > > Patches 2-7 are cleanups mainly dealing with issue_flags conversions,
> > > > > misundertsandings of the flags and of the tw state. It'd be great to have
> > > > > even without even w/o the rest.
> > > > > 
> > > > > 8-11 mandate ctx locking for task_work and finally removes the CQE
> > > > > caches, instead we post directly into the CQ. Note that the cache is
> > > > > used by multishot auxiliary completions.
> > > > > 
> > > > > [...]
> > > > 
> > > > Applied, thanks!
> > > 
> > > Hi Jens and Pavel,
> > > 
> > > Looks this patch causes hang when running './check ublk/002' in blktests.
> > 
> > Not take close look, and  I guess it hangs in
> > 
> > io_uring_cmd_del_cancelable() -> io_ring_submit_lock
> 
> Thanks, the trace doesn't completely explains it, but my blind spot
> was io_uring_cmd_done() potentially grabbing the mutex. They're
> supposed to be irq safe mimicking io_req_task_work_add(), that's how
> nvme passthrough uses it as well (but at least it doesn't need the
> cancellation bits).
> 
> One option is to replace it with a spinlock, the other is to delay
> the io_uring_cmd_del_cancelable() call to the task_work callback.
> The latter would be cleaner and more preferable, but I'm lacking
> context to tell if that would be correct. Ming, what do you think?

I prefer the latter approach because the two cancelable helpers run
in the fast path.

Looks like all the new io_uring_cmd_complete() calls in ublk have this
issue, and the following patch should avoid them all.

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 97dceecadab2..1f54da0e655c 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1417,6 +1417,12 @@ static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq)
 	return true;
 }
 
+static void ublk_cancel_cmd_cb(struct io_uring_cmd *cmd,
+		unsigned int issue_flags)
+{
+	io_uring_cmd_done(cmd, UBLK_IO_RES_ABORT, 0, issue_flags);
+}
+
 static void ublk_cancel_cmd(struct ublk_queue *ubq, struct ublk_io *io)
 {
 	bool done;
@@ -1431,7 +1437,7 @@ static void ublk_cancel_cmd(struct ublk_queue *ubq, struct ublk_io *io)
 	spin_unlock(&ubq->cancel_lock);
 
 	if (!done)
-		io_uring_cmd_complete(io->cmd, UBLK_IO_RES_ABORT, 0);
+		io_uring_cmd_complete_in_task(io->cmd, ublk_cancel_cmd_cb);
 }
 
 /*
@@ -1775,10 +1781,9 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 	return -EIOCBQUEUED;
 
  out:
-	io_uring_cmd_complete(cmd, ret, 0);
 	pr_devel("%s: complete: cmd op %d, tag %d ret %x io_flags %x\n",
 			__func__, cmd_op, tag, ret, io->flags);
-	return -EIOCBQUEUED;
+	return ret;
 }
 
 static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
@@ -2928,10 +2933,9 @@ static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
 	if (ub)
 		ublk_put_device(ub);
  out:
-	io_uring_cmd_complete(cmd, ret, 0);
 	pr_devel("%s: cmd done ret %d cmd_op %x, dev id %d qid %d\n",
 			__func__, ret, cmd->cmd_op, header->dev_id, header->queue_id);
-	return -EIOCBQUEUED;
+	return ret;
 }
 
 static const struct file_operations ublk_ctl_fops = {



Thanks,
Ming


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-16  3:54         ` Ming Lei
@ 2024-03-16  4:13           ` Pavel Begunkov
  2024-03-16  4:20             ` Pavel Begunkov
  0 siblings, 1 reply; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-16  4:13 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, io-uring, linux-block, Kanchan Joshi

On 3/16/24 03:54, Ming Lei wrote:
> On Sat, Mar 16, 2024 at 02:54:19AM +0000, Pavel Begunkov wrote:
>> On 3/16/24 02:24, Ming Lei wrote:
>>> On Sat, Mar 16, 2024 at 10:04 AM Ming Lei <[email protected]> wrote:
>>>>
>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>>>>
>>>>> On Fri, 15 Mar 2024 15:29:50 +0000, Pavel Begunkov wrote:
>>>>>> Patch 1 is a fix.
>>>>>>
>>>>>> Patches 2-7 are cleanups mainly dealing with issue_flags conversions,
>>>>>> misundertsandings of the flags and of the tw state. It'd be great to have
>>>>>> even without even w/o the rest.
>>>>>>
>>>>>> 8-11 mandate ctx locking for task_work and finally removes the CQE
>>>>>> caches, instead we post directly into the CQ. Note that the cache is
>>>>>> used by multishot auxiliary completions.
>>>>>>
>>>>>> [...]
>>>>>
>>>>> Applied, thanks!
>>>>
>>>> Hi Jens and Pavel,
>>>>
>>>> Looks this patch causes hang when running './check ublk/002' in blktests.
>>>
>>> Not take close look, and  I guess it hangs in
>>>
>>> io_uring_cmd_del_cancelable() -> io_ring_submit_lock
>>
>> Thanks, the trace doesn't completely explains it, but my blind spot
>> was io_uring_cmd_done() potentially grabbing the mutex. They're
>> supposed to be irq safe mimicking io_req_task_work_add(), that's how
>> nvme passthrough uses it as well (but at least it doesn't need the
>> cancellation bits).
>>
>> One option is to replace it with a spinlock, the other is to delay
>> the io_uring_cmd_del_cancelable() call to the task_work callback.
>> The latter would be cleaner and more preferable, but I'm lacking
>> context to tell if that would be correct. Ming, what do you think?
> 
> I prefer to the latter approach because the two cancelable helpers are
> run in fast path.
> 
> Looks all new io_uring_cmd_complete() in ublk have this issue, and the
> following patch should avoid them all.

The one I have in mind on top of the current tree would look like the
below. Untested, and it doesn't allow this cancellation thing for iopoll.
I'll prepare something tomorrow.


diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index e45d4cd5ef82..000ba435451c 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -14,19 +14,15 @@
  #include "rsrc.h"
  #include "uring_cmd.h"
  
-static void io_uring_cmd_del_cancelable(struct io_uring_cmd *cmd,
-		unsigned int issue_flags)
+static void io_uring_cmd_del_cancelable(struct io_uring_cmd *cmd)
  {
  	struct io_kiocb *req = cmd_to_io_kiocb(cmd);
-	struct io_ring_ctx *ctx = req->ctx;
  
  	if (!(cmd->flags & IORING_URING_CMD_CANCELABLE))
  		return;
  
  	cmd->flags &= ~IORING_URING_CMD_CANCELABLE;
-	io_ring_submit_lock(ctx, issue_flags);
  	hlist_del(&req->hash_node);
-	io_ring_submit_unlock(ctx, issue_flags);
  }
  
  /*
@@ -80,6 +76,15 @@ static inline void io_req_set_cqe32_extra(struct io_kiocb *req,
  	req->big_cqe.extra2 = extra2;
  }
  
+static void io_req_task_cmd_complete(struct io_kiocb *req,
+				     struct io_tw_state *ts)
+{
+	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+
+	io_uring_cmd_del_cancelable(ioucmd);
+	io_req_task_complete(req, ts);
+}
+
  /*
   * Called by consumers of io_uring_cmd, if they originally returned
   * -EIOCBQUEUED upon receiving the command.
@@ -89,8 +94,6 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2,
  {
  	struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
  
-	io_uring_cmd_del_cancelable(ioucmd, issue_flags);
-
  	if (ret < 0)
  		req_set_fail(req);
  
@@ -105,7 +108,7 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2,
  			return;
  		io_req_complete_defer(req);
  	} else {
-		req->io_task_work.func = io_req_task_complete;
+		req->io_task_work.func = io_req_task_cmd_complete;
  		io_req_task_work_add(req);
  	}
  }

-- 
Pavel Begunkov

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-16  4:13           ` Pavel Begunkov
@ 2024-03-16  4:20             ` Pavel Begunkov
  2024-03-16  9:53               ` Ming Lei
  0 siblings, 1 reply; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-16  4:20 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, io-uring, linux-block, Kanchan Joshi

On 3/16/24 04:13, Pavel Begunkov wrote:
> On 3/16/24 03:54, Ming Lei wrote:
>> On Sat, Mar 16, 2024 at 02:54:19AM +0000, Pavel Begunkov wrote:
>>> On 3/16/24 02:24, Ming Lei wrote:
>>>> On Sat, Mar 16, 2024 at 10:04 AM Ming Lei <[email protected]> wrote:
>>>>>
>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>>>>>
>>>>>> On Fri, 15 Mar 2024 15:29:50 +0000, Pavel Begunkov wrote:
>>>>>>> Patch 1 is a fix.
>>>>>>>
>>>>>>> Patches 2-7 are cleanups mainly dealing with issue_flags conversions,
>>>>>>> misundertsandings of the flags and of the tw state. It'd be great to have
>>>>>>> even without even w/o the rest.
>>>>>>>
>>>>>>> 8-11 mandate ctx locking for task_work and finally removes the CQE
>>>>>>> caches, instead we post directly into the CQ. Note that the cache is
>>>>>>> used by multishot auxiliary completions.
>>>>>>>
>>>>>>> [...]
>>>>>>
>>>>>> Applied, thanks!
>>>>>
>>>>> Hi Jens and Pavel,
>>>>>
>>>>> Looks this patch causes hang when running './check ublk/002' in blktests.
>>>>
>>>> Not take close look, and  I guess it hangs in
>>>>
>>>> io_uring_cmd_del_cancelable() -> io_ring_submit_lock
>>>
>>> Thanks, the trace doesn't completely explains it, but my blind spot
>>> was io_uring_cmd_done() potentially grabbing the mutex. They're
>>> supposed to be irq safe mimicking io_req_task_work_add(), that's how
>>> nvme passthrough uses it as well (but at least it doesn't need the
>>> cancellation bits).
>>>
>>> One option is to replace it with a spinlock, the other is to delay
>>> the io_uring_cmd_del_cancelable() call to the task_work callback.
>>> The latter would be cleaner and more preferable, but I'm lacking
>>> context to tell if that would be correct. Ming, what do you think?
>>
>> I prefer to the latter approach because the two cancelable helpers are
>> run in fast path.
>>
>> Looks all new io_uring_cmd_complete() in ublk have this issue, and the
>> following patch should avoid them all.
> 
> The one I have in mind on top of the current tree would be like below.
> Untested, and doesn't allow this cancellation thing for iopoll. I'll
> prepare something tomorrow.
> 
> 
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index e45d4cd5ef82..000ba435451c 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -14,19 +14,15 @@
>   #include "rsrc.h"
>   #include "uring_cmd.h"
> 
> -static void io_uring_cmd_del_cancelable(struct io_uring_cmd *cmd,
> -        unsigned int issue_flags)
> +static void io_uring_cmd_del_cancelable(struct io_uring_cmd *cmd)
>   {
>       struct io_kiocb *req = cmd_to_io_kiocb(cmd);
> -    struct io_ring_ctx *ctx = req->ctx;
> 
>       if (!(cmd->flags & IORING_URING_CMD_CANCELABLE))
>           return;
> 
>       cmd->flags &= ~IORING_URING_CMD_CANCELABLE;
> -    io_ring_submit_lock(ctx, issue_flags);
>       hlist_del(&req->hash_node);
> -    io_ring_submit_unlock(ctx, issue_flags);
>   }
> 
>   /*
> @@ -80,6 +76,15 @@ static inline void io_req_set_cqe32_extra(struct io_kiocb *req,
>       req->big_cqe.extra2 = extra2;
>   }
> 
> +static void io_req_task_cmd_complete(struct io_kiocb *req,
> +                     struct io_tw_state *ts)
> +{
> +    struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> +
> +    io_uring_cmd_del_cancelable(ioucmd);
> +    io_req_task_complete(req, ts);
> +}
> +
>   /*
>    * Called by consumers of io_uring_cmd, if they originally returned
>    * -EIOCBQUEUED upon receiving the command.
> @@ -89,8 +94,6 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2,
>   {
>       struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
> 
> -    io_uring_cmd_del_cancelable(ioucmd, issue_flags);
> -
>       if (ret < 0)
>           req_set_fail(req);
> 
> @@ -105,7 +108,7 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2,
>               return;

Not very well thought through... There should be a *del_cancelable call
here as well

>           io_req_complete_defer(req);
>       } else {
> -        req->io_task_work.func = io_req_task_complete;
> +        req->io_task_work.func = io_req_task_cmd_complete;
>           io_req_task_work_add(req);
>       }
>   }
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-16  4:20             ` Pavel Begunkov
@ 2024-03-16  9:53               ` Ming Lei
  0 siblings, 0 replies; 54+ messages in thread
From: Ming Lei @ 2024-03-16  9:53 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-block, Kanchan Joshi

On Sat, Mar 16, 2024 at 04:20:25AM +0000, Pavel Begunkov wrote:
> On 3/16/24 04:13, Pavel Begunkov wrote:
> > On 3/16/24 03:54, Ming Lei wrote:
> > > On Sat, Mar 16, 2024 at 02:54:19AM +0000, Pavel Begunkov wrote:
> > > > On 3/16/24 02:24, Ming Lei wrote:
> > > > > On Sat, Mar 16, 2024 at 10:04 AM Ming Lei <[email protected]> wrote:
> > > > > > 
> > > > > > On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
> > > > > > > 
> > > > > > > On Fri, 15 Mar 2024 15:29:50 +0000, Pavel Begunkov wrote:
> > > > > > > > Patch 1 is a fix.
> > > > > > > > 
> > > > > > > > Patches 2-7 are cleanups mainly dealing with issue_flags conversions,
> > > > > > > > misundertsandings of the flags and of the tw state. It'd be great to have
> > > > > > > > even without even w/o the rest.
> > > > > > > > 
> > > > > > > > 8-11 mandate ctx locking for task_work and finally removes the CQE
> > > > > > > > caches, instead we post directly into the CQ. Note that the cache is
> > > > > > > > used by multishot auxiliary completions.
> > > > > > > > 
> > > > > > > > [...]
> > > > > > > 
> > > > > > > Applied, thanks!
> > > > > > 
> > > > > > Hi Jens and Pavel,
> > > > > > 
> > > > > > Looks this patch causes hang when running './check ublk/002' in blktests.
> > > > > 
> > > > > Not take close look, and  I guess it hangs in
> > > > > 
> > > > > io_uring_cmd_del_cancelable() -> io_ring_submit_lock
> > > > 
> > > > Thanks, the trace doesn't completely explains it, but my blind spot
> > > > was io_uring_cmd_done() potentially grabbing the mutex. They're
> > > > supposed to be irq safe mimicking io_req_task_work_add(), that's how
> > > > nvme passthrough uses it as well (but at least it doesn't need the
> > > > cancellation bits).
> > > > 
> > > > One option is to replace it with a spinlock, the other is to delay
> > > > the io_uring_cmd_del_cancelable() call to the task_work callback.
> > > > The latter would be cleaner and more preferable, but I'm lacking
> > > > context to tell if that would be correct. Ming, what do you think?
> > > 
> > > I prefer to the latter approach because the two cancelable helpers are
> > > run in fast path.
> > > 
> > > Looks all new io_uring_cmd_complete() in ublk have this issue, and the
> > > following patch should avoid them all.
> > 
> > The one I have in mind on top of the current tree would be like below.
> > Untested, and doesn't allow this cancellation thing for iopoll. I'll
> > prepare something tomorrow.
> > 
> > 
> > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> > index e45d4cd5ef82..000ba435451c 100644
> > --- a/io_uring/uring_cmd.c
> > +++ b/io_uring/uring_cmd.c
> > @@ -14,19 +14,15 @@
> >   #include "rsrc.h"
> >   #include "uring_cmd.h"
> > 
> > -static void io_uring_cmd_del_cancelable(struct io_uring_cmd *cmd,
> > -        unsigned int issue_flags)
> > +static void io_uring_cmd_del_cancelable(struct io_uring_cmd *cmd)
> >   {
> >       struct io_kiocb *req = cmd_to_io_kiocb(cmd);
> > -    struct io_ring_ctx *ctx = req->ctx;
> > 
> >       if (!(cmd->flags & IORING_URING_CMD_CANCELABLE))
> >           return;
> > 
> >       cmd->flags &= ~IORING_URING_CMD_CANCELABLE;
> > -    io_ring_submit_lock(ctx, issue_flags);
> >       hlist_del(&req->hash_node);
> > -    io_ring_submit_unlock(ctx, issue_flags);
> >   }
> > 
> >   /*
> > @@ -80,6 +76,15 @@ static inline void io_req_set_cqe32_extra(struct io_kiocb *req,
> >       req->big_cqe.extra2 = extra2;
> >   }
> > 
> > +static void io_req_task_cmd_complete(struct io_kiocb *req,
> > +                     struct io_tw_state *ts)
> > +{
> > +    struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> > +
> > +    io_uring_cmd_del_cancelable(ioucmd);
> > +    io_req_task_complete(req, ts);
> > +}
> > +
> >   /*
> >    * Called by consumers of io_uring_cmd, if they originally returned
> >    * -EIOCBQUEUED upon receiving the command.
> > @@ -89,8 +94,6 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2,
> >   {
> >       struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
> > 
> > -    io_uring_cmd_del_cancelable(ioucmd, issue_flags);
> > -
> >       if (ret < 0)
> >           req_set_fail(req);
> > 
> > @@ -105,7 +108,7 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2,
> >               return;
> 
> Not very well thought through... Here should be a *del_cancelable call
> as well

Thanks for the fix!

The patch works after adding io_uring_cmd_del_cancelable() in the
`else if (issue_flags & IO_URING_F_COMPLETE_DEFER)' branch.
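
To spell it out, that branch ends up roughly like this on top of your
diff (just a sketch of what I tested, surrounding lines elided, not a
final patch):

	} else if (issue_flags & IO_URING_F_COMPLETE_DEFER) {
		io_uring_cmd_del_cancelable(ioucmd);	/* the missing call */
		io_req_complete_defer(req);
	} else {
		req->io_task_work.func = io_req_task_cmd_complete;
		io_req_task_work_add(req);
	}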


Thanks,
Ming


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-15 22:53 ` (subset) " Jens Axboe
  2024-03-16  2:03   ` Ming Lei
@ 2024-03-16 11:52   ` Ming Lei
  2024-03-16 13:27     ` Pavel Begunkov
  1 sibling, 1 reply; 54+ messages in thread
From: Ming Lei @ 2024-03-16 11:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, linux-block, Kanchan Joshi

On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
> 
> On Fri, 15 Mar 2024 15:29:50 +0000, Pavel Begunkov wrote:
> > Patch 1 is a fix.
> > 
> > Patches 2-7 are cleanups mainly dealing with issue_flags conversions,
> > misundertsandings of the flags and of the tw state. It'd be great to have
> > even without even w/o the rest.
> > 
> > 8-11 mandate ctx locking for task_work and finally removes the CQE
> > caches, instead we post directly into the CQ. Note that the cache is
> > used by multishot auxiliary completions.
> > 
> > [...]
> 
> Applied, thanks!
> 
> [02/11] io_uring/cmd: kill one issue_flags to tw conversion
>         commit: 31ab0342cf6434e1e2879d12f0526830ce97365d
> [03/11] io_uring/cmd: fix tw <-> issue_flags conversion
>         commit: b48f3e29b89055894b3f50c657658c325b5b49fd
> [04/11] io_uring/cmd: introduce io_uring_cmd_complete
>         commit: c5b4c92ca69215c0af17e4e9d8c84c8942f3257d
> [05/11] ublk: don't hard code IO_URING_F_UNLOCKED
>         commit: c54cfb81fe1774231fca952eff928389bfc3b2e3
> [06/11] nvme/io_uring: don't hard code IO_URING_F_UNLOCKED
>         commit: 800a90681f3c3383660a8e3e2d279e0f056afaee
> [07/11] io_uring/rw: avoid punting to io-wq directly
>         commit: 56d565d54373c17b7620fc605c899c41968e48d0
> [08/11] io_uring: force tw ctx locking
>         commit: f087cdd065af0418ffc8a9ed39eadc93347efdd5
> [09/11] io_uring: remove struct io_tw_state::locked
>         commit: 339f8d66e996ec52b47221448ff4b3534cc9a58d
> [10/11] io_uring: refactor io_fill_cqe_req_aux
>         commit: 7b31c3964b769a6a16c4e414baa8094b441e498e
> [11/11] io_uring: get rid of intermediate aux cqe caches
>         commit: 5a475a1f47412a44ed184aac04b9ff0aeaa31d65

Hi Jens and Pavel,

The following two errors can be triggered with this patchset
when running some ublk stress tests (io vs. deletion), and I don't see
such failures after reverting the 11 patches.

1) error 1

[  318.843517] ------------[ cut here ]------------
[  318.843937] kernel BUG at mm/slub.c:553!
[  318.844235] invalid opcode: 0000 [#1] SMP NOPTI
[  318.844580] CPU: 7 PID: 1475 Comm: kworker/u48:13 Not tainted 6.8.0_io_uring_6.10+ #14
[  318.845133] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[  318.845732] Workqueue: events_unbound io_ring_exit_work
[  318.846104] RIP: 0010:__slab_free+0x152/0x2f0
[  318.846434] Code: 00 4c 89 ff e8 ef 41 bc 00 48 8b 14 24 48 8b 4c 24 20 48 89 44 24 08 48 8b 03 48 c1 e8 09 83 e0 01 88 44 24 13 e9 71 ff4
[  318.851192] RSP: 0018:ffffb490411abcb0 EFLAGS: 00010246
[  318.851574] RAX: ffff8b0e871e44f0 RBX: fffff113841c7900 RCX: 0000000000200010
[  318.852032] RDX: ffff8b0e871e4400 RSI: fffff113841c7900 RDI: ffffb490411abd20
[  318.852521] RBP: ffffb490411abd50 R08: 0000000000000001 R09: ffffffffa17e4deb
[  318.852981] R10: 0000000000200010 R11: 0000000000000024 R12: ffff8b0e80292c00
[  318.853472] R13: ffff8b0e871e4400 R14: ffff8b0e80292c00 R15: ffffffffa17e4deb
[  318.853911] FS:  0000000000000000(0000) GS:ffff8b13e7b80000(0000) knlGS:0000000000000000
[  318.854448] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  318.854831] CR2: 00007fbce5249298 CR3: 0000000363020002 CR4: 0000000000770ef0
[  318.855291] PKRU: 55555554
[  318.855533] Call Trace:
[  318.855724]  <TASK>
[  318.855898]  ? die+0x36/0x90
[  318.856121]  ? do_trap+0xdd/0x100
[  318.856389]  ? __slab_free+0x152/0x2f0
[  318.856674]  ? do_error_trap+0x6a/0x90
[  318.856939]  ? __slab_free+0x152/0x2f0
[  318.857202]  ? exc_invalid_op+0x50/0x70
[  318.857505]  ? __slab_free+0x152/0x2f0
[  318.857770]  ? asm_exc_invalid_op+0x1a/0x20
[  318.858056]  ? io_req_caches_free+0x9b/0x100
[  318.858439]  ? io_req_caches_free+0x9b/0x100
[  318.858961]  ? __slab_free+0x152/0x2f0
[  318.859466]  ? __memcg_slab_free_hook+0xd9/0x130
[  318.859941]  ? io_req_caches_free+0x9b/0x100
[  318.860395]  kmem_cache_free+0x2eb/0x3b0
[  318.860826]  io_req_caches_free+0x9b/0x100
[  318.861190]  io_ring_exit_work+0x105/0x5c0
[  318.861496]  ? __schedule+0x3d4/0x1510
[  318.861761]  process_one_work+0x181/0x350
[  318.862042]  worker_thread+0x27e/0x390
[  318.862307]  ? __pfx_worker_thread+0x10/0x10
[  318.862621]  kthread+0xbb/0xf0
[  318.862854]  ? __pfx_kthread+0x10/0x10
[  318.863124]  ret_from_fork+0x31/0x50
[  318.863397]  ? __pfx_kthread+0x10/0x10
[  318.863665]  ret_from_fork_asm+0x1a/0x30
[  318.863943]  </TASK>
[  318.864122] Modules linked in: isofs binfmt_misc xfs vfat fat raid0 iTCO_wdt intel_pmc_bxt iTCO_vendor_support virtio_net net_failover i2g
[  318.865638] ---[ end trace 0000000000000000 ]---
[  318.865966] RIP: 0010:__slab_free+0x152/0x2f0
[  318.866267] Code: 00 4c 89 ff e8 ef 41 bc 00 48 8b 14 24 48 8b 4c 24 20 48 89 44 24 08 48 8b 03 48 c1 e8 09 83 e0 01 88 44 24 13 e9 71 ff4
[  318.867622] RSP: 0018:ffffb490411abcb0 EFLAGS: 00010246
[  318.868103] RAX: ffff8b0e871e44f0 RBX: fffff113841c7900 RCX: 0000000000200010
[  318.868602] RDX: ffff8b0e871e4400 RSI: fffff113841c7900 RDI: ffffb490411abd20
[  318.869051] RBP: ffffb490411abd50 R08: 0000000000000001 R09: ffffffffa17e4deb
[  318.869544] R10: 0000000000200010 R11: 0000000000000024 R12: ffff8b0e80292c00
[  318.870028] R13: ffff8b0e871e4400 R14: ffff8b0e80292c00 R15: ffffffffa17e4deb
[  318.870550] FS:  0000000000000000(0000) GS:ffff8b13e7b80000(0000) knlGS:0000000000000000
[  318.871080] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  318.871509] CR2: 00007fbce5249298 CR3: 0000000363020002 CR4: 0000000000770ef0
[  318.871974] PKRU: 55555554

2) error 2

[ 2833.161174] ------------[ cut here ]------------
[ 2833.161527] WARNING: CPU: 11 PID: 22867 at kernel/fork.c:969 __put_task_struct+0x10c/0x180
[ 2833.162114] Modules linked in: isofs binfmt_misc vfat fat xfs raid0 iTCO_wdt intel_pmc_bxt iTCO_vendor_support i2c_i801 virtio_net i2c_smbus net_failover failover lpc_ich ublk_drv loop zram nvme nvme_core usb_storage crc32c_intel virtio_scsi virtio_blk fuse qemu_fw_cfg
[ 2833.163650] CPU: 11 PID: 22867 Comm: kworker/11:0 Tainted: G      D W          6.8.0_io_uring_6.10+ #14
[ 2833.164289] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[ 2833.164860] Workqueue: events io_fallback_req_func
[ 2833.165224] RIP: 0010:__put_task_struct+0x10c/0x180
[ 2833.165586] Code: 48 85 d2 74 05 f0 ff 0a 74 44 48 8b 3d d5 b6 c7 02 48 89 ee e8 65 b7 2d 00 eb ac be 03 00 00 00 48 89 ef e8 36 82 70 00 eb 9d <0f> 0b 8b 43 28 85 c0 0f 84 0e ff ff ff 0f 0b 65 48 3b 1d 5d d2 f2
[ 2833.166819] RSP: 0018:ffffb89da07a7df8 EFLAGS: 00010246
[ 2833.167210] RAX: 0000000000000000 RBX: ffff97d7d9332ec0 RCX: 0000000000000000
[ 2833.167685] RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff97d7d9332ec0
[ 2833.168167] RBP: ffff97d6cd9cc000 R08: 0000000000000000 R09: 0000000000000000
[ 2833.168664] R10: ffffb89da07a7db0 R11: 0000000000000100 R12: ffff97d7dee497f0
[ 2833.169161] R13: ffff97d7dee497f0 R14: ffff97d7400e9d00 R15: ffff97d6cd9cc410
[ 2833.169621] FS:  0000000000000000(0000) GS:ffff97dc27d80000(0000) knlGS:0000000000000000
[ 2833.170196] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2833.170578] CR2: 00007fa731cfe648 CR3: 0000000220b30004 CR4: 0000000000770ef0
[ 2833.171158] PKRU: 55555554
[ 2833.171394] Call Trace:
[ 2833.171596]  <TASK>
[ 2833.171779]  ? __warn+0x80/0x110
[ 2833.172044]  ? __put_task_struct+0x10c/0x180
[ 2833.172367]  ? report_bug+0x150/0x170
[ 2833.172637]  ? handle_bug+0x41/0x70
[ 2833.172899]  ? exc_invalid_op+0x17/0x70
[ 2833.173203]  ? asm_exc_invalid_op+0x1a/0x20
[ 2833.173522]  ? __put_task_struct+0x10c/0x180
[ 2833.173826]  ? io_put_task_remote+0x80/0x90
[ 2833.174153]  __io_submit_flush_completions+0x2bd/0x380
[ 2833.174509]  io_fallback_req_func+0xa3/0x130
[ 2833.174806]  process_one_work+0x181/0x350
[ 2833.175105]  worker_thread+0x27e/0x390
[ 2833.175394]  ? __pfx_worker_thread+0x10/0x10
[ 2833.175690]  kthread+0xbb/0xf0
[ 2833.175920]  ? __pfx_kthread+0x10/0x10
[ 2833.176226]  ret_from_fork+0x31/0x50
[ 2833.176485]  ? __pfx_kthread+0x10/0x10
[ 2833.176751]  ret_from_fork_asm+0x1a/0x30
[ 2833.177044]  </TASK>
[ 2833.177256] ---[ end trace 0000000000000000 ]---
[ 2833.177586] BUG: kernel NULL pointer dereference, address: 00000000000000e8
[ 2833.178054] #PF: supervisor read access in kernel mode
[ 2833.178424] #PF: error_code(0x0000) - not-present page
[ 2833.178776] PGD 21f4f9067 P4D 21f4f9067 PUD 21f4fa067 PMD 0
[ 2833.179182] Oops: 0000 [#3] SMP NOPTI
[ 2833.179464] CPU: 11 PID: 22867 Comm: kworker/11:0 Tainted: G      D W          6.8.0_io_uring_6.10+ #14
[ 2833.180110] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[ 2833.180692] Workqueue: events io_fallback_req_func
[ 2833.181042] RIP: 0010:percpu_counter_add_batch+0x19/0x80
[ 2833.181430] Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 41 55 41 54 55 53 9c 58 0f 1f 40 00 49 89 c5 fa 0f 1f 44 00 00 <48> 8b 4f 20 48 63 d2 65 48 63 19 49 89 dc 48 01 f3 48 89 d8 48 f7
[ 2833.182623] RSP: 0018:ffffb89da07a7dd0 EFLAGS: 00010006
[ 2833.186362] RAX: 0000000000000206 RBX: ffff97d7d9332ec0 RCX: 0000000000000000
[ 2833.186825] RDX: 0000000000000020 RSI: ffffffffffffffff RDI: 00000000000000c8
[ 2833.187326] RBP: 0000000000000000 R08: 0000000000000246 R09: 0000000000020001
[ 2833.187783] R10: 0000000000020001 R11: 0000000000000032 R12: ffff97d7dee497f0
[ 2833.188284] R13: 0000000000000206 R14: ffff97d7400e9d00 R15: ffff97d6cd9cc410
[ 2833.188741] FS:  0000000000000000(0000) GS:ffff97dc27d80000(0000) knlGS:0000000000000000
[ 2833.189310] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2833.189709] CR2: 00000000000000e8 CR3: 0000000220b30004 CR4: 0000000000770ef0
[ 2833.190205] PKRU: 55555554
[ 2833.190418] Call Trace:
[ 2833.190615]  <TASK>
[ 2833.190795]  ? __die+0x23/0x70
[ 2833.191053]  ? page_fault_oops+0x173/0x4f0
[ 2833.191362]  ? exc_page_fault+0x76/0x150
[ 2833.191654]  ? asm_exc_page_fault+0x26/0x30
[ 2833.191968]  ? percpu_counter_add_batch+0x19/0x80
[ 2833.192313]  io_put_task_remote+0x2a/0x90
[ 2833.192594]  __io_submit_flush_completions+0x2bd/0x380
[ 2833.192944]  io_fallback_req_func+0xa3/0x130
[ 2833.193273]  process_one_work+0x181/0x350
[ 2833.193550]  worker_thread+0x27e/0x390
[ 2833.193813]  ? __pfx_worker_thread+0x10/0x10
[ 2833.194123]  kthread+0xbb/0xf0
[ 2833.194369]  ? __pfx_kthread+0x10/0x10
[ 2833.194638]  ret_from_fork+0x31/0x50
[ 2833.194899]  ? __pfx_kthread+0x10/0x10
[ 2833.195213]  ret_from_fork_asm+0x1a/0x30
[ 2833.195484]  </TASK>
[ 2833.195661] Modules linked in: isofs binfmt_misc vfat fat xfs raid0 iTCO_wdt intel_pmc_bxt iTCO_vendor_support i2c_i801 virtio_net i2c_smbus net_failover failover lpc_ich ublk_drv loop zram nvme nvme_core usb_storage crc32c_intel virtio_scsi virtio_blk fuse qemu_fw_cfg
[ 2833.197148] CR2: 00000000000000e8
[ 2833.197400] ---[ end trace 0000000000000000 ]---
[ 2833.197714] RIP: 0010:percpu_counter_add_batch+0x19/0x80
[ 2833.198078] Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 41 55 41 54 55 53 9c 58 0f 1f 40 00 49 89 c5 fa 0f 1f 44 00 00 <48> 8b 4f 20 48 63 d2 65 48 63 19 49 89 dc 48 01 f3 48 89 d8 48 f7
[ 2833.199261] RSP: 0018:ffffb89d8b0f7dd0 EFLAGS: 00010006
[ 2833.199599] RAX: 0000000000000206 RBX: ffff97d77b830000 RCX: 0000000080020001
[ 2833.200051] RDX: 0000000000000020 RSI: ffffffffffffffff RDI: 00000000000000c8
[ 2833.200515] RBP: 0000000000000000 R08: ffff97d77b830000 R09: 0000000080020001
[ 2833.200956] R10: 0000000080020001 R11: 0000000000000016 R12: ffff97d75210c6c0
[ 2833.201439] R13: 0000000000000206 R14: ffff97d7518f3800 R15: ffff97d6c304bc10
[ 2833.201894] FS:  0000000000000000(0000) GS:ffff97dc27d80000(0000) knlGS:0000000000000000
[ 2833.202455] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2833.202832] CR2: 00000000000000e8 CR3: 0000000220b30004 CR4: 0000000000770ef0
[ 2833.203316] PKRU: 55555554
[ 2833.203524] note: kworker/11:0[22867] exited with irqs disabled


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-16 11:52   ` Ming Lei
@ 2024-03-16 13:27     ` Pavel Begunkov
  2024-03-16 13:56       ` Ming Lei
  2024-03-16 14:39       ` Jens Axboe
  0 siblings, 2 replies; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-16 13:27 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: io-uring, linux-block, Kanchan Joshi

On 3/16/24 11:52, Ming Lei wrote:
> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>
>> On Fri, 15 Mar 2024 15:29:50 +0000, Pavel Begunkov wrote:
>>> Patch 1 is a fix.
>>>
>>> Patches 2-7 are cleanups mainly dealing with issue_flags conversions,
>>> misundertsandings of the flags and of the tw state. It'd be great to have
>>> even without even w/o the rest.
>>>
>>> 8-11 mandate ctx locking for task_work and finally removes the CQE
>>> caches, instead we post directly into the CQ. Note that the cache is
>>> used by multishot auxiliary completions.
>>>
>>> [...]
>>
>> Applied, thanks!
>>
>> [02/11] io_uring/cmd: kill one issue_flags to tw conversion
>>          commit: 31ab0342cf6434e1e2879d12f0526830ce97365d
>> [03/11] io_uring/cmd: fix tw <-> issue_flags conversion
>>          commit: b48f3e29b89055894b3f50c657658c325b5b49fd
>> [04/11] io_uring/cmd: introduce io_uring_cmd_complete
>>          commit: c5b4c92ca69215c0af17e4e9d8c84c8942f3257d
>> [05/11] ublk: don't hard code IO_URING_F_UNLOCKED
>>          commit: c54cfb81fe1774231fca952eff928389bfc3b2e3
>> [06/11] nvme/io_uring: don't hard code IO_URING_F_UNLOCKED
>>          commit: 800a90681f3c3383660a8e3e2d279e0f056afaee
>> [07/11] io_uring/rw: avoid punting to io-wq directly
>>          commit: 56d565d54373c17b7620fc605c899c41968e48d0
>> [08/11] io_uring: force tw ctx locking
>>          commit: f087cdd065af0418ffc8a9ed39eadc93347efdd5
>> [09/11] io_uring: remove struct io_tw_state::locked
>>          commit: 339f8d66e996ec52b47221448ff4b3534cc9a58d
>> [10/11] io_uring: refactor io_fill_cqe_req_aux
>>          commit: 7b31c3964b769a6a16c4e414baa8094b441e498e
>> [11/11] io_uring: get rid of intermediate aux cqe caches
>>          commit: 5a475a1f47412a44ed184aac04b9ff0aeaa31d65
> 
> Hi Jens and Pavel,

Jens, I hope you already dropped the series for now, right?

> 
> The following two error can be triggered with this patchset
> when running some ublk stress test(io vs. deletion). And not see
> such failures after reverting the 11 patches.

I suppose it's with the fix from yesterday. How can I
reproduce it, blktests?



> 1) error 1
> 
> [  318.843517] ------------[ cut here ]------------
> [  318.843937] kernel BUG at mm/slub.c:553!
> [  318.844235] invalid opcode: 0000 [#1] SMP NOPTI
> [  318.844580] CPU: 7 PID: 1475 Comm: kworker/u48:13 Not tainted 6.8.0_io_uring_6.10+ #14
> [  318.845133] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> [  318.845732] Workqueue: events_unbound io_ring_exit_work
> [  318.846104] RIP: 0010:__slab_free+0x152/0x2f0
> [  318.846434] Code: 00 4c 89 ff e8 ef 41 bc 00 48 8b 14 24 48 8b 4c 24 20 48 89 44 24 08 48 8b 03 48 c1 e8 09 83 e0 01 88 44 24 13 e9 71 ff4
> [  318.851192] RSP: 0018:ffffb490411abcb0 EFLAGS: 00010246
> [  318.851574] RAX: ffff8b0e871e44f0 RBX: fffff113841c7900 RCX: 0000000000200010
> [  318.852032] RDX: ffff8b0e871e4400 RSI: fffff113841c7900 RDI: ffffb490411abd20
> [  318.852521] RBP: ffffb490411abd50 R08: 0000000000000001 R09: ffffffffa17e4deb
> [  318.852981] R10: 0000000000200010 R11: 0000000000000024 R12: ffff8b0e80292c00
> [  318.853472] R13: ffff8b0e871e4400 R14: ffff8b0e80292c00 R15: ffffffffa17e4deb
> [  318.853911] FS:  0000000000000000(0000) GS:ffff8b13e7b80000(0000) knlGS:0000000000000000
> [  318.854448] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  318.854831] CR2: 00007fbce5249298 CR3: 0000000363020002 CR4: 0000000000770ef0
> [  318.855291] PKRU: 55555554
> [  318.855533] Call Trace:
> [  318.855724]  <TASK>
> [  318.855898]  ? die+0x36/0x90
> [  318.856121]  ? do_trap+0xdd/0x100
> [  318.856389]  ? __slab_free+0x152/0x2f0
> [  318.856674]  ? do_error_trap+0x6a/0x90
> [  318.856939]  ? __slab_free+0x152/0x2f0
> [  318.857202]  ? exc_invalid_op+0x50/0x70
> [  318.857505]  ? __slab_free+0x152/0x2f0
> [  318.857770]  ? asm_exc_invalid_op+0x1a/0x20
> [  318.858056]  ? io_req_caches_free+0x9b/0x100
> [  318.858439]  ? io_req_caches_free+0x9b/0x100
> [  318.858961]  ? __slab_free+0x152/0x2f0
> [  318.859466]  ? __memcg_slab_free_hook+0xd9/0x130
> [  318.859941]  ? io_req_caches_free+0x9b/0x100
> [  318.860395]  kmem_cache_free+0x2eb/0x3b0
> [  318.860826]  io_req_caches_free+0x9b/0x100
> [  318.861190]  io_ring_exit_work+0x105/0x5c0
> [  318.861496]  ? __schedule+0x3d4/0x1510
> [  318.861761]  process_one_work+0x181/0x350
> [  318.862042]  worker_thread+0x27e/0x390
> [  318.862307]  ? __pfx_worker_thread+0x10/0x10
> [  318.862621]  kthread+0xbb/0xf0
> [  318.862854]  ? __pfx_kthread+0x10/0x10
> [  318.863124]  ret_from_fork+0x31/0x50
> [  318.863397]  ? __pfx_kthread+0x10/0x10
> [  318.863665]  ret_from_fork_asm+0x1a/0x30
> [  318.863943]  </TASK>
> [  318.864122] Modules linked in: isofs binfmt_misc xfs vfat fat raid0 iTCO_wdt intel_pmc_bxt iTCO_vendor_support virtio_net net_failover i2g
> [  318.865638] ---[ end trace 0000000000000000 ]---
> [  318.865966] RIP: 0010:__slab_free+0x152/0x2f0
> [  318.866267] Code: 00 4c 89 ff e8 ef 41 bc 00 48 8b 14 24 48 8b 4c 24 20 48 89 44 24 08 48 8b 03 48 c1 e8 09 83 e0 01 88 44 24 13 e9 71 ff4
> [  318.867622] RSP: 0018:ffffb490411abcb0 EFLAGS: 00010246
> [  318.868103] RAX: ffff8b0e871e44f0 RBX: fffff113841c7900 RCX: 0000000000200010
> [  318.868602] RDX: ffff8b0e871e4400 RSI: fffff113841c7900 RDI: ffffb490411abd20
> [  318.869051] RBP: ffffb490411abd50 R08: 0000000000000001 R09: ffffffffa17e4deb
> [  318.869544] R10: 0000000000200010 R11: 0000000000000024 R12: ffff8b0e80292c00
> [  318.870028] R13: ffff8b0e871e4400 R14: ffff8b0e80292c00 R15: ffffffffa17e4deb
> [  318.870550] FS:  0000000000000000(0000) GS:ffff8b13e7b80000(0000) knlGS:0000000000000000
> [  318.871080] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  318.871509] CR2: 00007fbce5249298 CR3: 0000000363020002 CR4: 0000000000770ef0
> [  318.871974] PKRU: 55555554
> 
> 2) error 2
> 
> [ 2833.161174] ------------[ cut here ]------------
> [ 2833.161527] WARNING: CPU: 11 PID: 22867 at kernel/fork.c:969 __put_task_struct+0x10c/0x180
> [ 2833.162114] Modules linked in: isofs binfmt_misc vfat fat xfs raid0 iTCO_wdt intel_pmc_bxt iTCO_vendor_support i2c_i801 virtio_net i2c_smbus net_failover failover lpc_ich ublk_drv loop zram nvme nvme_core usb_storage crc32c_intel virtio_scsi virtio_blk fuse qemu_fw_cfg
> [ 2833.163650] CPU: 11 PID: 22867 Comm: kworker/11:0 Tainted: G      D W          6.8.0_io_uring_6.10+ #14
> [ 2833.164289] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> [ 2833.164860] Workqueue: events io_fallback_req_func
> [ 2833.165224] RIP: 0010:__put_task_struct+0x10c/0x180
> [ 2833.165586] Code: 48 85 d2 74 05 f0 ff 0a 74 44 48 8b 3d d5 b6 c7 02 48 89 ee e8 65 b7 2d 00 eb ac be 03 00 00 00 48 89 ef e8 36 82 70 00 eb 9d <0f> 0b 8b 43 28 85 c0 0f 84 0e ff ff ff 0f 0b 65 48 3b 1d 5d d2 f2
> [ 2833.166819] RSP: 0018:ffffb89da07a7df8 EFLAGS: 00010246
> [ 2833.167210] RAX: 0000000000000000 RBX: ffff97d7d9332ec0 RCX: 0000000000000000
> [ 2833.167685] RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff97d7d9332ec0
> [ 2833.168167] RBP: ffff97d6cd9cc000 R08: 0000000000000000 R09: 0000000000000000
> [ 2833.168664] R10: ffffb89da07a7db0 R11: 0000000000000100 R12: ffff97d7dee497f0
> [ 2833.169161] R13: ffff97d7dee497f0 R14: ffff97d7400e9d00 R15: ffff97d6cd9cc410
> [ 2833.169621] FS:  0000000000000000(0000) GS:ffff97dc27d80000(0000) knlGS:0000000000000000
> [ 2833.170196] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2833.170578] CR2: 00007fa731cfe648 CR3: 0000000220b30004 CR4: 0000000000770ef0
> [ 2833.171158] PKRU: 55555554
> [ 2833.171394] Call Trace:
> [ 2833.171596]  <TASK>
> [ 2833.171779]  ? __warn+0x80/0x110
> [ 2833.172044]  ? __put_task_struct+0x10c/0x180
> [ 2833.172367]  ? report_bug+0x150/0x170
> [ 2833.172637]  ? handle_bug+0x41/0x70
> [ 2833.172899]  ? exc_invalid_op+0x17/0x70
> [ 2833.173203]  ? asm_exc_invalid_op+0x1a/0x20
> [ 2833.173522]  ? __put_task_struct+0x10c/0x180
> [ 2833.173826]  ? io_put_task_remote+0x80/0x90
> [ 2833.174153]  __io_submit_flush_completions+0x2bd/0x380
> [ 2833.174509]  io_fallback_req_func+0xa3/0x130
> [ 2833.174806]  process_one_work+0x181/0x350
> [ 2833.175105]  worker_thread+0x27e/0x390
> [ 2833.175394]  ? __pfx_worker_thread+0x10/0x10
> [ 2833.175690]  kthread+0xbb/0xf0
> [ 2833.175920]  ? __pfx_kthread+0x10/0x10
> [ 2833.176226]  ret_from_fork+0x31/0x50
> [ 2833.176485]  ? __pfx_kthread+0x10/0x10
> [ 2833.176751]  ret_from_fork_asm+0x1a/0x30
> [ 2833.177044]  </TASK>
> [ 2833.177256] ---[ end trace 0000000000000000 ]---
> [ 2833.177586] BUG: kernel NULL pointer dereference, address: 00000000000000e8
> [ 2833.178054] #PF: supervisor read access in kernel mode
> [ 2833.178424] #PF: error_code(0x0000) - not-present page
> [ 2833.178776] PGD 21f4f9067 P4D 21f4f9067 PUD 21f4fa067 PMD 0
> [ 2833.179182] Oops: 0000 [#3] SMP NOPTI
> [ 2833.179464] CPU: 11 PID: 22867 Comm: kworker/11:0 Tainted: G      D W          6.8.0_io_uring_6.10+ #14
> [ 2833.180110] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> [ 2833.180692] Workqueue: events io_fallback_req_func
> [ 2833.181042] RIP: 0010:percpu_counter_add_batch+0x19/0x80
> [ 2833.181430] Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 41 55 41 54 55 53 9c 58 0f 1f 40 00 49 89 c5 fa 0f 1f 44 00 00 <48> 8b 4f 20 48 63 d2 65 48 63 19 49 89 dc 48 01 f3 48 89 d8 48 f7
> [ 2833.182623] RSP: 0018:ffffb89da07a7dd0 EFLAGS: 00010006
> [ 2833.186362] RAX: 0000000000000206 RBX: ffff97d7d9332ec0 RCX: 0000000000000000
> [ 2833.186825] RDX: 0000000000000020 RSI: ffffffffffffffff RDI: 00000000000000c8
> [ 2833.187326] RBP: 0000000000000000 R08: 0000000000000246 R09: 0000000000020001
> [ 2833.187783] R10: 0000000000020001 R11: 0000000000000032 R12: ffff97d7dee497f0
> [ 2833.188284] R13: 0000000000000206 R14: ffff97d7400e9d00 R15: ffff97d6cd9cc410
> [ 2833.188741] FS:  0000000000000000(0000) GS:ffff97dc27d80000(0000) knlGS:0000000000000000
> [ 2833.189310] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2833.189709] CR2: 00000000000000e8 CR3: 0000000220b30004 CR4: 0000000000770ef0
> [ 2833.190205] PKRU: 55555554
> [ 2833.190418] Call Trace:
> [ 2833.190615]  <TASK>
> [ 2833.190795]  ? __die+0x23/0x70
> [ 2833.191053]  ? page_fault_oops+0x173/0x4f0
> [ 2833.191362]  ? exc_page_fault+0x76/0x150
> [ 2833.191654]  ? asm_exc_page_fault+0x26/0x30
> [ 2833.191968]  ? percpu_counter_add_batch+0x19/0x80
> [ 2833.192313]  io_put_task_remote+0x2a/0x90
> [ 2833.192594]  __io_submit_flush_completions+0x2bd/0x380
> [ 2833.192944]  io_fallback_req_func+0xa3/0x130
> [ 2833.193273]  process_one_work+0x181/0x350
> [ 2833.193550]  worker_thread+0x27e/0x390
> [ 2833.193813]  ? __pfx_worker_thread+0x10/0x10
> [ 2833.194123]  kthread+0xbb/0xf0
> [ 2833.194369]  ? __pfx_kthread+0x10/0x10
> [ 2833.194638]  ret_from_fork+0x31/0x50
> [ 2833.194899]  ? __pfx_kthread+0x10/0x10
> [ 2833.195213]  ret_from_fork_asm+0x1a/0x30
> [ 2833.195484]  </TASK>
> [ 2833.195661] Modules linked in: isofs binfmt_misc vfat fat xfs raid0 iTCO_wdt intel_pmc_bxt iTCO_vendor_support i2c_i801 virtio_net i2c_smbus net_failover failover lpc_ich ublk_drv loop zram nvme nvme_core usb_storage crc32c_intel virtio_scsi virtio_blk fuse qemu_fw_cfg
> [ 2833.197148] CR2: 00000000000000e8
> [ 2833.197400] ---[ end trace 0000000000000000 ]---
> [ 2833.197714] RIP: 0010:percpu_counter_add_batch+0x19/0x80
> [ 2833.198078] Code: 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 41 55 41 54 55 53 9c 58 0f 1f 40 00 49 89 c5 fa 0f 1f 44 00 00 <48> 8b 4f 20 48 63 d2 65 48 63 19 49 89 dc 48 01 f3 48 89 d8 48 f7
> [ 2833.199261] RSP: 0018:ffffb89d8b0f7dd0 EFLAGS: 00010006
> [ 2833.199599] RAX: 0000000000000206 RBX: ffff97d77b830000 RCX: 0000000080020001
> [ 2833.200051] RDX: 0000000000000020 RSI: ffffffffffffffff RDI: 00000000000000c8
> [ 2833.200515] RBP: 0000000000000000 R08: ffff97d77b830000 R09: 0000000080020001
> [ 2833.200956] R10: 0000000080020001 R11: 0000000000000016 R12: ffff97d75210c6c0
> [ 2833.201439] R13: 0000000000000206 R14: ffff97d7518f3800 R15: ffff97d6c304bc10
> [ 2833.201894] FS:  0000000000000000(0000) GS:ffff97dc27d80000(0000) knlGS:0000000000000000
> [ 2833.202455] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2833.202832] CR2: 00000000000000e8 CR3: 0000000220b30004 CR4: 0000000000770ef0
> [ 2833.203316] PKRU: 55555554
> [ 2833.203524] note: kworker/11:0[22867] exited with irqs disabled
> 
> 
> Thanks,
> Ming
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-16 13:27     ` Pavel Begunkov
@ 2024-03-16 13:56       ` Ming Lei
  2024-03-17 20:55         ` Pavel Begunkov
  2024-03-16 14:39       ` Jens Axboe
  1 sibling, 1 reply; 54+ messages in thread
From: Ming Lei @ 2024-03-16 13:56 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-block, Kanchan Joshi

On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
> On 3/16/24 11:52, Ming Lei wrote:
> > On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:

...

> > The following two error can be triggered with this patchset
> > when running some ublk stress test(io vs. deletion). And not see
> > such failures after reverting the 11 patches.
> 
> I suppose it's with the fix from yesterday. How can I
> reproduce it, blktests?

Yeah, it needs yesterday's fix.

You may need to run this test multiple times for triggering the problem:

1) git clone https://github.com/ublk-org/ublksrv.git

2) cd ublksrv

3) make test T=generic/004
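
Since a single pass often completes cleanly, a loop along these lines
(run from the ublksrv directory; the iteration count is arbitrary) is a
reasonable way to drive the repro:

i=0
while [ "$i" -lt 30 ]; do
	make test T=generic/004 || break
	i=$((i + 1))
done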


Thanks,
Ming


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-16 13:27     ` Pavel Begunkov
  2024-03-16 13:56       ` Ming Lei
@ 2024-03-16 14:39       ` Jens Axboe
  1 sibling, 0 replies; 54+ messages in thread
From: Jens Axboe @ 2024-03-16 14:39 UTC (permalink / raw)
  To: Pavel Begunkov, Ming Lei; +Cc: io-uring, linux-block, Kanchan Joshi

On 3/16/24 7:27 AM, Pavel Begunkov wrote:
> On 3/16/24 11:52, Ming Lei wrote:
>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>>
>>> On Fri, 15 Mar 2024 15:29:50 +0000, Pavel Begunkov wrote:
>>>> Patch 1 is a fix.
>>>>
>>>> Patches 2-7 are cleanups mainly dealing with issue_flags conversions,
>>>> misundertsandings of the flags and of the tw state. It'd be great to have
>>>> even without even w/o the rest.
>>>>
>>>> 8-11 mandate ctx locking for task_work and finally removes the CQE
>>>> caches, instead we post directly into the CQ. Note that the cache is
>>>> used by multishot auxiliary completions.
>>>>
>>>> [...]
>>>
>>> Applied, thanks!
>>>
>>> [02/11] io_uring/cmd: kill one issue_flags to tw conversion
>>>          commit: 31ab0342cf6434e1e2879d12f0526830ce97365d
>>> [03/11] io_uring/cmd: fix tw <-> issue_flags conversion
>>>          commit: b48f3e29b89055894b3f50c657658c325b5b49fd
>>> [04/11] io_uring/cmd: introduce io_uring_cmd_complete
>>>          commit: c5b4c92ca69215c0af17e4e9d8c84c8942f3257d
>>> [05/11] ublk: don't hard code IO_URING_F_UNLOCKED
>>>          commit: c54cfb81fe1774231fca952eff928389bfc3b2e3
>>> [06/11] nvme/io_uring: don't hard code IO_URING_F_UNLOCKED
>>>          commit: 800a90681f3c3383660a8e3e2d279e0f056afaee
>>> [07/11] io_uring/rw: avoid punting to io-wq directly
>>>          commit: 56d565d54373c17b7620fc605c899c41968e48d0
>>> [08/11] io_uring: force tw ctx locking
>>>          commit: f087cdd065af0418ffc8a9ed39eadc93347efdd5
>>> [09/11] io_uring: remove struct io_tw_state::locked
>>>          commit: 339f8d66e996ec52b47221448ff4b3534cc9a58d
>>> [10/11] io_uring: refactor io_fill_cqe_req_aux
>>>          commit: 7b31c3964b769a6a16c4e414baa8094b441e498e
>>> [11/11] io_uring: get rid of intermediate aux cqe caches
>>>          commit: 5a475a1f47412a44ed184aac04b9ff0aeaa31d65
>>
>> Hi Jens and Pavel,
> 
> Jens, I hope you already dropped the series for now, right?

It's just sitting in a branch for now; it's not even in linux-next. I'll
review and look at a v2 of the series. So it hasn't moved anywhere yet.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-16 13:56       ` Ming Lei
@ 2024-03-17 20:55         ` Pavel Begunkov
  2024-03-17 21:24           ` Jens Axboe
  0 siblings, 1 reply; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-17 20:55 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, io-uring, linux-block, Kanchan Joshi

On 3/16/24 13:56, Ming Lei wrote:
> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
>> On 3/16/24 11:52, Ming Lei wrote:
>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
> 
> ...
> 
>>> The following two error can be triggered with this patchset
>>> when running some ublk stress test(io vs. deletion). And not see
>>> such failures after reverting the 11 patches.
>>
>> I suppose it's with the fix from yesterday. How can I
>> reproduce it, blktests?
> 
> Yeah, it needs yesterday's fix.
> 
> You may need to run this test multiple times for triggering the problem:

Thanks for all the testing. I've tried it; all ublk/generic tests hang
in userspace waiting for CQEs, but there are no complaints from the kernel.
However, the branch seems to be buggy even without my patches: I
consistently (within 5-15 minutes of running in a slow VM) hit a page
underflow by running the liburing tests. Not sure what that is yet, but
it might also be the reason.

I'll repost it with the locking fix for reference; it would make more
sense to retest ublk after figuring out what's up with the branch.


> 1) git clone https://github.com/ublk-org/ublksrv.git
> 
> 2) cd ublksrv
> 
> 3) make test T=generic/004

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-17 20:55         ` Pavel Begunkov
@ 2024-03-17 21:24           ` Jens Axboe
  2024-03-17 21:29             ` Pavel Begunkov
  0 siblings, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2024-03-17 21:24 UTC (permalink / raw)
  To: Pavel Begunkov, Ming Lei; +Cc: io-uring, linux-block, Kanchan Joshi

On 3/17/24 2:55 PM, Pavel Begunkov wrote:
> On 3/16/24 13:56, Ming Lei wrote:
>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
>>> On 3/16/24 11:52, Ming Lei wrote:
>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>
>> ...
>>
>>>> The following two error can be triggered with this patchset
>>>> when running some ublk stress test(io vs. deletion). And not see
>>>> such failures after reverting the 11 patches.
>>>
>>> I suppose it's with the fix from yesterday. How can I
>>> reproduce it, blktests?
>>
>> Yeah, it needs yesterday's fix.
>>
>> You may need to run this test multiple times for triggering the problem:
> 
> Thanks for all the testing. I've tried it, all ublk/generic tests hang
> in userspace waiting for CQEs but no complaints from the kernel.
> However, it seems the branch is buggy even without my patches, I
> consistently (5-15 minutes of running in a slow VM) hit page underflow
> by running liburing tests. Not sure what is that yet, but might also
> be the reason.

Hmm odd, there's nothing in there but your series and then the
io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
merge window -git cycle? Does it happen with io_uring-6.9 as well? I
haven't seen anything odd.

> I'll repost it with the locking fix for reference, would make more
> sense retesting ublk after figuring out what's up with the branch.

Yep if you repost it with the fix, I'll rebase for-6.10/io_uring.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-17 21:24           ` Jens Axboe
@ 2024-03-17 21:29             ` Pavel Begunkov
  2024-03-17 21:32               ` Jens Axboe
  0 siblings, 1 reply; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-17 21:29 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei; +Cc: io-uring, linux-block, Kanchan Joshi

On 3/17/24 21:24, Jens Axboe wrote:
> On 3/17/24 2:55 PM, Pavel Begunkov wrote:
>> On 3/16/24 13:56, Ming Lei wrote:
>>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
>>>> On 3/16/24 11:52, Ming Lei wrote:
>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>>
>>> ...
>>>
>>>>> The following two error can be triggered with this patchset
>>>>> when running some ublk stress test(io vs. deletion). And not see
>>>>> such failures after reverting the 11 patches.
>>>>
>>>> I suppose it's with the fix from yesterday. How can I
>>>> reproduce it, blktests?
>>>
>>> Yeah, it needs yesterday's fix.
>>>
>>> You may need to run this test multiple times for triggering the problem:
>>
>> Thanks for all the testing. I've tried it, all ublk/generic tests hang
>> in userspace waiting for CQEs but no complaints from the kernel.
>> However, it seems the branch is buggy even without my patches, I
>> consistently (5-15 minutes of running in a slow VM) hit page underflow
>> by running liburing tests. Not sure what is that yet, but might also
>> be the reason.
> 
> Hmm odd, there's nothing in there but your series and then the
> io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
> merge window -git cycle? Does it happen with io_uring-6.9 as well? I
> haven't seen anything odd.

I need to test io_uring-6.9. I actually checked the branch twice, both
times hitting the issue, and from the full recompilation and config
prompts I assumed you had pulled something in between (maybe not).

And yeah, I can't confirm it's specifically an io_uring bug; the
stack trace is usually some unmap or task exit, and sometimes it only
shows up when you try to shut down the VM after the tests.

>> I'll repost it with the locking fix for reference, would make more
>> sense retesting ublk after figuring out what's up with the branch.
> 
> Yep if you repost it with the fix, I'll rebase for-6.10/io_uring.
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-17 21:29             ` Pavel Begunkov
@ 2024-03-17 21:32               ` Jens Axboe
  2024-03-17 21:34                 ` Pavel Begunkov
  0 siblings, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2024-03-17 21:32 UTC (permalink / raw)
  To: Pavel Begunkov, Ming Lei; +Cc: io-uring, linux-block, Kanchan Joshi

On 3/17/24 3:29 PM, Pavel Begunkov wrote:
> On 3/17/24 21:24, Jens Axboe wrote:
>> On 3/17/24 2:55 PM, Pavel Begunkov wrote:
>>> On 3/16/24 13:56, Ming Lei wrote:
>>>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
>>>>> On 3/16/24 11:52, Ming Lei wrote:
>>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>>>
>>>> ...
>>>>
>>>>>> The following two error can be triggered with this patchset
>>>>>> when running some ublk stress test(io vs. deletion). And not see
>>>>>> such failures after reverting the 11 patches.
>>>>>
>>>>> I suppose it's with the fix from yesterday. How can I
>>>>> reproduce it, blktests?
>>>>
>>>> Yeah, it needs yesterday's fix.
>>>>
>>>> You may need to run this test multiple times for triggering the problem:
>>>
>>> Thanks for all the testing. I've tried it, all ublk/generic tests hang
>>> in userspace waiting for CQEs but no complaints from the kernel.
>>> However, it seems the branch is buggy even without my patches, I
>>> consistently (5-15 minutes of running in a slow VM) hit page underflow
>>> by running liburing tests. Not sure what is that yet, but might also
>>> be the reason.
>>
>> Hmm odd, there's nothing in there but your series and then the
>> io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
>> merge window -git cycle? Does it happen with io_uring-6.9 as well? I
>> haven't seen anything odd.
> 
> Need to test io_uring-6.9. I actually checked the branch twice, both
> with the issue, and by full recompilation and config prompts I assumed
> you pulled something in between (maybe not).
> 
> And yeah, I can't confirm it's specifically an io_uring bug, the
> stack trace is usually some unmap or task exit, sometimes it only
> shows when you try to shutdown the VM after tests.

Funky. I just ran a bunch of loops of liburing tests and Ming's ublksrv
test case as well on io_uring-6.9 and it all worked fine. Trying
liburing tests on for-6.10/io_uring as well now, but didn't see anything
the other times I ran it. In any case, once you repost I'll rebase and
then let's see if it hits again.

Did you run with KASAN enabled?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-17 21:32               ` Jens Axboe
@ 2024-03-17 21:34                 ` Pavel Begunkov
  2024-03-17 21:47                   ` Pavel Begunkov
  0 siblings, 1 reply; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-17 21:34 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei; +Cc: io-uring, linux-block, Kanchan Joshi

On 3/17/24 21:32, Jens Axboe wrote:
> On 3/17/24 3:29 PM, Pavel Begunkov wrote:
>> On 3/17/24 21:24, Jens Axboe wrote:
>>> On 3/17/24 2:55 PM, Pavel Begunkov wrote:
>>>> On 3/16/24 13:56, Ming Lei wrote:
>>>>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
>>>>>> On 3/16/24 11:52, Ming Lei wrote:
>>>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>>>>
>>>>> ...
>>>>>
>>>>>>> The following two error can be triggered with this patchset
>>>>>>> when running some ublk stress test(io vs. deletion). And not see
>>>>>>> such failures after reverting the 11 patches.
>>>>>>
>>>>>> I suppose it's with the fix from yesterday. How can I
>>>>>> reproduce it, blktests?
>>>>>
>>>>> Yeah, it needs yesterday's fix.
>>>>>
>>>>> You may need to run this test multiple times for triggering the problem:
>>>>
>>>> Thanks for all the testing. I've tried it, all ublk/generic tests hang
>>>> in userspace waiting for CQEs but no complaints from the kernel.
>>>> However, it seems the branch is buggy even without my patches, I
>>>> consistently (5-15 minutes of running in a slow VM) hit page underflow
>>>> by running liburing tests. Not sure what is that yet, but might also
>>>> be the reason.
>>>
>>> Hmm odd, there's nothing in there but your series and then the
>>> io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
>>> merge window -git cycle? Does it happen with io_uring-6.9 as well? I
>>> haven't seen anything odd.
>>
>> Need to test io_uring-6.9. I actually checked the branch twice, both
>> with the issue, and by full recompilation and config prompts I assumed
>> you pulled something in between (maybe not).
>>
>> And yeah, I can't confirm it's specifically an io_uring bug, the
>> stack trace is usually some unmap or task exit, sometimes it only
>> shows when you try to shutdown the VM after tests.
> 
> Funky. I just ran a bunch of loops of liburing tests and Ming's ublksrv
> test case as well on io_uring-6.9 and it all worked fine. Trying
> liburing tests on for-6.10/io_uring as well now, but didn't see anything
> the other times I ran it. In any case, once you repost I'll rebase and
> then let's see if it hits again.
> 
> Did you run with KASAN enabled

Yes, it's a debug kernel, full-on KASAN, lockdep and so on.
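
(For reference, a debug config in that spirit would typically carry at
least the options below; the exact set is an assumption, not the .config
that was actually used:)

CONFIG_DEBUG_KERNEL=y
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_ATOMIC_SLEEP=y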

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-17 21:34                 ` Pavel Begunkov
@ 2024-03-17 21:47                   ` Pavel Begunkov
  2024-03-17 21:51                     ` Jens Axboe
  0 siblings, 1 reply; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-17 21:47 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei; +Cc: io-uring, linux-block, Kanchan Joshi

On 3/17/24 21:34, Pavel Begunkov wrote:
> On 3/17/24 21:32, Jens Axboe wrote:
>> On 3/17/24 3:29 PM, Pavel Begunkov wrote:
>>> On 3/17/24 21:24, Jens Axboe wrote:
>>>> On 3/17/24 2:55 PM, Pavel Begunkov wrote:
>>>>> On 3/16/24 13:56, Ming Lei wrote:
>>>>>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
>>>>>>> On 3/16/24 11:52, Ming Lei wrote:
>>>>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>>> The following two error can be triggered with this patchset
>>>>>>>> when running some ublk stress test(io vs. deletion). And not see
>>>>>>>> such failures after reverting the 11 patches.
>>>>>>>
>>>>>>> I suppose it's with the fix from yesterday. How can I
>>>>>>> reproduce it, blktests?
>>>>>>
>>>>>> Yeah, it needs yesterday's fix.
>>>>>>
>>>>>> You may need to run this test multiple times for triggering the problem:
>>>>>
>>>>> Thanks for all the testing. I've tried it, all ublk/generic tests hang
>>>>> in userspace waiting for CQEs but no complaints from the kernel.
>>>>> However, it seems the branch is buggy even without my patches, I
>>>>> consistently (5-15 minutes of running in a slow VM) hit page underflow
>>>>> by running liburing tests. Not sure what is that yet, but might also
>>>>> be the reason.
>>>>
>>>> Hmm odd, there's nothing in there but your series and then the
>>>> io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
>>>> merge window -git cycle? Does it happen with io_uring-6.9 as well? I
>>>> haven't seen anything odd.
>>>
>>> Need to test io_uring-6.9. I actually checked the branch twice, both
>>> with the issue, and by full recompilation and config prompts I assumed
>>> you pulled something in between (maybe not).
>>>
>>> And yeah, I can't confirm it's specifically an io_uring bug, the
>>> stack trace is usually some unmap or task exit, sometimes it only
>>> shows when you try to shutdown the VM after tests.
>>
>> Funky. I just ran a bunch of loops of liburing tests and Ming's ublksrv
>> test case as well on io_uring-6.9 and it all worked fine. Trying
>> liburing tests on for-6.10/io_uring as well now, but didn't see anything
>> the other times I ran it. In any case, once you repost I'll rebase and
>> then let's see if it hits again.
>>
>> Did you run with KASAN enabled
> 
> Yes, it's a debug kernel, full on KASANs, lockdeps and so

And another note: I triggered it once (IIRC on shutdown) with the ublk
tests only, w/o liburing/tests, which likely limits it to either the
core io_uring infra or non-io_uring bugs.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-17 21:47                   ` Pavel Begunkov
@ 2024-03-17 21:51                     ` Jens Axboe
  2024-03-17 22:07                       ` Jens Axboe
  2024-03-17 23:16                       ` Pavel Begunkov
  0 siblings, 2 replies; 54+ messages in thread
From: Jens Axboe @ 2024-03-17 21:51 UTC (permalink / raw)
  To: Pavel Begunkov, Ming Lei; +Cc: io-uring, linux-block, Kanchan Joshi

On 3/17/24 3:47 PM, Pavel Begunkov wrote:
> On 3/17/24 21:34, Pavel Begunkov wrote:
>> On 3/17/24 21:32, Jens Axboe wrote:
>>> On 3/17/24 3:29 PM, Pavel Begunkov wrote:
>>>> On 3/17/24 21:24, Jens Axboe wrote:
>>>>> On 3/17/24 2:55 PM, Pavel Begunkov wrote:
>>>>>> On 3/16/24 13:56, Ming Lei wrote:
>>>>>>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
>>>>>>>> On 3/16/24 11:52, Ming Lei wrote:
>>>>>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>>>> The following two error can be triggered with this patchset
>>>>>>>>> when running some ublk stress test(io vs. deletion). And not see
>>>>>>>>> such failures after reverting the 11 patches.
>>>>>>>>
>>>>>>>> I suppose it's with the fix from yesterday. How can I
>>>>>>>> reproduce it, blktests?
>>>>>>>
>>>>>>> Yeah, it needs yesterday's fix.
>>>>>>>
>>>>>>> You may need to run this test multiple times for triggering the problem:
>>>>>>
>>>>>> Thanks for all the testing. I've tried it, all ublk/generic tests hang
>>>>>> in userspace waiting for CQEs but no complaints from the kernel.
>>>>>> However, it seems the branch is buggy even without my patches, I
>>>>>> consistently (5-15 minutes of running in a slow VM) hit page underflow
>>>>>> by running liburing tests. Not sure what is that yet, but might also
>>>>>> be the reason.
>>>>>
>>>>> Hmm odd, there's nothing in there but your series and then the
>>>>> io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
>>>>> merge window -git cycle? Does it happen with io_uring-6.9 as well? I
>>>>> haven't seen anything odd.
>>>>
>>>> Need to test io_uring-6.9. I actually checked the branch twice, both
>>>> with the issue, and by full recompilation and config prompts I assumed
>>>> you pulled something in between (maybe not).
>>>>
>>>> And yeah, I can't confirm it's specifically an io_uring bug, the
>>>> stack trace is usually some unmap or task exit, sometimes it only
>>>> shows when you try to shutdown the VM after tests.
>>>
>>> Funky. I just ran a bunch of loops of liburing tests and Ming's ublksrv
>>> test case as well on io_uring-6.9 and it all worked fine. Trying
>>> liburing tests on for-6.10/io_uring as well now, but didn't see anything
>>> the other times I ran it. In any case, once you repost I'll rebase and
>>> then let's see if it hits again.
>>>
>>> Did you run with KASAN enabled
>>
>> Yes, it's a debug kernel, full on KASANs, lockdeps and so
> 
> And another note, I triggered it once (IIRC on shutdown) with ublk
> tests only w/o liburing/tests, likely limits it to either the core
> io_uring infra or non-io_uring bugs.

Been running on for-6.10/io_uring, and the only odd thing I see is that
the test output tends to stall here:

Running test read-before-exit.t

which then either leads to my ssh connection into that VM dropping, or
just a long delay before it picks up again. This did not happen with
io_uring-6.9.

Maybe related? At least it's something new. I just checked again, and
yeah, it seems to totally lock up the VM while that test is running.
I'll try a quick bisect of that series.
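
Roughly, using the branch points named in this thread (io_uring-6.9 as
the known-good base, for-6.10/io_uring as the bad tip); a sketch, not the
exact steps I'll use:

git bisect start
git bisect bad for-6.10/io_uring
git bisect good io_uring-6.9
# build/boot, re-run the liburing tests (read-before-exit.t), then mark:
git bisect good    # or: git bisect bad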

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-17 21:51                     ` Jens Axboe
@ 2024-03-17 22:07                       ` Jens Axboe
  2024-03-17 22:24                         ` Jens Axboe
  2024-03-17 23:16                       ` Pavel Begunkov
  1 sibling, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2024-03-17 22:07 UTC (permalink / raw)
  To: Pavel Begunkov, Ming Lei; +Cc: io-uring, linux-block, Kanchan Joshi

On 3/17/24 3:51 PM, Jens Axboe wrote:
> On 3/17/24 3:47 PM, Pavel Begunkov wrote:
>> On 3/17/24 21:34, Pavel Begunkov wrote:
>>> On 3/17/24 21:32, Jens Axboe wrote:
>>>> On 3/17/24 3:29 PM, Pavel Begunkov wrote:
>>>>> On 3/17/24 21:24, Jens Axboe wrote:
>>>>>> On 3/17/24 2:55 PM, Pavel Begunkov wrote:
>>>>>>> On 3/16/24 13:56, Ming Lei wrote:
>>>>>>>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
>>>>>>>>> On 3/16/24 11:52, Ming Lei wrote:
>>>>>>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>>>>>>>
>>>>>>>> ...
>>>>>>>>
>>>>>>>>>> The following two error can be triggered with this patchset
>>>>>>>>>> when running some ublk stress test(io vs. deletion). And not see
>>>>>>>>>> such failures after reverting the 11 patches.
>>>>>>>>>
>>>>>>>>> I suppose it's with the fix from yesterday. How can I
>>>>>>>>> reproduce it, blktests?
>>>>>>>>
>>>>>>>> Yeah, it needs yesterday's fix.
>>>>>>>>
>>>>>>>> You may need to run this test multiple times for triggering the problem:
>>>>>>>
>>>>>>> Thanks for all the testing. I've tried it, all ublk/generic tests hang
>>>>>>> in userspace waiting for CQEs but no complaints from the kernel.
>>>>>>> However, it seems the branch is buggy even without my patches, I
>>>>>>> consistently (5-15 minutes of running in a slow VM) hit page underflow
>>>>>>> by running liburing tests. Not sure what is that yet, but might also
>>>>>>> be the reason.
>>>>>>
>>>>>> Hmm odd, there's nothing in there but your series and then the
>>>>>> io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
>>>>>> merge window -git cycle? Does it happen with io_uring-6.9 as well? I
>>>>>> haven't seen anything odd.
>>>>>
>>>>> Need to test io_uring-6.9. I actually checked the branch twice, both
>>>>> with the issue, and by full recompilation and config prompts I assumed
>>>>> you pulled something in between (maybe not).
>>>>>
>>>>> And yeah, I can't confirm it's specifically an io_uring bug, the
>>>>> stack trace is usually some unmap or task exit, sometimes it only
>>>>> shows when you try to shutdown the VM after tests.
>>>>
>>>> Funky. I just ran a bunch of loops of liburing tests and Ming's ublksrv
>>>> test case as well on io_uring-6.9 and it all worked fine. Trying
>>>> liburing tests on for-6.10/io_uring as well now, but didn't see anything
>>>> the other times I ran it. In any case, once you repost I'll rebase and
>>>> then let's see if it hits again.
>>>>
>>>> Did you run with KASAN enabled
>>>
>>> Yes, it's a debug kernel, full on KASANs, lockdeps and so
>>
>> And another note, I triggered it once (IIRC on shutdown) with ublk
>> tests only w/o liburing/tests, likely limits it to either the core
>> io_uring infra or non-io_uring bugs.
> 
> Been running on for-6.10/io_uring, and the only odd thing I see is that
> the test output tends to stall here:
> 
> Running test read-before-exit.t
> 
> which then either leads to a connection disconnect from my ssh into that
> vm, or just a long delay and then it picks up again. This did not happen
> with io_uring-6.9.
> 
> Maybe related? At least it's something new. Just checked again, and yeah
> it seems to totally lock up the vm while that is running. Will try a
> quick bisect of that series.

Seems to be triggered by the top-of-branch patch in there, my poll and
timeout special casing. While the above test case is running with that
commit, it'll freeze the host.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-17 22:07                       ` Jens Axboe
@ 2024-03-17 22:24                         ` Jens Axboe
  2024-03-18  0:15                           ` Ming Lei
  0 siblings, 1 reply; 54+ messages in thread
From: Jens Axboe @ 2024-03-17 22:24 UTC (permalink / raw)
  To: Pavel Begunkov, Ming Lei; +Cc: io-uring, linux-block, Kanchan Joshi

On 3/17/24 4:07 PM, Jens Axboe wrote:
> On 3/17/24 3:51 PM, Jens Axboe wrote:
>> On 3/17/24 3:47 PM, Pavel Begunkov wrote:
>>> On 3/17/24 21:34, Pavel Begunkov wrote:
>>>> On 3/17/24 21:32, Jens Axboe wrote:
>>>>> On 3/17/24 3:29 PM, Pavel Begunkov wrote:
>>>>>> On 3/17/24 21:24, Jens Axboe wrote:
>>>>>>> On 3/17/24 2:55 PM, Pavel Begunkov wrote:
>>>>>>>> On 3/16/24 13:56, Ming Lei wrote:
>>>>>>>>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
>>>>>>>>>> On 3/16/24 11:52, Ming Lei wrote:
>>>>>>>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>>>>>>>>
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>>>> The following two error can be triggered with this patchset
>>>>>>>>>>> when running some ublk stress test(io vs. deletion). And not see
>>>>>>>>>>> such failures after reverting the 11 patches.
>>>>>>>>>>
>>>>>>>>>> I suppose it's with the fix from yesterday. How can I
>>>>>>>>>> reproduce it, blktests?
>>>>>>>>>
>>>>>>>>> Yeah, it needs yesterday's fix.
>>>>>>>>>
>>>>>>>>> You may need to run this test multiple times for triggering the problem:
>>>>>>>>
>>>>>>>> Thanks for all the testing. I've tried it, all ublk/generic tests hang
>>>>>>>> in userspace waiting for CQEs but no complaints from the kernel.
>>>>>>>> However, it seems the branch is buggy even without my patches, I
>>>>>>>> consistently (5-15 minutes of running in a slow VM) hit page underflow
>>>>>>>> by running liburing tests. Not sure what is that yet, but might also
>>>>>>>> be the reason.
>>>>>>>
>>>>>>> Hmm odd, there's nothing in there but your series and then the
>>>>>>> io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
>>>>>>> merge window -git cycle? Does it happen with io_uring-6.9 as well? I
>>>>>>> haven't seen anything odd.
>>>>>>
>>>>>> Need to test io_uring-6.9. I actually checked the branch twice, both
>>>>>> with the issue, and by full recompilation and config prompts I assumed
>>>>>> you pulled something in between (maybe not).
>>>>>>
>>>>>> And yeah, I can't confirm it's specifically an io_uring bug, the
>>>>>> stack trace is usually some unmap or task exit, sometimes it only
>>>>>> shows when you try to shutdown the VM after tests.
>>>>>
>>>>> Funky. I just ran a bunch of loops of liburing tests and Ming's ublksrv
>>>>> test case as well on io_uring-6.9 and it all worked fine. Trying
>>>>> liburing tests on for-6.10/io_uring as well now, but didn't see anything
>>>>> the other times I ran it. In any case, once you repost I'll rebase and
>>>>> then let's see if it hits again.
>>>>>
>>>>> Did you run with KASAN enabled
>>>>
>>>> Yes, it's a debug kernel, full on KASANs, lockdeps and so
>>>
>>> And another note, I triggered it once (IIRC on shutdown) with ublk
>>> tests only w/o liburing/tests, likely limits it to either the core
>>> io_uring infra or non-io_uring bugs.
>>
>> Been running on for-6.10/io_uring, and the only odd thing I see is that
>> the test output tends to stall here:
>>
>> Running test read-before-exit.t
>>
>> which then either leads to a connection disconnect from my ssh into that
>> vm, or just a long delay and then it picks up again. This did not happen
>> with io_uring-6.9.
>>
>> Maybe related? At least it's something new. Just checked again, and yeah
>> it seems to totally lock up the vm while that is running. Will try a
>> quick bisect of that series.
> 
> Seems to be triggered by the top of branch patch in there, my poll and
> timeout special casing. While the above test case runs with that commit,
> it'll freeze the host.

Had a feeling this was the busy looping off cancelations, and flushing
the fallback task_work seems to fix it. I'll check more tomorrow.


diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a2cb8da3cc33..f1d3c5e065e9 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3242,6 +3242,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 	ret |= io_kill_timeouts(ctx, task, cancel_all);
 	if (task)
 		ret |= io_run_task_work() > 0;
+	else if (ret)
+		flush_delayed_work(&ctx->fallback_work);
 	return ret;
 }
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-17 21:51                     ` Jens Axboe
  2024-03-17 22:07                       ` Jens Axboe
@ 2024-03-17 23:16                       ` Pavel Begunkov
  1 sibling, 0 replies; 54+ messages in thread
From: Pavel Begunkov @ 2024-03-17 23:16 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei; +Cc: io-uring, linux-block, Kanchan Joshi

On 3/17/24 21:51, Jens Axboe wrote:
> On 3/17/24 3:47 PM, Pavel Begunkov wrote:
>> On 3/17/24 21:34, Pavel Begunkov wrote:
>>> On 3/17/24 21:32, Jens Axboe wrote:
>>>> On 3/17/24 3:29 PM, Pavel Begunkov wrote:
>>>>> On 3/17/24 21:24, Jens Axboe wrote:
>>>>>> On 3/17/24 2:55 PM, Pavel Begunkov wrote:
>>>>>>> On 3/16/24 13:56, Ming Lei wrote:
>>>>>>>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
>>>>>>>>> On 3/16/24 11:52, Ming Lei wrote:
>>>>>>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>>>>>>>
>>>>>>>> ...
>>>>>>>>
>>>>>>>>>> The following two error can be triggered with this patchset
>>>>>>>>>> when running some ublk stress test(io vs. deletion). And not see
>>>>>>>>>> such failures after reverting the 11 patches.
>>>>>>>>>
>>>>>>>>> I suppose it's with the fix from yesterday. How can I
>>>>>>>>> reproduce it, blktests?
>>>>>>>>
>>>>>>>> Yeah, it needs yesterday's fix.
>>>>>>>>
>>>>>>>> You may need to run this test multiple times for triggering the problem:
>>>>>>>
>>>>>>> Thanks for all the testing. I've tried it, all ublk/generic tests hang
>>>>>>> in userspace waiting for CQEs but no complaints from the kernel.
>>>>>>> However, it seems the branch is buggy even without my patches, I
>>>>>>> consistently (5-15 minutes of running in a slow VM) hit page underflow
>>>>>>> by running liburing tests. Not sure what is that yet, but might also
>>>>>>> be the reason.
>>>>>>
>>>>>> Hmm odd, there's nothing in there but your series and then the
>>>>>> io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
>>>>>> merge window -git cycle? Does it happen with io_uring-6.9 as well? I
>>>>>> haven't seen anything odd.
>>>>>
>>>>> Need to test io_uring-6.9. I actually checked the branch twice, both
>>>>> with the issue, and by full recompilation and config prompts I assumed
>>>>> you pulled something in between (maybe not).
>>>>>
>>>>> And yeah, I can't confirm it's specifically an io_uring bug, the
>>>>> stack trace is usually some unmap or task exit, sometimes it only
>>>>> shows when you try to shutdown the VM after tests.
>>>>
>>>> Funky. I just ran a bunch of loops of liburing tests and Ming's ublksrv
>>>> test case as well on io_uring-6.9 and it all worked fine. Trying
>>>> liburing tests on for-6.10/io_uring as well now, but didn't see anything
>>>> the other times I ran it. In any case, once you repost I'll rebase and
>>>> then let's see if it hits again.
>>>>
>>>> Did you run with KASAN enabled
>>>
>>> Yes, it's a debug kernel, full on KASANs, lockdeps and so
>>
>> And another note, I triggered it once (IIRC on shutdown) with ublk
>> tests only w/o liburing/tests, likely limits it to either the core
>> io_uring infra or non-io_uring bugs.
> 
> Been running on for-6.10/io_uring, and the only odd thing I see is that
> the test output tends to stall here:

I can't trigger it with io_uring-6.9, which makes sense because the
patchset was done on top of it and tested w/o problems.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-17 22:24                         ` Jens Axboe
@ 2024-03-18  0:15                           ` Ming Lei
  2024-03-18  1:34                             ` Jens Axboe
  0 siblings, 1 reply; 54+ messages in thread
From: Ming Lei @ 2024-03-18  0:15 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Pavel Begunkov, io-uring, linux-block, Kanchan Joshi

On Sun, Mar 17, 2024 at 04:24:07PM -0600, Jens Axboe wrote:
> On 3/17/24 4:07 PM, Jens Axboe wrote:
> > On 3/17/24 3:51 PM, Jens Axboe wrote:
> >> On 3/17/24 3:47 PM, Pavel Begunkov wrote:
> >>> On 3/17/24 21:34, Pavel Begunkov wrote:
> >>>> On 3/17/24 21:32, Jens Axboe wrote:
> >>>>> On 3/17/24 3:29 PM, Pavel Begunkov wrote:
> >>>>>> On 3/17/24 21:24, Jens Axboe wrote:
> >>>>>>> On 3/17/24 2:55 PM, Pavel Begunkov wrote:
> >>>>>>>> On 3/16/24 13:56, Ming Lei wrote:
> >>>>>>>>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
> >>>>>>>>>> On 3/16/24 11:52, Ming Lei wrote:
> >>>>>>>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
> >>>>>>>>>
> >>>>>>>>> ...
> >>>>>>>>>
> >>>>>>>>>>> The following two errors can be triggered with this patchset
> >>>>>>>>>>> when running some ublk stress test (io vs. deletion). I did not
> >>>>>>>>>>> see such failures after reverting the 11 patches.
> >>>>>>>>>>
> >>>>>>>>>> I suppose it's with the fix from yesterday. How can I
> >>>>>>>>>> reproduce it, blktests?
> >>>>>>>>>
> >>>>>>>>> Yeah, it needs yesterday's fix.
> >>>>>>>>>
> >>>>>>>>> You may need to run this test multiple times for triggering the problem:
> >>>>>>>>
> >>>>>>>> Thanks for all the testing. I've tried it, all ublk/generic tests hang
> >>>>>>>> in userspace waiting for CQEs but no complaints from the kernel.
> >>>>>>>> However, it seems the branch is buggy even without my patches; I
> >>>>>>>> consistently (after 5-15 minutes of running in a slow VM) hit a page
> >>>>>>>> underflow by running the liburing tests. Not sure what that is yet,
> >>>>>>>> but it might also be the reason.
> >>>>>>>
> >>>>>>> Hmm odd, there's nothing in there but your series and then the
> >>>>>>> io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
> >>>>>>> merge window -git cycle? Does it happen with io_uring-6.9 as well? I
> >>>>>>> haven't seen anything odd.
> >>>>>>
> >>>>>> I need to test io_uring-6.9. I actually checked the branch twice, both
> >>>>>> times with the issue, and from the full recompilation and config prompts
> >>>>>> I assumed you had pulled something in between (maybe not).
> >>>>>>
> >>>>>> And yeah, I can't confirm it's specifically an io_uring bug; the stack
> >>>>>> trace is usually some unmap or task exit, and sometimes it only shows
> >>>>>> when you try to shut down the VM after the tests.
> >>>>>
> >>>>> Funky. I just ran a bunch of loops of liburing tests and Ming's ublksrv
> >>>>> test case as well on io_uring-6.9 and it all worked fine. Trying
> >>>>> liburing tests on for-6.10/io_uring as well now, but didn't see anything
> >>>>> the other times I ran it. In any case, once you repost I'll rebase and
> >>>>> then let's see if it hits again.
> >>>>>
> >>>>> Did you run with KASAN enabled
> >>>>
> >>>> Yes, it's a debug kernel, with full-on KASAN, lockdep and so on.
> >>>
> >>> And another note, I triggered it once (IIRC on shutdown) with the ublk
> >>> tests only, w/o liburing/tests, which likely limits it to either the
> >>> core io_uring infra or non-io_uring bugs.
> >>
> >> Been running on for-6.10/io_uring, and the only odd thing I see is that
> >> the test output tends to stall here:
> >>
> >> Running test read-before-exit.t
> >>
> >> which then either leads to a connection disconnect from my ssh into that
> >> vm, or just a long delay and then it picks up again. This did not happen
> >> with io_uring-6.9.
> >>
> >> Maybe related? At least it's something new. Just checked again, and yeah
> >> it seems to totally lock up the vm while that is running. Will try a
> >> quick bisect of that series.
> > 
> > Seems to be triggered by the top of branch patch in there, my poll and
> > timeout special casing. While the above test case runs with that commit,
> > it'll freeze the host.
> 
> Had a feeling this was the busy looping off cancelations, and flushing
> the fallback task_work seems to fix it. I'll check more tomorrow.
> 
> 
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index a2cb8da3cc33..f1d3c5e065e9 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -3242,6 +3242,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
>  	ret |= io_kill_timeouts(ctx, task, cancel_all);
>  	if (task)
>  		ret |= io_run_task_work() > 0;
> +	else if (ret)
> +		flush_delayed_work(&ctx->fallback_work);
>  	return ret;
>  }

I can still trigger the warning with the above patch:

[  446.275975] ------------[ cut here ]------------
[  446.276340] WARNING: CPU: 8 PID: 731 at kernel/fork.c:969 __put_task_struct+0x10c/0x180
[  446.276931] Modules linked in: isofs binfmt_misc xfs vfat fat raid0 iTCO_wdt intel_pmc_bxt iTCO_vendor_support virtio_net i2c_i801 net_fag
[  446.278608] CPU: 8 PID: 731 Comm: kworker/8:2 Not tainted 6.8.0_io_uring_6.10+ #20
[  446.279535] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[  446.280440] Workqueue: events io_fallback_req_func
[  446.280971] RIP: 0010:__put_task_struct+0x10c/0x180
[  446.281485] Code: 48 85 d2 74 05 f0 ff 0a 74 44 48 8b 3d b5 83 c7 02 48 89 ee e8 a5 f6 2e 00 eb ac be 03 00 00 00 48 89 ef e8 26 9f 72 002
[  446.282727] RSP: 0018:ffffb325c06bfdf8 EFLAGS: 00010246
[  446.283099] RAX: 0000000000000000 RBX: ffff92717cabaf40 RCX: 0000000000000000
[  446.283578] RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff92717cabaf40
[  446.284062] RBP: ffff92710cab4800 R08: 0000000000000000 R09: 0000000000000000
[  446.284545] R10: ffffb325c06bfdb0 R11: 0000000000000100 R12: ffff92717aedc580
[  446.285233] R13: ffff92717aedc580 R14: ffff927151ee5a00 R15: 0000000000000000
[  446.285840] FS:  0000000000000000(0000) GS:ffff927667c00000(0000) knlGS:0000000000000000
[  446.286412] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  446.286819] CR2: 00007feacac32000 CR3: 0000000618020002 CR4: 0000000000770ef0
[  446.287310] PKRU: 55555554
[  446.287534] Call Trace:
[  446.287752]  <TASK>
[  446.287941]  ? __warn+0x80/0x120
[  446.288206]  ? __put_task_struct+0x10c/0x180
[  446.288524]  ? report_bug+0x164/0x190
[  446.288816]  ? handle_bug+0x41/0x70
[  446.289098]  ? exc_invalid_op+0x17/0x70
[  446.289392]  ? asm_exc_invalid_op+0x1a/0x20
[  446.289715]  ? __put_task_struct+0x10c/0x180
[  446.290038]  ? io_put_task_remote+0x80/0x90
[  446.290372]  __io_submit_flush_completions+0x2d6/0x390
[  446.290761]  io_fallback_req_func+0xad/0x140
[  446.291088]  process_one_work+0x189/0x3b0
[  446.291403]  worker_thread+0x277/0x390
[  446.291700]  ? __pfx_worker_thread+0x10/0x10
[  446.292018]  kthread+0xcf/0x100
[  446.292278]  ? __pfx_kthread+0x10/0x10
[  446.292562]  ret_from_fork+0x31/0x50
[  446.292848]  ? __pfx_kthread+0x10/0x10
[  446.293143]  ret_from_fork_asm+0x1a/0x30
[  446.293576]  </TASK>
[  446.293919] ---[ end trace 0000000000000000 ]---
[  446.294460] ------------[ cut here ]------------
[  446.294808] WARNING: CPU: 8 PID: 731 at kernel/fork.c:601 free_task+0x61/0x70
[  446.295294] Modules linked in: isofs binfmt_misc xfs vfat fat raid0 iTCO_wdt intel_pmc_bxt iTCO_vendor_support virtio_net i2c_i801 net_fag
[  446.296901] CPU: 8 PID: 731 Comm: kworker/8:2 Tainted: G        W          6.8.0_io_uring_6.10+ #20
[  446.297521] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[  446.298110] Workqueue: events io_fallback_req_func
[  446.298455] RIP: 0010:free_task+0x61/0x70
[  446.298756] Code: f3 ff f6 43 2e 20 75 18 48 89 df e8 49 7b 20 00 48 8b 3d e2 84 c7 02 48 89 de 5b e9 c9 f7 2e 00 48 89 df e8 61 70 03 000
[  446.303360] RSP: 0018:ffffb325c06bfe00 EFLAGS: 00010202
[  446.303745] RAX: 0000000000000001 RBX: ffff92717cabaf40 RCX: 0000000009a40008
[  446.304226] RDX: 0000000009a3e008 RSI: 000000000003b060 RDI: 6810a90e7192ffff
[  446.304763] RBP: ffff92710cab4800 R08: 0000000000000000 R09: 00000000820001df
[  446.305288] R10: 00000000820001df R11: 000000000000000d R12: ffff92717aedc580
[  446.305769] R13: ffff92717aedc580 R14: ffff927151ee5a00 R15: 0000000000000000
[  446.306251] FS:  0000000000000000(0000) GS:ffff927667c00000(0000) knlGS:0000000000000000
[  446.306815] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  446.307220] CR2: 00007feacac32000 CR3: 0000000618020002 CR4: 0000000000770ef0
[  446.307702] PKRU: 55555554
[  446.307924] Call Trace:
[  446.308135]  <TASK>
[  446.308322]  ? __warn+0x80/0x120
[  446.308637]  ? free_task+0x61/0x70
[  446.308926]  ? report_bug+0x164/0x190
[  446.309207]  ? handle_bug+0x41/0x70
[  446.309474]  ? exc_invalid_op+0x17/0x70
[  446.309767]  ? asm_exc_invalid_op+0x1a/0x20
[  446.310076]  ? free_task+0x61/0x70
[  446.310340]  __io_submit_flush_completions+0x2d6/0x390
[  446.310711]  io_fallback_req_func+0xad/0x140
[  446.311067]  process_one_work+0x189/0x3b0
[  446.311492]  worker_thread+0x277/0x390
[  446.311881]  ? __pfx_worker_thread+0x10/0x10
[  446.312205]  kthread+0xcf/0x100
[  446.312457]  ? __pfx_kthread+0x10/0x10
[  446.312750]  ret_from_fork+0x31/0x50
[  446.313028]  ? __pfx_kthread+0x10/0x10
[  446.313320]  ret_from_fork_asm+0x1a/0x30
[  446.313616]  </TASK>
[  446.313812] ---[ end trace 0000000000000000 ]---
[  446.314171] BUG: kernel NULL pointer dereference, address: 0000000000000098
[  446.314184] ------------[ cut here ]------------
[  446.314495] #PF: supervisor read access in kernel mode
[  446.314747] kernel BUG at mm/slub.c:553!
[  446.314986] #PF: error_code(0x0000) - not-present page
[  446.316032] PGD 0 P4D 0
[  446.316253] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  446.316573] CPU: 8 PID: 9914 Comm: ublk Tainted: G        W          6.8.0_io_uring_6.10+ #20
[  446.317167] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[  446.317763] RIP: 0010:release_task+0x3b/0x560
[  446.318089] Code: 55 53 48 89 fb 48 83 ec 28 65 48 8b 04 25 28 00 00 00 48 89 44 24 20 31 c0 e8 d1 17 0b 00 e8 cc 17 0b 00 48 8b 83 b8 0bf
[  446.319301] RSP: 0018:ffffb325cdb2bca8 EFLAGS: 00010202
[  446.319672] RAX: 0000000000000000 RBX: ffff92717cabaf40 RCX: ffffb325cdb2bd18
[  446.320151] RDX: ffff92717cabaf40 RSI: ffff92717cabb948 RDI: ffff92717cabaf40
[  446.320628] RBP: ffffb325cdb2bd18 R08: ffffb325cdb2bd18 R09: 0000000000000000
[  446.321122] R10: 0000000000000001 R11: 0000000000000100 R12: ffffb325cdb2bd18
[  446.321706] R13: ffffb325cdb2b310 R14: ffffb325cdb2b310 R15: 0000000000000000
[  446.322188] FS:  0000000000000000(0000) GS:ffff927667c00000(0000) knlGS:0000000000000000
[  446.322742] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  446.323169] CR2: 0000000000000098 CR3: 0000000618020002 CR4: 0000000000770ef0
[  446.324013] PKRU: 55555554
[  446.324267] Call Trace:
[  446.324466]  <TASK>
[  446.324646]  ? __die+0x23/0x70
[  446.324881]  ? page_fault_oops+0x173/0x4f0
[  446.325175]  ? _raw_spin_unlock_irq+0xe/0x30
[  446.325475]  ? exc_page_fault+0x76/0x170
[  446.325752]  ? asm_exc_page_fault+0x26/0x30
[  446.326048]  ? release_task+0x3b/0x560
[  446.326321]  ? release_task+0x34/0x560
[  446.326589]  do_exit+0x6fd/0xad0
[  446.326832]  do_group_exit+0x30/0x80
[  446.327100]  get_signal+0x8de/0x8e0
[  446.327355]  arch_do_signal_or_restart+0x3e/0x240
[  446.327680]  syscall_exit_to_user_mode+0x167/0x210
[  446.328008]  do_syscall_64+0x96/0x170
[  446.328361]  ? syscall_exit_to_user_mode+0x60/0x210
[  446.328879]  ? do_syscall_64+0x96/0x170
[  446.329173]  entry_SYSCALL_64_after_hwframe+0x6c/0x74
[  446.329524] RIP: 0033:0x7feadc57e445
[  446.329796] Code: Unable to access opcode bytes at 0x7feadc57e41b.
[  446.330208] RSP: 002b:00007feadb3ffd38 EFLAGS: 00000202 ORIG_RAX: 00000000000001aa
[  446.330717] RAX: 0000000000000078 RBX: 0000000000000000 RCX: 00007feadc57e445
[  446.331184] RDX: 0000000000000001 RSI: 0000000000000078 RDI: 0000000000000000
[  446.331642] RBP: 00007feacc002ff8 R08: 00007feadb3ffd70 R09: 0000000000000018
[  446.332107] R10: 0000000000000019 R11: 0000000000000202 R12: 00007feadb3ffd90
[  446.332564] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000001aa
[  446.333022]  </TASK>
[  446.333208] Modules linked in: isofs binfmt_misc xfs vfat fat raid0 iTCO_wdt intel_pmc_bxt iTCO_vendor_support virtio_net i2c_i801 net_fag
[  446.334744] CR2: 0000000000000098
[  446.335008] ---[ end trace 0000000000000000 ]---



Thanks, 
Ming


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-18  0:15                           ` Ming Lei
@ 2024-03-18  1:34                             ` Jens Axboe
  2024-03-18  1:44                               ` Jens Axboe
  2024-03-18  1:49                               ` Ming Lei
  0 siblings, 2 replies; 54+ messages in thread
From: Jens Axboe @ 2024-03-18  1:34 UTC (permalink / raw)
  To: Ming Lei; +Cc: Pavel Begunkov, io-uring, linux-block, Kanchan Joshi

On 3/17/24 6:15 PM, Ming Lei wrote:
> On Sun, Mar 17, 2024 at 04:24:07PM -0600, Jens Axboe wrote:
>> On 3/17/24 4:07 PM, Jens Axboe wrote:
>>> On 3/17/24 3:51 PM, Jens Axboe wrote:
>>>> On 3/17/24 3:47 PM, Pavel Begunkov wrote:
>>>>> On 3/17/24 21:34, Pavel Begunkov wrote:
>>>>>> On 3/17/24 21:32, Jens Axboe wrote:
>>>>>>> On 3/17/24 3:29 PM, Pavel Begunkov wrote:
>>>>>>>> On 3/17/24 21:24, Jens Axboe wrote:
>>>>>>>>> On 3/17/24 2:55 PM, Pavel Begunkov wrote:
>>>>>>>>>> On 3/16/24 13:56, Ming Lei wrote:
>>>>>>>>>>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
>>>>>>>>>>>> On 3/16/24 11:52, Ming Lei wrote:
>>>>>>>>>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>>>> The following two errors can be triggered with this patchset
>>>>>>>>>>>>> when running some ublk stress test (io vs. deletion). I did not
>>>>>>>>>>>>> see such failures after reverting the 11 patches.
>>>>>>>>>>>>
>>>>>>>>>>>> I suppose it's with the fix from yesterday. How can I
>>>>>>>>>>>> reproduce it, blktests?
>>>>>>>>>>>
>>>>>>>>>>> Yeah, it needs yesterday's fix.
>>>>>>>>>>>
>>>>>>>>>>> You may need to run this test multiple times for triggering the problem:
>>>>>>>>>>
>>>>>>>>>> Thanks for all the testing. I've tried it, all ublk/generic tests hang
>>>>>>>>>> in userspace waiting for CQEs but no complaints from the kernel.
>>>>>>>>>> However, it seems the branch is buggy even without my patches; I
>>>>>>>>>> consistently (after 5-15 minutes of running in a slow VM) hit a page
>>>>>>>>>> underflow by running the liburing tests. Not sure what that is yet,
>>>>>>>>>> but it might also be the reason.
>>>>>>>>>
>>>>>>>>> Hmm odd, there's nothing in there but your series and then the
>>>>>>>>> io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
>>>>>>>>> merge window -git cycle? Does it happen with io_uring-6.9 as well? I
>>>>>>>>> haven't seen anything odd.
>>>>>>>>
>>>>>>>> I need to test io_uring-6.9. I actually checked the branch twice, both
>>>>>>>> times with the issue, and from the full recompilation and config prompts
>>>>>>>> I assumed you had pulled something in between (maybe not).
>>>>>>>>
>>>>>>>> And yeah, I can't confirm it's specifically an io_uring bug; the stack
>>>>>>>> trace is usually some unmap or task exit, and sometimes it only shows
>>>>>>>> when you try to shut down the VM after the tests.
>>>>>>>
>>>>>>> Funky. I just ran a bunch of loops of liburing tests and Ming's ublksrv
>>>>>>> test case as well on io_uring-6.9 and it all worked fine. Trying
>>>>>>> liburing tests on for-6.10/io_uring as well now, but didn't see anything
>>>>>>> the other times I ran it. In any case, once you repost I'll rebase and
>>>>>>> then let's see if it hits again.
>>>>>>>
>>>>>>> Did you run with KASAN enabled
>>>>>>
>>>>>> Yes, it's a debug kernel, with full-on KASAN, lockdep and so on.
>>>>>
>>>>> And another note, I triggered it once (IIRC on shutdown) with the ublk
>>>>> tests only, w/o liburing/tests, which likely limits it to either the
>>>>> core io_uring infra or non-io_uring bugs.
>>>>
>>>> Been running on for-6.10/io_uring, and the only odd thing I see is that
>>>> the test output tends to stall here:
>>>>
>>>> Running test read-before-exit.t
>>>>
>>>> which then either leads to a connection disconnect from my ssh into that
>>>> vm, or just a long delay and then it picks up again. This did not happen
>>>> with io_uring-6.9.
>>>>
>>>> Maybe related? At least it's something new. Just checked again, and yeah
>>>> it seems to totally lock up the vm while that is running. Will try a
>>>> quick bisect of that series.
>>>
>>> Seems to be triggered by the top of branch patch in there, my poll and
>>> timeout special casing. While the above test case runs with that commit,
>>> it'll freeze the host.
>>
>> Had a feeling this was the busy looping off cancelations, and flushing
>> the fallback task_work seems to fix it. I'll check more tomorrow.
>>
>>
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index a2cb8da3cc33..f1d3c5e065e9 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -3242,6 +3242,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
>>  	ret |= io_kill_timeouts(ctx, task, cancel_all);
>>  	if (task)
>>  		ret |= io_run_task_work() > 0;
>> +	else if (ret)
>> +		flush_delayed_work(&ctx->fallback_work);
>>  	return ret;
>>  }
> 
> Still can trigger the warning with above patch:
> 
> [  446.275975] ------------[ cut here ]------------
> [  446.276340] WARNING: CPU: 8 PID: 731 at kernel/fork.c:969 __put_task_struct+0x10c/0x180

And this is running that test case you referenced? I'll take a look, as
it seems related to the poll kill rather than the other patchset.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-18  1:34                             ` Jens Axboe
@ 2024-03-18  1:44                               ` Jens Axboe
  2024-03-18  1:49                               ` Ming Lei
  1 sibling, 0 replies; 54+ messages in thread
From: Jens Axboe @ 2024-03-18  1:44 UTC (permalink / raw)
  To: Ming Lei; +Cc: Pavel Begunkov, io-uring, linux-block, Kanchan Joshi

On 3/17/24 7:34 PM, Jens Axboe wrote:
>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>> index a2cb8da3cc33..f1d3c5e065e9 100644
>>> --- a/io_uring/io_uring.c
>>> +++ b/io_uring/io_uring.c
>>> @@ -3242,6 +3242,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
>>>  	ret |= io_kill_timeouts(ctx, task, cancel_all);
>>>  	if (task)
>>>  		ret |= io_run_task_work() > 0;
>>> +	else if (ret)
>>> +		flush_delayed_work(&ctx->fallback_work);
>>>  	return ret;
>>>  }
>>
>> Still can trigger the warning with above patch:
>>
>> [  446.275975] ------------[ cut here ]------------
>> [  446.276340] WARNING: CPU: 8 PID: 731 at kernel/fork.c:969 __put_task_struct+0x10c/0x180
> 
> And this is running that test case you referenced? I'll take a look, as
> it seems related to the poll kill rather than the other patchset.

I can reproduce with that test too, and it triggers with v2 of the
patchset on top of io_uring-6.9, with and without that poll patch. Which
makes sense, as I doubt you're doing any poll or timeouts in there.

Pavel, passing that one to you then :)
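
For clarity, the comparison above amounts to roughly the following; the
branch name and series path are placeholders rather than the exact refs used:

  # apply v2 of the series on top of io_uring-6.9
  git checkout -b test-v2 io_uring-6.9
  git am /path/to/v2-remove-aux-cqe-caches/*.patch
  # build, boot, run the ublk stress test -> warning triggers
  # then add the poll/timeout special-casing patch on top and repeat
  # -> it still triggers, so that patch is not the culprit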

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: (subset) [PATCH 00/11] remove aux CQE caches
  2024-03-18  1:34                             ` Jens Axboe
  2024-03-18  1:44                               ` Jens Axboe
@ 2024-03-18  1:49                               ` Ming Lei
  1 sibling, 0 replies; 54+ messages in thread
From: Ming Lei @ 2024-03-18  1:49 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Pavel Begunkov, io-uring, linux-block, Kanchan Joshi

On Sun, Mar 17, 2024 at 07:34:30PM -0600, Jens Axboe wrote:
> On 3/17/24 6:15 PM, Ming Lei wrote:
> > On Sun, Mar 17, 2024 at 04:24:07PM -0600, Jens Axboe wrote:
> >> On 3/17/24 4:07 PM, Jens Axboe wrote:
> >>> On 3/17/24 3:51 PM, Jens Axboe wrote:
> >>>> On 3/17/24 3:47 PM, Pavel Begunkov wrote:
> >>>>> On 3/17/24 21:34, Pavel Begunkov wrote:
> >>>>>> On 3/17/24 21:32, Jens Axboe wrote:
> >>>>>>> On 3/17/24 3:29 PM, Pavel Begunkov wrote:
> >>>>>>>> On 3/17/24 21:24, Jens Axboe wrote:
> >>>>>>>>> On 3/17/24 2:55 PM, Pavel Begunkov wrote:
> >>>>>>>>>> On 3/16/24 13:56, Ming Lei wrote:
> >>>>>>>>>>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
> >>>>>>>>>>>> On 3/16/24 11:52, Ming Lei wrote:
> >>>>>>>>>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> ...
> >>>>>>>>>>>
> >>>>>>>>>>>>> The following two errors can be triggered with this patchset
> >>>>>>>>>>>>> when running some ublk stress test (io vs. deletion). I did not
> >>>>>>>>>>>>> see such failures after reverting the 11 patches.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I suppose it's with the fix from yesterday. How can I
> >>>>>>>>>>>> reproduce it, blktests?
> >>>>>>>>>>>
> >>>>>>>>>>> Yeah, it needs yesterday's fix.
> >>>>>>>>>>>
> >>>>>>>>>>> You may need to run this test multiple times for triggering the problem:
> >>>>>>>>>>
> >>>>>>>>>> Thanks for all the testing. I've tried it, all ublk/generic tests hang
> >>>>>>>>>> in userspace waiting for CQEs but no complaints from the kernel.
> >>>>>>>>>> However, it seems the branch is buggy even without my patches; I
> >>>>>>>>>> consistently (after 5-15 minutes of running in a slow VM) hit a page
> >>>>>>>>>> underflow by running the liburing tests. Not sure what that is yet,
> >>>>>>>>>> but it might also be the reason.
> >>>>>>>>>
> >>>>>>>>> Hmm odd, there's nothing in there but your series and then the
> >>>>>>>>> io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
> >>>>>>>>> merge window -git cycle? Does it happen with io_uring-6.9 as well? I
> >>>>>>>>> haven't seen anything odd.
> >>>>>>>>
> >>>>>>>> I need to test io_uring-6.9. I actually checked the branch twice, both
> >>>>>>>> times with the issue, and from the full recompilation and config prompts
> >>>>>>>> I assumed you had pulled something in between (maybe not).
> >>>>>>>>
> >>>>>>>> And yeah, I can't confirm it's specifically an io_uring bug; the stack
> >>>>>>>> trace is usually some unmap or task exit, and sometimes it only shows
> >>>>>>>> when you try to shut down the VM after the tests.
> >>>>>>>
> >>>>>>> Funky. I just ran a bunch of loops of liburing tests and Ming's ublksrv
> >>>>>>> test case as well on io_uring-6.9 and it all worked fine. Trying
> >>>>>>> liburing tests on for-6.10/io_uring as well now, but didn't see anything
> >>>>>>> the other times I ran it. In any case, once you repost I'll rebase and
> >>>>>>> then let's see if it hits again.
> >>>>>>>
> >>>>>>> Did you run with KASAN enabled
> >>>>>>
> >>>>>> Yes, it's a debug kernel, with full-on KASAN, lockdep and so on.
> >>>>>
> >>>>> And another note, I triggered it once (IIRC on shutdown) with the ublk
> >>>>> tests only, w/o liburing/tests, which likely limits it to either the
> >>>>> core io_uring infra or non-io_uring bugs.
> >>>>
> >>>> Been running on for-6.10/io_uring, and the only odd thing I see is that
> >>>> the test output tends to stall here:
> >>>>
> >>>> Running test read-before-exit.t
> >>>>
> >>>> which then either leads to a connection disconnect from my ssh into that
> >>>> vm, or just a long delay and then it picks up again. This did not happen
> >>>> with io_uring-6.9.
> >>>>
> >>>> Maybe related? At least it's something new. Just checked again, and yeah
> >>>> it seems to totally lock up the vm while that is running. Will try a
> >>>> quick bisect of that series.
> >>>
> >>> Seems to be triggered by the top of branch patch in there, my poll and
> >>> timeout special casing. While the above test case runs with that commit,
> >>> it'll freeze the host.
> >>
> >> Had a feeling this was the busy looping off cancelations, and flushing
> >> the fallback task_work seems to fix it. I'll check more tomorrow.
> >>
> >>
> >> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> >> index a2cb8da3cc33..f1d3c5e065e9 100644
> >> --- a/io_uring/io_uring.c
> >> +++ b/io_uring/io_uring.c
> >> @@ -3242,6 +3242,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
> >>  	ret |= io_kill_timeouts(ctx, task, cancel_all);
> >>  	if (task)
> >>  		ret |= io_run_task_work() > 0;
> >> +	else if (ret)
> >> +		flush_delayed_work(&ctx->fallback_work);
> >>  	return ret;
> >>  }
> > 
> > Still can trigger the warning with above patch:
> > 
> > [  446.275975] ------------[ cut here ]------------
> > [  446.276340] WARNING: CPU: 8 PID: 731 at kernel/fork.c:969 __put_task_struct+0x10c/0x180
> 
> And this is running that test case you referenced? I'll take a look, as
> it seems related to the poll kill rather than the other patchset.

Yeah, and now I am running 'git bisect' on Pavel's V2.
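
Roughly along these lines; the good/bad endpoints are assumptions about how
the v2 series was applied locally:

  # bisect the applied v2 series between the clean base and the failing tip
  git bisect start
  git bisect bad HEAD             # tip with all of the v2 patches: reproduces
  git bisect good io_uring-6.9    # base the series was applied on: clean
  # at each step: rebuild, rerun the ublk stress test, then mark it
  git bisect good                 # or: git bisect bad
  git bisect reset                # when finished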

thanks,
Ming


^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2024-03-18  1:49 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-15 15:29 [PATCH 00/11] remove aux CQE caches Pavel Begunkov
2024-03-15 15:29 ` [PATCH 01/11] io_uring: fix poll_remove stalled req completion Pavel Begunkov
2024-03-15 15:29 ` [PATCH 02/11] io_uring/cmd: kill one issue_flags to tw conversion Pavel Begunkov
2024-03-15 15:29 ` [PATCH 03/11] io_uring/cmd: fix tw <-> issue_flags conversion Pavel Begunkov
2024-03-15 15:29 ` [PATCH 04/11] io_uring/cmd: introduce io_uring_cmd_complete Pavel Begunkov
2024-03-15 15:29 ` [PATCH 05/11] ublk: don't hard code IO_URING_F_UNLOCKED Pavel Begunkov
2024-03-15 15:29 ` [PATCH 06/11] nvme/io_uring: " Pavel Begunkov
2024-03-15 15:29 ` [PATCH 07/11] io_uring/rw: avoid punting to io-wq directly Pavel Begunkov
2024-03-15 15:29 ` [PATCH 08/11] io_uring: force tw ctx locking Pavel Begunkov
2024-03-15 15:40   ` Jens Axboe
2024-03-15 16:14     ` Pavel Begunkov
2024-03-15 15:29 ` [PATCH 09/11] io_uring: remove struct io_tw_state::locked Pavel Begunkov
2024-03-15 15:30 ` [PATCH 10/11] io_uring: refactor io_fill_cqe_req_aux Pavel Begunkov
2024-03-15 15:30 ` [PATCH 11/11] io_uring: get rid of intermediate aux cqe caches Pavel Begunkov
2024-03-15 16:20   ` Jens Axboe
2024-03-15 16:23     ` Pavel Begunkov
2024-03-15 16:25       ` Jens Axboe
2024-03-15 16:27         ` Jens Axboe
2024-03-15 16:44           ` Pavel Begunkov
2024-03-15 16:49             ` Jens Axboe
2024-03-15 17:26               ` Pavel Begunkov
2024-03-15 18:26                 ` Jens Axboe
2024-03-15 18:51                   ` Pavel Begunkov
2024-03-15 19:02                     ` Jens Axboe
2024-03-15 16:29         ` Pavel Begunkov
2024-03-15 16:33           ` Jens Axboe
2024-03-15 15:42 ` [PATCH 00/11] remove aux CQE caches Jens Axboe
2024-03-15 16:00 ` Jens Axboe
2024-03-15 22:53 ` (subset) " Jens Axboe
2024-03-16  2:03   ` Ming Lei
2024-03-16  2:24     ` Ming Lei
2024-03-16  2:54       ` Pavel Begunkov
2024-03-16  3:54         ` Ming Lei
2024-03-16  4:13           ` Pavel Begunkov
2024-03-16  4:20             ` Pavel Begunkov
2024-03-16  9:53               ` Ming Lei
2024-03-16 11:52   ` Ming Lei
2024-03-16 13:27     ` Pavel Begunkov
2024-03-16 13:56       ` Ming Lei
2024-03-17 20:55         ` Pavel Begunkov
2024-03-17 21:24           ` Jens Axboe
2024-03-17 21:29             ` Pavel Begunkov
2024-03-17 21:32               ` Jens Axboe
2024-03-17 21:34                 ` Pavel Begunkov
2024-03-17 21:47                   ` Pavel Begunkov
2024-03-17 21:51                     ` Jens Axboe
2024-03-17 22:07                       ` Jens Axboe
2024-03-17 22:24                         ` Jens Axboe
2024-03-18  0:15                           ` Ming Lei
2024-03-18  1:34                             ` Jens Axboe
2024-03-18  1:44                               ` Jens Axboe
2024-03-18  1:49                               ` Ming Lei
2024-03-17 23:16                       ` Pavel Begunkov
2024-03-16 14:39       ` Jens Axboe
