On Fri, Sep 02, 2022 at 10:32:16AM -0600, Jens Axboe wrote:
>On 9/2/22 10:06 AM, Jens Axboe wrote:
>> On 9/2/22 9:16 AM, Kanchan Joshi wrote:
>>> Hi,
>>>
>>> Currently uring-cmd lacks the ability to leverage the pre-registered
>>> buffers. This series adds the support in uring-cmd, and plumbs
>>> nvme passthrough to work with it.
>>>
>>> Using registered-buffers showed peak-perf hike from 1.85M to 2.17M IOPS
>>> in my setup.
>>>
>>> Without fixedbufs
>>> *****************
>>> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>>> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
>>> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
>>> Engine=io_uring, sq_ring=128, cq_ring=128
>>> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
>>> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
>>> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
>>> ^CExiting on signal
>>> Maximum IOPS=1.85M
>>
>> With the poll support queued up, I ran this one as well. tldr is:
>>
>> bdev (non pt)    122M IOPS
>> irq driven       51-52M IOPS
>> polled           71M IOPS
>> polled+fixed     78M IOPS

Except the first one, the other three entries are for passthru? Somehow
I did not see that big of a gap. I will try to align my setup in the
coming days.

>> Looking at profiles, it looks like the bio is still being allocated
>> and freed and not dipping into the alloc cache, which is using a
>> substantial amount of CPU. I'll poke a bit and see what's going on...
>
>It's using the fs_bio_set, and that doesn't have the PERCPU alloc cache
>enabled. With the below, we then do:

Thanks for the find.

>polled+fixed     82M
>
>I suspect the remainder is due to the lack of batching on the request
>freeing side, at least some of it. Haven't really looked deeper yet.
>
>One issue I saw - try and use passthrough polling without having any
>poll queues defined and it'll stall just spinning on completions. You
>need to ensure that these are processed as well - look at how the
>non-passthrough io_uring poll path handles it.

I had tested this earlier, and it used to run fine, but it does not now.
I see that IOs are getting completed: the irq completion arrives in nvme
and triggers the task-work based completion (by calling
io_uring_cmd_complete_in_task). But the task-work never gets run, and
therefore no completion happens.

io_uring_cmd_complete_in_task -> io_req_task_work_add -> __io_req_task_work_add

It seems the task work does not get added. Something about the newly
added IORING_SETUP_DEFER_TASKRUN changes the scenario. (I have put a
small toy model of the resulting stall at the end of this mail.)

static inline void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
{
        struct io_uring_task *tctx = req->task->io_uring;
        struct io_ring_ctx *ctx = req->ctx;
        struct llist_node *node;

        if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
                io_req_local_work_add(req);
                return;
        }
        ....

To confirm, I commented that flag out in t/io_uring and it runs fine.
Please see if that changes anything for you. I will try to find the
actual fix tomorrow.

diff --git a/t/io_uring.c b/t/io_uring.c
index d893b7b2..ac5f60e0 100644
--- a/t/io_uring.c
+++ b/t/io_uring.c
@@ -460,7 +460,6 @@ static int io_uring_setup(unsigned entries, struct io_uring_params *p)
 
 	p->flags |= IORING_SETUP_COOP_TASKRUN;
 	p->flags |= IORING_SETUP_SINGLE_ISSUER;
-	p->flags |= IORING_SETUP_DEFER_TASKRUN;
 retry:
 	ret = syscall(__NR_io_uring_setup, entries, p);
 	if (!ret)
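
To make the failure mode concrete (mostly for my own understanding), here
is a tiny userspace model of the stall. This is not kernel code: cq,
local_work and the helper names are mine, and only loosely mirror the CQ
ring, the ctx-local list that io_req_local_work_add() appends to, and the
deferred task-work run. The point it tries to show is just that a waiter
which only reaps the CQ never sees completions parked on the local list;
it has to drain that list as well, which is how I read Jens' pointer to
the non-passthrough poll path.

/* stall_model.c - toy userspace model of the stall, NOT kernel code */
#include <stdbool.h>
#include <stdio.h>

struct completion {
	struct completion *next;
};

static struct completion *cq;          /* stands in for the CQ ring         */
static struct completion *local_work;  /* stands in for the ctx-local llist */

/* models what io_req_local_work_add() does on the irq/completion side */
static void local_work_add(struct completion *c)
{
	c->next = local_work;
	local_work = c;
}

/* models running the deferred task_work: move entries onto the "CQ" */
static void run_local_work(void)
{
	while (local_work) {
		struct completion *c = local_work;

		local_work = c->next;
		c->next = cq;
		cq = c;
	}
}

/* models the completion poll loop; 'drain_local' is the piece in question */
static int poll_completions(bool drain_local)
{
	int reaped = 0;
	long spins = 0;

	while (!reaped && spins++ < 1000000) {
		if (drain_local)
			run_local_work();  /* without this, we spin forever */
		while (cq) {
			cq = cq->next;
			reaped++;
		}
	}
	return reaped;
}

int main(void)
{
	struct completion a, b;

	local_work_add(&a);  /* "irq" parked a completion on the local list */
	printf("poll, not draining local work: reaped %d\n",
	       poll_completions(false));  /* 0 - the stall (bounded here)   */

	cq = NULL;           /* reset the toy state */
	local_work = NULL;
	local_work_add(&b);
	printf("poll, draining local work:     reaped %d\n",
	       poll_completions(true));   /* 1 - completion becomes visible */
	return 0;
}

Without the draining step the first poll reaps nothing, which matches what
I see in t/io_uring once IORING_SETUP_DEFER_TASKRUN is set; with it, the
parked completion shows up on the next poll. The real fix presumably
belongs in the passthrough poll/wait path rather than in t/io_uring, but I
have not found the right spot yet.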