On 9/2/22 1:32 PM, Jens Axboe wrote:
> On 9/2/22 12:46 PM, Kanchan Joshi wrote:
>> On Fri, Sep 02, 2022 at 10:32:16AM -0600, Jens Axboe wrote:
>>> On 9/2/22 10:06 AM, Jens Axboe wrote:
>>>> On 9/2/22 9:16 AM, Kanchan Joshi wrote:
>>>>> Hi,
>>>>>
>>>>> Currently uring-cmd lacks the ability to leverage the pre-registered
>>>>> buffers. This series adds the support in uring-cmd, and plumbs
>>>>> nvme passthrough to work with it.
>>>>>
>>>>> Using registered buffers showed a peak-perf hike from 1.85M to 2.17M
>>>>> IOPS in my setup.
>>>>>
>>>>> Without fixedbufs
>>>>> *****************
>>>>> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>>>>> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
>>>>> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
>>>>> Engine=io_uring, sq_ring=128, cq_ring=128
>>>>> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
>>>>> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
>>>>> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
>>>>> ^CExiting on signal
>>>>> Maximum IOPS=1.85M
>>>>
>>>> With the poll support queued up, I ran this one as well. The tl;dr is:
>>>>
>>>> bdev (non pt)    122M IOPS
>>>> irq driven       51-52M IOPS
>>>> polled           71M IOPS
>>>> polled+fixed     78M IOPS
>>
>> Except the first one, the other three entries are for passthru? Somehow
>> I didn't see that big of a gap. I will try to align my setup in the
>> coming days.
>
> Right, sorry, it was badly labeled. The first one is bdev with polling,
> registered buffers, etc. The others are all passthrough mode. Polled
> goes to 74M with the caching fix, so it's about a 74M -> 82M bump using
> registered buffers with passthrough and polling.
>
>>> polled+fixed     82M
>>>
>>> I suspect the remainder is due to the lack of batching on the request
>>> freeing side, at least some of it. Haven't really looked deeper yet.
>>>
>>> One issue I saw: try to use passthrough polling without having any
>>> poll queues defined and it'll stall, just spinning on completions. You
>>> need to ensure that these are processed as well - look at how the
>>> non-passthrough io_uring poll path handles it.
>>
>> I had tested this earlier, and it used to run fine. It does not now.
>> I see that the IOs are getting completed, the irq completion is
>> arriving in nvme, and it is triggering task-work based completion (by
>> calling io_uring_cmd_complete_in_task). But the task work never got
>> called, and therefore no completion happened.
>>
>> io_uring_cmd_complete_in_task -> io_req_task_work_add -> __io_req_task_work_add
>>
>> It seems the task work did not get added. Something about the newly
>> added IORING_SETUP_DEFER_TASKRUN changes the scenario.
>>
>> static inline void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
>> {
>>         struct io_uring_task *tctx = req->task->io_uring;
>>         struct io_ring_ctx *ctx = req->ctx;
>>         struct llist_node *node;
>>
>>         if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
>>                 io_req_local_work_add(req);
>>                 return;
>>         }
>>         ....
>>
>> To confirm, I commented that out in t/io_uring and it runs fine.
>> Please see if that changes anything for you? I will try to find the
>> actual fix tomorrow.
>
> Ah gotcha, yes that actually makes a lot of sense. I wonder if regular
> polling is then also broken without poll queues if
> IORING_SETUP_DEFER_TASKRUN is set. It should be; I'll check into
> io_iopoll_check().

A mix of fixes and just cleanups, here's what I got.

--
Jens Axboe
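
[Editor's illustration] The setup-flag knob discussed above (dropping
IORING_SETUP_DEFER_TASKRUN, which Kanchan did temporarily in t/io_uring to
confirm the stall) can be sketched from user space with plain liburing. This
is only a minimal illustration of that workaround, not the actual t/io_uring
change; the setup_ring() helper and its use_defer_taskrun parameter are
invented for the example.

    /*
     * Minimal sketch: set up a ring with
     * IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN, falling back
     * to no setup flags when the kernel rejects them or when the caller
     * wants to mimic "commenting the flag out" as described above.
     */
    #include <liburing.h>
    #include <stdio.h>
    #include <string.h>

    static int setup_ring(struct io_uring *ring, unsigned entries,
                          int use_defer_taskrun)
    {
            struct io_uring_params p;
            int ret;

            memset(&p, 0, sizeof(p));
            if (use_defer_taskrun) {
                    /* DEFER_TASKRUN requires SINGLE_ISSUER */
                    p.flags = IORING_SETUP_SINGLE_ISSUER |
                              IORING_SETUP_DEFER_TASKRUN;
            }

            ret = io_uring_queue_init_params(entries, ring, &p);
            if (ret < 0 && use_defer_taskrun) {
                    fprintf(stderr, "DEFER_TASKRUN setup failed (%d), retrying without\n",
                            ret);
                    return setup_ring(ring, entries, 0);
            }
            return ret;
    }

    int main(void)
    {
            struct io_uring ring;

            /* Flip the last argument to 0 to mimic commenting the flag out. */
            if (setup_ring(&ring, 128, 1) < 0)
                    return 1;
            io_uring_queue_exit(&ring);
            return 0;
    }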