On Fri, Sep 02, 2022 at 10:32:16AM -0600, Jens Axboe wrote:
>On 9/2/22 10:06 AM, Jens Axboe wrote:
>> On 9/2/22 9:16 AM, Kanchan Joshi wrote:
>>> Hi,
>>>
>>> Currently uring-cmd lacks the ability to leverage the pre-registered
>>> buffers. This series adds the support in uring-cmd, and plumbs
>>> nvme passthrough to work with it.
>>>
>>> Using registered-buffers showed peak-perf hike from 1.85M to 2.17M IOPS
>>> in my setup.
>>>
>>> Without fixedbufs
>>> *****************
>>> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>>> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
>>> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
>>> Engine=io_uring, sq_ring=128, cq_ring=128
>>> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
>>> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
>>> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
>>> ^CExiting on signal
>>> Maximum IOPS=1.85M
>>
>> With the poll support queued up, I ran this one as well. tldr is:
>>
>> bdev (non pt)    122M IOPS
>> irq driven       51-52M IOPS
>> polled           71M IOPS
>> polled+fixed     78M IOPS

Except the first one, the other three entries are for passthru? Somehow
I did not see that big of a gap. I will try to align my setup in the
coming days.

>> Looking at profiles, it looks like the bio is still being allocated
>> and freed and not dipping into the alloc cache, which is using a
>> substantial amount of CPU. I'll poke a bit and see what's going on...
>
>It's using the fs_bio_set, and that doesn't have the PERCPU alloc cache
>enabled. With the below, we then do:

Thanks for the find.

>polled+fixed     82M
>
>I suspect the remainder is due to the lack of batching on the request
>freeing side, at least some of it. Haven't really looked deeper yet.
>
>One issue I saw - try and use passthrough polling without having any
>poll queues defined and it'll stall just spinning on completions. You
>need to ensure that these are processed as well - look at how the
>non-passthrough io_uring poll path handles it.

I had tested this earlier, and it used to run fine, but it does not now.
I see that IOs are getting completed: the irq completion arrives in nvme
and triggers the task-work based completion (by calling
io_uring_cmd_complete_in_task). But the task-work never gets run, and
therefore no completion happens.

io_uring_cmd_complete_in_task -> io_req_task_work_add -> __io_req_task_work_add

It seems the task work does not get added. Something about the newly
added IORING_SETUP_DEFER_TASKRUN changes the scenario. (I have put a
small toy model of the resulting stall at the end of this mail.)

static inline void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
{
        struct io_uring_task *tctx = req->task->io_uring;
        struct io_ring_ctx *ctx = req->ctx;
        struct llist_node *node;

        if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
                io_req_local_work_add(req);
                return;
        }
        ....

To confirm, I commented that flag out in t/io_uring and it runs fine.
Please see if that changes anything for you. I will try to find the
actual fix tomorrow.

diff --git a/t/io_uring.c b/t/io_uring.c
index d893b7b2..ac5f60e0 100644
--- a/t/io_uring.c
+++ b/t/io_uring.c
@@ -460,7 +460,6 @@ static int io_uring_setup(unsigned entries, struct io_uring_params *p)
 
 	p->flags |= IORING_SETUP_COOP_TASKRUN;
 	p->flags |= IORING_SETUP_SINGLE_ISSUER;
-	p->flags |= IORING_SETUP_DEFER_TASKRUN;
 retry:
 	ret = syscall(__NR_io_uring_setup, entries, p);
 	if (!ret)
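
To make the failure mode concrete (mostly for my own understanding), here
is a tiny userspace model of the stall. This is not kernel code: cq,
local_work and the helper names are mine, and only loosely mirror the CQ
ring, the ctx-local list that io_req_local_work_add() appends to, and the
deferred task-work run. The point it tries to show is just that a waiter
which only reaps the CQ never sees completions parked on the local list;
it has to drain that list as well, which is how I read Jens' pointer to
the non-passthrough poll path.

/* stall_model.c - toy userspace model of the stall, NOT kernel code */
#include <stdbool.h>
#include <stdio.h>

struct completion {
	struct completion *next;
};

static struct completion *cq;          /* stands in for the CQ ring         */
static struct completion *local_work;  /* stands in for the ctx-local llist */

/* models what io_req_local_work_add() does on the irq/completion side */
static void local_work_add(struct completion *c)
{
	c->next = local_work;
	local_work = c;
}

/* models running the deferred task_work: move entries onto the "CQ" */
static void run_local_work(void)
{
	while (local_work) {
		struct completion *c = local_work;

		local_work = c->next;
		c->next = cq;
		cq = c;
	}
}

/* models the completion poll loop; 'drain_local' is the piece in question */
static int poll_completions(bool drain_local)
{
	int reaped = 0;
	long spins = 0;

	while (!reaped && spins++ < 1000000) {
		if (drain_local)
			run_local_work();  /* without this, we spin forever */
		while (cq) {
			cq = cq->next;
			reaped++;
		}
	}
	return reaped;
}

int main(void)
{
	struct completion a, b;

	local_work_add(&a);  /* "irq" parked a completion on the local list */
	printf("poll, not draining local work: reaped %d\n",
	       poll_completions(false));  /* 0 - the stall (bounded here)   */

	cq = NULL;           /* reset the toy state */
	local_work = NULL;
	local_work_add(&b);
	printf("poll, draining local work:     reaped %d\n",
	       poll_completions(true));   /* 1 - completion becomes visible */
	return 0;
}

Without the draining step the first poll reaps nothing, which matches what
I see in t/io_uring once IORING_SETUP_DEFER_TASKRUN is set; with it, the
parked completion shows up on the next poll. The real fix presumably
belongs in the passthrough poll/wait path rather than in t/io_uring, but I
have not found the right spot yet.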