On 9/2/22 1:32 PM, Jens Axboe wrote:
> On 9/2/22 12:46 PM, Kanchan Joshi wrote:
>> On Fri, Sep 02, 2022 at 10:32:16AM -0600, Jens Axboe wrote:
>>> On 9/2/22 10:06 AM, Jens Axboe wrote:
>>>> On 9/2/22 9:16 AM, Kanchan Joshi wrote:
>>>>> Hi,
>>>>>
>>>>> Currently uring-cmd lacks the ability to leverage the pre-registered
>>>>> buffers. This series adds the support in uring-cmd, and plumbs
>>>>> nvme passthrough to work with it.
>>>>>
>>>>> Using registered buffers showed a peak-perf hike from 1.85M to 2.17M
>>>>> IOPS in my setup.
>>>>>
>>>>> Without fixedbufs
>>>>> *****************
>>>>> # taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>>>>> submitter=0, tid=5256, file=/dev/ng0n1, node=-1
>>>>> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
>>>>> Engine=io_uring, sq_ring=128, cq_ring=128
>>>>> IOPS=1.85M, BW=904MiB/s, IOS/call=32/31
>>>>> IOPS=1.85M, BW=903MiB/s, IOS/call=32/32
>>>>> IOPS=1.85M, BW=902MiB/s, IOS/call=32/32
>>>>> ^CExiting on signal
>>>>> Maximum IOPS=1.85M
>>>>
>>>> With the poll support queued up, I ran this one as well. The tl;dr is:
>>>>
>>>> bdev (non pt)    122M IOPS
>>>> irq driven       51-52M IOPS
>>>> polled           71M IOPS
>>>> polled+fixed     78M IOPS
>>
>> Except the first one, the other three entries are for passthru? Somehow
>> I didn't see that big of a gap. I will try to align my setup in the
>> coming days.
>
> Right, sorry, it was badly labeled. The first one is bdev with polling,
> registered buffers, etc. The others are all passthrough mode. Polled
> goes to 74M with the caching fix, so it's about a 74M -> 82M bump using
> registered buffers with passthrough and polling.
>
>>> polled+fixed     82M
>>>
>>> I suspect the remainder is due to the lack of batching on the request
>>> freeing side, at least some of it. Haven't really looked deeper yet.
>>>
>>> One issue I saw: try to use passthrough polling without having any
>>> poll queues defined and it'll stall, just spinning on completions. You
>>> need to ensure that these are processed as well - look at how the
>>> non-passthrough io_uring poll path handles it.
>>
>> I had tested this earlier, and it used to run fine. It does not now.
>> I see that the IOs are getting completed, the irq completion is
>> arriving in nvme, and it is triggering task-work based completion (by
>> calling io_uring_cmd_complete_in_task). But the task work never got
>> called, and therefore no completion happened.
>>
>> io_uring_cmd_complete_in_task -> io_req_task_work_add -> __io_req_task_work_add
>>
>> It seems the task work did not get added. Something about the newly
>> added IORING_SETUP_DEFER_TASKRUN changes the scenario.
>>
>> static inline void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
>> {
>>         struct io_uring_task *tctx = req->task->io_uring;
>>         struct io_ring_ctx *ctx = req->ctx;
>>         struct llist_node *node;
>>
>>         if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
>>                 io_req_local_work_add(req);
>>                 return;
>>         }
>>         ....
>>
>> To confirm, I commented that out in t/io_uring and it runs fine.
>> Please see if that changes anything for you? I will try to find the
>> actual fix tomorrow.
>
> Ah gotcha, yes that actually makes a lot of sense. I wonder if regular
> polling is then also broken without poll queues if
> IORING_SETUP_DEFER_TASKRUN is set. It should be; I'll check into
> io_iopoll_check().

A mix of fixes and just cleanups, here's what I got.

--
Jens Axboe
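
[Editor's illustration] The setup-flag knob discussed above (dropping
IORING_SETUP_DEFER_TASKRUN, which Kanchan did temporarily in t/io_uring to
confirm the stall) can be sketched from user space with plain liburing. This
is only a minimal illustration of that workaround, not the actual t/io_uring
change; the setup_ring() helper and its use_defer_taskrun parameter are
invented for the example.

    /*
     * Minimal sketch: set up a ring with
     * IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN, falling back
     * to no setup flags when the kernel rejects them or when the caller
     * wants to mimic "commenting the flag out" as described above.
     */
    #include <liburing.h>
    #include <stdio.h>
    #include <string.h>

    static int setup_ring(struct io_uring *ring, unsigned entries,
                          int use_defer_taskrun)
    {
            struct io_uring_params p;
            int ret;

            memset(&p, 0, sizeof(p));
            if (use_defer_taskrun) {
                    /* DEFER_TASKRUN requires SINGLE_ISSUER */
                    p.flags = IORING_SETUP_SINGLE_ISSUER |
                              IORING_SETUP_DEFER_TASKRUN;
            }

            ret = io_uring_queue_init_params(entries, ring, &p);
            if (ret < 0 && use_defer_taskrun) {
                    fprintf(stderr, "DEFER_TASKRUN setup failed (%d), retrying without\n",
                            ret);
                    return setup_ring(ring, entries, 0);
            }
            return ret;
    }

    int main(void)
    {
            struct io_uring ring;

            /* Flip the last argument to 0 to mimic commenting the flag out. */
            if (setup_ring(&ring, 128, 1) < 0)
                    return 1;
            io_uring_queue_exit(&ring);
            return 0;
    }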