public inbox for [email protected]
From: Kanchan Joshi <[email protected]>
To: Jens Axboe <[email protected]>
Cc: Kanchan Joshi <[email protected]>,
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected]
Subject: Re: [RFC PATCH 00/12] io_uring attached nvme queue
Date: Mon, 1 May 2023 17:06:24 +0530	[thread overview]
Message-ID: <CA+1E3r+J5pAywR8p9h=seJ+b=ckY27qsnrG0O_9iU2F+LwDnqA@mail.gmail.com> (raw)
In-Reply-To: <[email protected]>

On Sat, Apr 29, 2023 at 10:55 PM Jens Axboe <[email protected]> wrote:
>
> On 4/29/23 3:39 AM, Kanchan Joshi wrote:
> > This series shows one way to do what the title says.
> > This puts up a more direct/lean path that enables
> >  - submission from io_uring SQE to NVMe SQE
> >  - completion from NVMe CQE to io_uring CQE
> > Essentially cutting out the intermediate hops (request/bio) in the nvme io path.
> >
> > Also, the io_uring ring is not to be shared among application threads.
> > The application is responsible for building any sharing (if it feels
> > the need). This means a ring-associated exclusive queue can do away
> > with some of the synchronization costs that a shared queue incurs.
> >
> > The primary objective is to further improve the efficiency of the
> > kernel io path (towards PCIe gen N, N+1 hardware).
> > And we are seeing some asks for this too [1].
> >
> > Building-blocks
> > ===============
> > At high level, series can be divided into following parts -
> >
> > 1. nvme driver starts exposing some queue-pairs (SQ+CQ) that can
> > be attached to other in-kernel users (not just to the block-layer,
> > which is the case at the moment) on demand.
> >
> > Example:
> > insmod nvme.ko poll_queues=1 raw_queues=2
> >
> > nvme0: 24/0/1/2 default/read/poll queues/raw queues
> >
> > While the driver registers the other queues with the block-layer,
> > raw-queues are reserved for exclusive attachment by other in-kernel
> > users. At this point, each raw-queue is interrupt-disabled (similar to
> > poll_queues). Maybe we need a better name for these (e.g. app/user queues).
> > [Refer: patch 2]
> >
> > 2. register/unregister queue interface
> > (a) one for io_uring application to ask for device-queue and register
> > with the ring. [Refer: patch 4]
> > (b) another at nvme so that other in-kernel users (io_uring for now) can
> > ask for a raw-queue. [Refer: patch 3, 5, 6]
> >
> > The latter returns a qid, which io_uring stores internally (not exposed
> > to user-space) in the ring ctx. At most one queue per ring is enabled.
> > The ring has no other special properties except the fact that it stores
> > a qid it can use exclusively. So the application can very well use the
> > ring for things other than nvme io.
> >
> > 3. user-interface to send commands down this way
> > (a) uring-cmd is extended to support a new flag "IORING_URING_CMD_DIRECT"
> > that the application passes in the SQE. That is all.
> > (b) the flag goes down to the provider of ->uring_cmd, which may choose
> > to do things differently based on it (or ignore it).
> > [Refer: patch 7]
> >
> > 4. nvme uring-cmd understands the above flag. It submits the command
> > into the known pre-registered queue, and completes (polled-completion)
> > from it. Transformation from "struct io_uring_cmd" to "nvme command" is
> > done directly without building other intermediate constructs.
> > [Refer: patch 8, 10, 12]
> >
> > Testing and Performance
> > =======================
> > fio and t/io_uring are modified to exercise this path.
> > - fio: new "registerqueues" option
> > - t/io_uring: new "k" option
> >
> > Good part:
> > 2.96M -> 5.02M
> >
> > nvme io (without this):
> > # t/io_uring -b512 -d64 -c2 -s2 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k0 /dev/ng0n1
> > submitter=0, tid=2922, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=0 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=2.89M, BW=1412MiB/s, IOS/call=2/1
> > IOPS=2.92M, BW=1426MiB/s, IOS/call=2/2
> > IOPS=2.96M, BW=1444MiB/s, IOS/call=2/1
> > Exiting on timeout
> > Maximum IOPS=2.96M
> >
> > nvme io (with this):
> > # t/io_uring -b512 -d64 -c2 -s2 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
> > submitter=0, tid=2927, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=4.99M, BW=2.43GiB/s, IOS/call=2/1
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=2/1
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=2/1
> > Exiting on timeout
> > Maximum IOPS=5.02M
> >
> > Not so good part:
> > While a single IO is fast this way, we do not have batching abilities
> > for the multi-IO scenario. Plugging, submission and completion batching
> > are tied to block-layer constructs. Things should look better if we
> > could do something about that.
> > In particular, something seems off with the completion-batching.
> >
> > With -s32 and -c32, the numbers decline:
> >
> > # t/io_uring -b512 -d64 -c32 -s32 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
> > submitter=0, tid=3674, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=3.70M, BW=1806MiB/s, IOS/call=32/31
> > IOPS=3.71M, BW=1812MiB/s, IOS/call=32/31
> > IOPS=3.71M, BW=1812MiB/s, IOS/call=32/32
> > Exiting on timeout
> > Maximum IOPS=3.71M
> >
> > And perf gets restored if we go back to -c2
> >
> > # t/io_uring -b512 -d64 -c2 -s32 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
> > submitter=0, tid=3677, file=/dev/ng0n1, node=-1
> > polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
> > Engine=io_uring, sq_ring=64, cq_ring=64
> > IOPS=4.99M, BW=2.44GiB/s, IOS/call=5/5
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=5/5
> > IOPS=5.02M, BW=2.45GiB/s, IOS/call=5/5
> > Exiting on timeout
> > Maximum IOPS=5.02M
> >
> > Source
> > ======
> > Kernel: https://github.com/OpenMPDK/linux/tree/feat/directq-v1
> > fio: https://github.com/OpenMPDK/fio/commits/feat/rawq-v2
> >
> > Please take a look.
>
> This looks like a great starting point! Unfortunately I won't be at
> LSFMM this year to discuss it in person, but I'll be taking a closer
> look at this.

That will help, thanks.

> Some quick initial reactions:
>
> - I'd call them "user" queues rather than raw or whatever, I think that
>   more accurately describes what they are for.

Right, that is better.

> - I guess there's no way around needing to pre-allocate these user
>   queues, just like we do for polled_queues right now?

Right, we would need to allocate the nvme sq/cq at the outset.
Changing the count at run-time is a bit murky. I will have another look though.

> In terms of user
>   API, it'd be nicer if you could just do IORING_REGISTER_QUEUE (insert
>   right name here...) and it'd allocate and return you an ID.

But this is exactly the API implemented in the patchset at the moment
(a new register opcode in io_uring).
So it seems I am missing your point?

> - Need to take a look at the uring_cmd stuff again, but would be nice if
>   we did not have to add more stuff to fops for this. Maybe we can set
>   aside a range of "ioctl" type commands through uring_cmd for this
>   instead, and go that way for registering/unregistering queues.

Yes, I see your point about not having to add new fops.
But a new uring_cmd opcode exists only at the nvme level.
It is a good way to allocate/deallocate an nvme queue, but it cannot
attach that queue to io_uring's ring.
Or do you have a different view? This seems connected to the previous point.

> We do have some users that are CPU constrained, and while my testing
> easily maxes out a gen2 optane (actually 2 or 3) with the generic IO
> path, that's also with all the fat that adds overhead removed. Most
> people don't have this luxury, necessarily, or actually need some of
> this fat for their monitoring, for example. This would provide a nice
> way to have pretty consistent and efficient performance across distro
> type configs, which would be great, while still retaining the fattier
> bits for "normal" IO.

Makes total sense.

Thread overview: 15+ messages

2023-04-29  9:39 ` [RFC PATCH 00/12] io_uring attached nvme queue Kanchan Joshi
2023-04-29  9:39     ` [RFC PATCH 01/12] nvme: refactor nvme_alloc_io_tag_set Kanchan Joshi
2023-04-29  9:39     ` [RFC PATCH 02/12] pci: enable "raw_queues = N" module parameter Kanchan Joshi
2023-04-29  9:39     ` [RFC PATCH 03/12] fs, block: interface to register/unregister the raw-queue Kanchan Joshi
2023-04-29  9:39     ` [RFC PATCH 04/12] io_uring, fs: plumb support to register/unregister raw-queue Kanchan Joshi
2023-04-29  9:39     ` [RFC PATCH 05/12] nvme: wire-up register/unregister queue f_op callback Kanchan Joshi
2023-04-29  9:39     ` [RFC PATCH 06/12] pci: implement register/unregister functionality Kanchan Joshi
2023-04-29  9:39     ` [RFC PATCH 07/12] io_uring: support for using registered queue in uring-cmd Kanchan Joshi
2023-04-29  9:39     ` [RFC PATCH 08/12] block: add mq_ops to submit and complete commands from raw-queue Kanchan Joshi
2023-04-29  9:39     ` [RFC PATCH 09/12] nvme: carve out a helper to prepare nvme_command from ioucmd->cmd Kanchan Joshi
2023-04-29  9:39     ` [RFC PATCH 10/12] nvme: submisssion/completion of uring_cmd to/from the registered queue Kanchan Joshi
2023-04-29  9:39     ` [RFC PATCH 11/12] pci: modify nvme_setup_prp_simple parameters Kanchan Joshi
2023-04-29  9:39     ` [RFC PATCH 12/12] pci: implement submission/completion for rawq commands Kanchan Joshi
2023-04-29 17:17   ` [RFC PATCH 00/12] io_uring attached nvme queue Jens Axboe
2023-05-01 11:36     ` Kanchan Joshi [this message]
