public inbox for [email protected]
 help / color / mirror / Atom feed
From: Keith Busch <[email protected]>
To: Dave Chinner <[email protected]>
Cc: Al Viro <[email protected]>, Keith Busch <[email protected]>,
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected]
Subject: Re: [PATCH 4/5] io_uring: add support for dma pre-mapping
Date: Thu, 28 Jul 2022 07:25:38 -0600	[thread overview]
Message-ID: <YuKOUh6MJGGFuIm/@kbusch-mbp.dhcp.thefacebook.com> (raw)
In-Reply-To: <[email protected]>

On Thu, Jul 28, 2022 at 12:35:11PM +1000, Dave Chinner wrote:
> On Wed, Jul 27, 2022 at 05:00:09PM -0600, Keith Busch wrote:
> > The block_device driver has to opt-in to this feature. If a multi-device block
> > driver wants to opt-in to this, then it would be responsible to handle
> > translating that driver's specific cookie to whatever representation the
> > drivers it stacks atop require. Otherwise, the cookie threaded through the bio
> > is an opque value: nothing between io_uring and the block_device driver need to
> > decode it.
> 
> I'm not talking about "multi-device" block devices like we build
> with DM or MD to present a single stacked block device to the
> filesystem. I'm talking about the fact that both btrfs and XFS
> support multiple *independent* block devices in the one filesystem.
> 
> i.e.:
> 
> # mkfs.xfs -r rtdev=/dev/nvme0n1 -l logdev=/dev/nvme1n1,size=2000m /dev/nvme2n1
> meta-data=/dev/nvme2n1           isize=512    agcount=4, agsize=22893287 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=0    bigtime=1 inobtcount=1 nrext64=0
> data     =                       bsize=4096   blocks=91573146, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =/dev/nvme1n1           bsize=4096   blocks=512000, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =/dev/nvme0n1           extsz=4096   blocks=91573146, rtextents=91573146
> #
> 
> This builds an XFS filesystem which can write file data to either
> /dev/nvme0n1 or /dev/nvme2n1, and journal IO will get sent to a
> third block dev (/dev/nvme1n1).
> 
> So, which block device do we map for the DMA buffers that contain
> the file data for any given file in that filesystem? There is no
> guarantee that is is sb->s_bdev, because it only points at one of
> the two block devices that can contain file data.
> 
> Btrfs is similar, but it might stripe data across /dev/nvme0n1,
> /dev/nvme1n1 and /dev/nvme2n1 for a single file writes (and hence
> reads) and so needs separate DMA mappings for each block device just
> to do IO direct to/from one file....
> 
> Indeed, for XFS there's no requirement that the block devices have
> the same capabilities or even storage types - the rtdev could be
> spinning disks, the logdev an nvme SSD, and the datadev is pmem. If
> XFs has to do something special, it queries the bdev it needs to
> operate on (e.g. DAX mappings are only allowed on pmem based
> devices).
> 
> Hence it is invalid to assume that sb->s_bdev points at the actual
> block device the data for any given regular file is stored on. It is
> also invalid to assume the characteristics of the device in
> sb->s_bdev are common for all files in the filesystem.
> 
> IOWs, the only way you can make something like this work via
> filesystem mapping infrastructure to translate file offset to
> to a {dev, dev_offset} tuple to tell you what persistently mapped
> device buffers you need to use for IO to the given file {offset,len}
> range that IO needs to be done on....

Thank you for the explanation. I understand now, sorry for my previous
misunderstanding.

I may consider just initially supporting direct raw block devices if I can't
find a viable solution quick enough.

  reply	other threads:[~2022-07-28 13:25 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-26 17:38 [PATCH 0/5] dma mapping optimisations Keith Busch
2022-07-26 17:38 ` [PATCH 1/5] blk-mq: add ops to dma map bvec Keith Busch
2022-07-26 17:38 ` [PATCH 2/5] iov_iter: introduce type for preregistered dma tags Keith Busch
2022-07-26 23:10   ` Al Viro
2022-07-27 13:52     ` Keith Busch
2022-07-26 17:38 ` [PATCH 3/5] block: add dma tag bio type Keith Busch
2022-07-26 17:38 ` [PATCH 4/5] io_uring: add support for dma pre-mapping Keith Busch
2022-07-26 23:12   ` Al Viro
2022-07-27 13:58     ` Keith Busch
2022-07-27 14:04       ` Al Viro
2022-07-27 15:04         ` Keith Busch
2022-07-27 22:32           ` Dave Chinner
2022-07-27 23:00             ` Keith Busch
2022-07-28  2:35               ` Dave Chinner
2022-07-28 13:25                 ` Keith Busch [this message]
2022-07-27 14:11   ` Al Viro
2022-07-27 14:48     ` Keith Busch
2022-07-27 15:26       ` Al Viro
2022-07-26 17:38 ` [PATCH 5/5] nvme-pci: implement dma_map support Keith Busch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YuKOUh6MJGGFuIm/@kbusch-mbp.dhcp.thefacebook.com \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox