public inbox for [email protected]
From: Dave Chinner <[email protected]>
To: Christoph Hellwig <[email protected]>
Cc: Pierre Labat <[email protected]>, Keith Busch <[email protected]>,
	Kanchan Joshi <[email protected]>,
	Keith Busch <[email protected]>,
	"[email protected]" <[email protected]>,
	"[email protected]" <[email protected]>,
	"[email protected]" <[email protected]>,
	"[email protected]" <[email protected]>,
	"[email protected]" <[email protected]>,
	"[email protected]" <[email protected]>,
	"[email protected]" <[email protected]>,
	"[email protected]" <[email protected]>,
	"[email protected]" <[email protected]>
Subject: Re: [EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
Date: Thu, 14 Nov 2024 10:51:09 +1100
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>

On Wed, Nov 13, 2024 at 05:47:36AM +0100, Christoph Hellwig wrote:
> On Tue, Nov 12, 2024 at 06:18:21PM +0000, Pierre Labat wrote:
> > About 2)
> > Provide a simple way for the user to decide which layer generates write hints.
> > As an example, as some of you pointed out, what if the filesystem wants to generate write hints to optimize its [own] data handling by the storage, and at the same time the application using the FS understands the storage and also wants to optimize using write hints?
> > Both use cases are legit, I think.
> > To handle that in a simple way, why not have a filesystem mount parameter enabling/disabling the use of write hints by the FS?
> 
> The file system is, and always has been, the entity in charge of
> resource allocation of the underlying device.  Bypassing it will get
> you in trouble, and a simple mount option isn't really changing that
> (it's also not exactly a scalable interface).
> 
> If an application wants to micro-manage placement decisions it should not
> use a file system, or at least not a normal one with Posix semantics.
> That being said we've demonstrated that applications using proper grouping
> of data by file and the simple temperature hints can get very good results
> from file systems that can interpret them, without a lot of work in the
> file system.  I suspect for most applications that actually want files
> that is actually going to give better results than trying to do the
> micro-management that tries to bypass the file system.

This.

The most important thing that filesystems do behind the scenes is
manage -data locality-. XFS has thousands of lines of code to manage
and control data locality - the allocation policy API itself has
*dozens* of control parameters. We have 2 separate allocation
architectures (one btree based, one bitmap based) and multiple
locality policy algorithms. These juggle physical alignment, size
granularity, size limits, data type being allocated for, desired
locality targets, different search algorithms (e.g. first fit, best
fit, exact fit by size or location, etc), multiple fallback
strategies when the initial target cannot be met, etc.
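
To make the search algorithms named above concrete, here is a toy
sketch of first fit, best fit, and one fallback strategy over a list
of free extents. This is not XFS code: the extent representation and
the names first_fit, best_fit and allocate are hypothetical, chosen
only for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (start_block, length_in_blocks)
Extent = Tuple[int, int]


def first_fit(free: List[Extent], want: int) -> Optional[Extent]:
    """Return the lowest-addressed free extent that can hold `want` blocks."""
    for start, length in sorted(free):
        if length >= want:
            return (start, want)
    return None


def best_fit(free: List[Extent], want: int) -> Optional[Extent]:
    """Return the smallest free extent that can still hold `want` blocks."""
    candidates = [e for e in free if e[1] >= want]
    if not candidates:
        return None
    # Smallest sufficient extent wins; ties go to the lowest start block.
    start, _ = min(candidates, key=lambda e: (e[1], e[0]))
    return (start, want)


def allocate(free: List[Extent], want: int) -> Optional[Extent]:
    """Try best fit; if the target cannot be met, fall back to the
    largest free extent and return a shorter allocation."""
    hit = best_fit(free, want)
    if hit is not None:
        return hit
    if not free:
        return None
    return max(free, key=lambda e: e[1])


free_space = [(0, 8), (100, 4), (200, 16)]
print(first_fit(free_space, 4))
print(best_fit(free_space, 4))
print(allocate(free_space, 32))
```

Even in this toy form the two searches pick different extents for the
same request, which is why the choice of policy matters for locality.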

Allocation policy management is the core of every block based
filesystem that has ever been written.

Specifically to this "stream hint" discussion: go look at the XFS
filestreams allocator.

SGI wrote an entirely new allocator for XFS whose only purpose in
life is to automatically separate individual streams of user data
into physically separate regions of LBA space.

This was written to optimise realtime ingest and playback of
multiple uncompressed 4k and 8k video data streams from big
isochronous SAN storage arrays back in ~2005.  Each stream could be
up to 1.2GB/s of data. If the data for each IO was not exactly
placed in alignment with the storage array stripe cache granularity
(2MB, IIRC), then a cache miss would occur and the IO latency would
be too high and frames of data would be missed/dropped.
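
The alignment constraint described above is simple arithmetic. A
minimal sketch, assuming the 2MB granularity recalled above and
byte-addressed offsets (the helper names align_up and is_aligned are
hypothetical, not taken from any allocator):

```python
# Assumed stripe cache granularity from the text above (2MB).
STRIPE_BYTES = 2 * 1024 * 1024


def align_up(offset: int, granularity: int = STRIPE_BYTES) -> int:
    """Round `offset` up to the next multiple of `granularity`."""
    return (offset + granularity - 1) // granularity * granularity


def is_aligned(offset: int, length: int, granularity: int = STRIPE_BYTES) -> bool:
    """True if an IO starts and ends on a granularity boundary."""
    return offset % granularity == 0 and length % granularity == 0


print(align_up(1))
print(is_aligned(0, 4 * STRIPE_BYTES))
print(is_aligned(4096, STRIPE_BYTES))
```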

IOWs, we have an allocator in XFS that was specifically designed to
separate independent streams of data to independent regions of the
filesystem LBA space to efficiently support data IO rates in the order
of tens of GB/s.

What are we talking about now? Storage hardware that might be able
to do 10-15GB/s of IO that needs stream separation for efficient
management of the internal storage resources.

The fact we have previously solved this class of stream separation
problem at the filesystem level *without needing a user-controlled
API at all* is probably the most relevant fact missing from this
discussion.

As to the concern about stream/temp/hint translation consistency
across different hardware: the filesystem is the perfect place to
provide this abstraction to users. The block device can expose what
it supports, the user API can be fixed, and the filesystem can
provide the mapping between the two that won't change for the life
of the filesystem...
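
A minimal sketch of what such a mapping could look like, assuming a
fixed set of four user-visible temperature hints and a device that
reports some number of placement streams. The names HINTS and
build_hint_map are hypothetical, and the even spreading is just one
possible policy, not something any filesystem implements today:

```python
from typing import Dict, Optional

# Fixed user-facing hints (assumed set, coldest last).
HINTS = ("short", "medium", "long", "extreme")


def build_hint_map(num_device_streams: int) -> Dict[str, Optional[int]]:
    """Map each fixed hint to a device stream index.

    Hints are spread evenly over whatever the device exposes, so the
    user API stays the same whether the device has 1, 2 or 8 streams.
    A device with no streams maps every hint to None.
    """
    if num_device_streams <= 0:
        return {hint: None for hint in HINTS}
    return {
        hint: (i * num_device_streams) // len(HINTS)
        for i, hint in enumerate(HINTS)
    }


for streams in (0, 1, 2, 4, 8):
    print(streams, build_hint_map(streams))
```

In this sketch the map would be computed once (e.g. at mkfs time) and
stored, so it stays stable for the life of the filesystem.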

Long story short: Christoph is right.

The OS hints/streams API needs to be aligned to the capabilities
that filesystems already provide *as a primary design goal*. What
the new hardware might support is a secondary concern. i.e. hardware
driven software design is almost always a mistake: define the user
API and abstractions first, then the OS can reduce it sanely down to
what the specific hardware present is capable of supporting.

-Dave.
-- 
Dave Chinner
[email protected]

Thread overview: 28+ messages
2024-11-08 19:36 [PATCHv11 0/9] write hints with nvme fdp and scsi streams Keith Busch
2024-11-08 19:36 ` [PATCHv11 1/9] block: use generic u16 for write hints Keith Busch
2024-11-08 19:36 ` [PATCHv11 2/9] block: introduce max_write_hints queue limit Keith Busch
2024-11-08 19:36 ` [PATCHv11 3/9] statx: add write hint information Keith Busch
2024-11-08 19:36 ` [PATCHv11 4/9] block: allow ability to limit partition write hints Keith Busch
2024-11-08 19:36 ` [PATCHv11 5/9] block, fs: add write hint to kiocb Keith Busch
2024-11-08 19:36 ` [PATCHv11 6/9] io_uring: enable per-io hinting capability Keith Busch
2024-11-08 19:36 ` [PATCHv11 7/9] block: export placement hint feature Keith Busch
2024-11-11 10:29 ` [PATCHv11 0/9] write hints with nvme fdp and scsi streams Christoph Hellwig
2024-11-11 16:27   ` Keith Busch
2024-11-11 16:34     ` Christoph Hellwig
2024-11-12 13:26   ` Kanchan Joshi
2024-11-12 13:34     ` Christoph Hellwig
2024-11-12 14:25       ` Keith Busch
2024-11-12 16:50         ` Christoph Hellwig
2024-11-12 17:19           ` Christoph Hellwig
2024-11-12 18:18         ` [EXT] " Pierre Labat
2024-11-13  4:47           ` Christoph Hellwig
2024-11-13 23:51             ` Dave Chinner [this message]
2024-11-14  3:09               ` Martin K. Petersen
2024-11-14  6:07               ` Christoph Hellwig
2024-11-15 16:28                 ` Keith Busch
2024-11-15 16:53                   ` Christoph Hellwig
2024-11-18 23:37                     ` Keith Busch
2024-11-19  7:15                       ` Christoph Hellwig
2024-11-20 17:21                         ` Darrick J. Wong
2024-11-20 18:11                           ` Keith Busch
2024-11-21  7:17                             ` Christoph Hellwig