* [PATCHv11 0/9] write hints with nvme fdp and scsi streams
@ 2024-11-08 19:36 Keith Busch
2024-11-08 19:36 ` [PATCHv11 1/9] block: use generic u16 for write hints Keith Busch
` (7 more replies)
0 siblings, 8 replies; 28+ messages in thread
From: Keith Busch @ 2024-11-08 19:36 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, linux-fsdevel, io-uring,
axboe
Cc: hch, martin.petersen, asml.silence, javier.gonz, joshi.k,
Keith Busch
From: Keith Busch <[email protected]>
Changes from v10:
Fixed FDP max handle size calculations (wrong type)
Defined and used FDP constants instead of literal numbers
Moved io_uring write_hint to the end of the SQE so as not to overlap
with other defined fields except uring_cmd
Default partition split so partition one gets all the write hints
exclusively
Folded in the fix for stacking block stream feature for nvme-multipath
(from hch xfs-zoned-streams branch)
Kanchan Joshi (2):
io_uring: enable per-io hinting capability
nvme: enable FDP support
Keith Busch (7):
block: use generic u16 for write hints
block: introduce max_write_hints queue limit
statx: add write hint information
block: allow ability to limit partition write hints
block, fs: add write hint to kiocb
block: export placement hint feature
scsi: set permanent stream count in block limits
Documentation/ABI/stable/sysfs-block | 14 ++++++
block/bdev.c | 22 +++++++++
block/blk-settings.c | 5 ++
block/blk-sysfs.c | 6 +++
block/fops.c | 31 +++++++++++--
block/partitions/core.c | 45 +++++++++++++++++-
drivers/nvme/host/core.c | 69 ++++++++++++++++++++++++++++
drivers/nvme/host/multipath.c | 3 +-
drivers/nvme/host/nvme.h | 5 ++
drivers/scsi/sd.c | 2 +
fs/stat.c | 1 +
include/linux/blk-mq.h | 3 +-
include/linux/blk_types.h | 4 +-
include/linux/blkdev.h | 15 ++++++
include/linux/fs.h | 1 +
include/linux/nvme.h | 37 +++++++++++++++
include/linux/stat.h | 1 +
include/uapi/linux/io_uring.h | 4 ++
include/uapi/linux/stat.h | 3 +-
io_uring/io_uring.c | 2 +
io_uring/rw.c | 2 +-
21 files changed, 263 insertions(+), 12 deletions(-)
--
2.43.5
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCHv11 1/9] block: use generic u16 for write hints
2024-11-08 19:36 [PATCHv11 0/9] write hints with nvme fdp and scsi streams Keith Busch
@ 2024-11-08 19:36 ` Keith Busch
2024-11-08 19:36 ` [PATCHv11 2/9] block: introduce max_write_hints queue limit Keith Busch
` (6 subsequent siblings)
7 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2024-11-08 19:36 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, linux-fsdevel, io-uring,
axboe
Cc: hch, martin.petersen, asml.silence, javier.gonz, joshi.k,
Keith Busch, Bart Van Assche
From: Keith Busch <[email protected]>
This remains backward compatible with lifetime hints; it just no longer
constrains the hints to that definition. Using this type doesn't change
the size of either the bio or the request.
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
include/linux/blk-mq.h | 3 +--
include/linux/blk_types.h | 3 +--
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2035fad3131fb..08ed7b5c4dbbf 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -8,7 +8,6 @@
#include <linux/scatterlist.h>
#include <linux/prefetch.h>
#include <linux/srcu.h>
-#include <linux/rw_hint.h>
struct blk_mq_tags;
struct blk_flush_queue;
@@ -156,7 +155,7 @@ struct request {
struct blk_crypto_keyslot *crypt_keyslot;
#endif
- enum rw_hint write_hint;
+ unsigned short write_hint;
unsigned short ioprio;
enum mq_rq_state state;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index dce7615c35e7e..6737795220e18 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -10,7 +10,6 @@
#include <linux/bvec.h>
#include <linux/device.h>
#include <linux/ktime.h>
-#include <linux/rw_hint.h>
struct bio_set;
struct bio;
@@ -219,7 +218,7 @@ struct bio {
*/
unsigned short bi_flags; /* BIO_* below */
unsigned short bi_ioprio;
- enum rw_hint bi_write_hint;
+ unsigned short bi_write_hint;
blk_status_t bi_status;
atomic_t __bi_remaining;
--
2.43.5
* [PATCHv11 2/9] block: introduce max_write_hints queue limit
2024-11-08 19:36 [PATCHv11 0/9] write hints with nvme fdp and scsi streams Keith Busch
2024-11-08 19:36 ` [PATCHv11 1/9] block: use generic u16 for write hints Keith Busch
@ 2024-11-08 19:36 ` Keith Busch
2024-11-08 19:36 ` [PATCHv11 3/9] statx: add write hint information Keith Busch
` (5 subsequent siblings)
7 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2024-11-08 19:36 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, linux-fsdevel, io-uring,
axboe
Cc: hch, martin.petersen, asml.silence, javier.gonz, joshi.k,
Keith Busch
From: Keith Busch <[email protected]>
Drivers for hardware that supports write streams need a way to export how
many are available so that applications can query this generically.
Signed-off-by: Keith Busch <[email protected]>
---
Documentation/ABI/stable/sysfs-block | 7 +++++++
block/blk-settings.c | 3 +++
block/blk-sysfs.c | 3 +++
include/linux/blkdev.h | 12 ++++++++++++
4 files changed, 25 insertions(+)
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 8353611107154..f2db2cabb8e75 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -506,6 +506,13 @@ Description:
[RO] Maximum size in bytes of a single element in a DMA
scatter/gather list.
+What: /sys/block/<disk>/queue/max_write_hints
+Date: October 2024
+Contact: [email protected]
+Description:
+ [RO] Maximum number of write hints supported, 0 if not
+ supported. If supported, valid values are 1 through
+ max_write_hints, inclusive.
What: /sys/block/<disk>/queue/max_segments
Date: March 2010
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 5ee3d6d1448df..f9f831f104615 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -43,6 +43,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
/* Inherit limits from component devices */
+ lim->max_write_hints = USHRT_MAX;
lim->max_segments = USHRT_MAX;
lim->max_discard_segments = USHRT_MAX;
lim->max_hw_sectors = UINT_MAX;
@@ -544,6 +545,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->max_segment_size = min_not_zero(t->max_segment_size,
b->max_segment_size);
+ t->max_write_hints = min(t->max_write_hints, b->max_write_hints);
+
alignment = queue_limit_alignment_offset(b, start);
/* Bottom device has different alignment. Check that it is
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0ef4e13e247d9..1925ea23bd290 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -104,6 +104,7 @@ QUEUE_SYSFS_LIMIT_SHOW(max_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_discard_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_integrity_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_segment_size)
+QUEUE_SYSFS_LIMIT_SHOW(max_write_hints)
QUEUE_SYSFS_LIMIT_SHOW(logical_block_size)
QUEUE_SYSFS_LIMIT_SHOW(physical_block_size)
QUEUE_SYSFS_LIMIT_SHOW(chunk_sectors)
@@ -457,6 +458,7 @@ QUEUE_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb");
QUEUE_RO_ENTRY(queue_max_segments, "max_segments");
QUEUE_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
QUEUE_RO_ENTRY(queue_max_segment_size, "max_segment_size");
+QUEUE_RO_ENTRY(queue_max_write_hints, "max_write_hints");
QUEUE_RW_LOAD_MODULE_ENTRY(elv_iosched, "scheduler");
QUEUE_RO_ENTRY(queue_logical_block_size, "logical_block_size");
@@ -591,6 +593,7 @@ static struct attribute *queue_attrs[] = {
&queue_max_discard_segments_entry.attr,
&queue_max_integrity_segments_entry.attr,
&queue_max_segment_size_entry.attr,
+ &queue_max_write_hints_entry.attr,
&queue_hw_sector_size_entry.attr,
&queue_logical_block_size_entry.attr,
&queue_physical_block_size_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1b51a7c92e9be..1477f751ad8bd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -394,6 +394,8 @@ struct queue_limits {
unsigned short max_integrity_segments;
unsigned short max_discard_segments;
+ unsigned short max_write_hints;
+
unsigned int max_open_zones;
unsigned int max_active_zones;
@@ -1198,6 +1200,11 @@ static inline unsigned short queue_max_segments(const struct request_queue *q)
return q->limits.max_segments;
}
+static inline unsigned short queue_max_write_hints(struct request_queue *q)
+{
+ return q->limits.max_write_hints;
+}
+
static inline unsigned short queue_max_discard_segments(const struct request_queue *q)
{
return q->limits.max_discard_segments;
@@ -1245,6 +1252,11 @@ static inline unsigned int bdev_max_segments(struct block_device *bdev)
return queue_max_segments(bdev_get_queue(bdev));
}
+static inline unsigned short bdev_max_write_hints(struct block_device *bdev)
+{
+ return queue_max_write_hints(bdev_get_queue(bdev));
+}
+
static inline unsigned queue_logical_block_size(const struct request_queue *q)
{
return q->limits.logical_block_size;
--
2.43.5
* [PATCHv11 3/9] statx: add write hint information
2024-11-08 19:36 [PATCHv11 0/9] write hints with nvme fdp and scsi streams Keith Busch
2024-11-08 19:36 ` [PATCHv11 1/9] block: use generic u16 for write hints Keith Busch
2024-11-08 19:36 ` [PATCHv11 2/9] block: introduce max_write_hints queue limit Keith Busch
@ 2024-11-08 19:36 ` Keith Busch
2024-11-08 19:36 ` [PATCHv11 4/9] block: allow ability to limit partition write hints Keith Busch
` (4 subsequent siblings)
7 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2024-11-08 19:36 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, linux-fsdevel, io-uring,
axboe
Cc: hch, martin.petersen, asml.silence, javier.gonz, joshi.k,
Keith Busch
From: Keith Busch <[email protected]>
If requested on a raw block device, report the maximum write hint the
block device supports.
Signed-off-by: Keith Busch <[email protected]>
---
block/bdev.c | 5 +++++
fs/stat.c | 1 +
include/linux/stat.h | 1 +
include/uapi/linux/stat.h | 3 ++-
4 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/block/bdev.c b/block/bdev.c
index 738e3c8457e7f..9a59f0c882170 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1296,6 +1296,11 @@ void bdev_statx(struct path *path, struct kstat *stat,
stat->result_mask |= STATX_DIOALIGN;
}
+ if (request_mask & STATX_WRITE_HINT) {
+ stat->write_hint_max = bdev_max_write_hints(bdev);
+ stat->result_mask |= STATX_WRITE_HINT;
+ }
+
if (request_mask & STATX_WRITE_ATOMIC && bdev_can_atomic_write(bdev)) {
struct request_queue *bd_queue = bdev->bd_queue;
diff --git a/fs/stat.c b/fs/stat.c
index 41e598376d7e3..60bcd5c2e2a1d 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -704,6 +704,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
tmp.stx_atomic_write_unit_min = stat->atomic_write_unit_min;
tmp.stx_atomic_write_unit_max = stat->atomic_write_unit_max;
tmp.stx_atomic_write_segments_max = stat->atomic_write_segments_max;
+ tmp.stx_write_hint_max = stat->write_hint_max;
return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
}
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 3d900c86981c5..48f0f64846a02 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -57,6 +57,7 @@ struct kstat {
u32 atomic_write_unit_min;
u32 atomic_write_unit_max;
u32 atomic_write_segments_max;
+ u32 write_hint_max;
};
/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 887a252864416..10f5622c21113 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -132,7 +132,7 @@ struct statx {
__u32 stx_atomic_write_unit_max; /* Max atomic write unit in bytes */
/* 0xb0 */
__u32 stx_atomic_write_segments_max; /* Max atomic write segment count */
- __u32 __spare1[1];
+ __u32 stx_write_hint_max;
/* 0xb8 */
__u64 __spare3[9]; /* Spare space for future expansion */
/* 0x100 */
@@ -164,6 +164,7 @@ struct statx {
#define STATX_MNT_ID_UNIQUE 0x00004000U /* Want/got extended stx_mount_id */
#define STATX_SUBVOL 0x00008000U /* Want/got stx_subvol */
#define STATX_WRITE_ATOMIC 0x00010000U /* Want/got atomic_write_* fields */
+#define STATX_WRITE_HINT 0x00020000U /* Want/got write_hint_max */
#define STATX__RESERVED 0x80000000U /* Reserved for future struct statx expansion */
--
2.43.5
* [PATCHv11 4/9] block: allow ability to limit partition write hints
2024-11-08 19:36 [PATCHv11 0/9] write hints with nvme fdp and scsi streams Keith Busch
` (2 preceding siblings ...)
2024-11-08 19:36 ` [PATCHv11 3/9] statx: add write hint information Keith Busch
@ 2024-11-08 19:36 ` Keith Busch
2024-11-08 19:36 ` [PATCHv11 5/9] block, fs: add write hint to kiocb Keith Busch
` (3 subsequent siblings)
7 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2024-11-08 19:36 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, linux-fsdevel, io-uring,
axboe
Cc: hch, martin.petersen, asml.silence, javier.gonz, joshi.k,
Keith Busch
From: Keith Busch <[email protected]>
When multiple partitions are used, you may want to enforce different
subsets of the available write hints for each partition. Provide a
bitmap attribute of the available write hints, and allow an admin to
write a different mask to set the partition's allowed write hints.
Signed-off-by: Keith Busch <[email protected]>
---
Documentation/ABI/stable/sysfs-block | 7 +++++
block/bdev.c | 17 +++++++++++
block/partitions/core.c | 45 ++++++++++++++++++++++++++--
include/linux/blk_types.h | 1 +
4 files changed, 68 insertions(+), 2 deletions(-)
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index f2db2cabb8e75..fa2db6b638d63 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -187,6 +187,13 @@ Description:
partition is offset from the internal allocation unit's
natural alignment.
+What: /sys/block/<disk>/<partition>/write_hint_mask
+Date: October 2024
+Contact: [email protected]
+Description:
+ The mask of allowed write hints. You can limit which hints the
+ block layer will use by writing a new mask. Only the first
+ partition can access all the write hints by default.
What: /sys/block/<disk>/<partition>/stat
Date: February 2008
diff --git a/block/bdev.c b/block/bdev.c
index 9a59f0c882170..e6f9d19db599b 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -415,6 +415,7 @@ void __init bdev_cache_init(void)
struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
{
struct block_device *bdev;
+ unsigned short write_hint;
struct inode *inode;
inode = new_inode(blockdev_superblock);
@@ -440,6 +441,22 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
return NULL;
}
bdev->bd_disk = disk;
+
+ write_hint = bdev_max_write_hints(bdev);
+ if (write_hint) {
+ bdev->write_hint_mask = bitmap_alloc(write_hint, GFP_KERNEL);
+ if (!bdev->write_hint_mask) {
+ free_percpu(bdev->bd_stats);
+ iput(inode);
+ return NULL;
+ }
+
+ if (partno == 1)
+ bitmap_set(bdev->write_hint_mask, 0, write_hint);
+ else
+ bitmap_clear(bdev->write_hint_mask, 0, write_hint);
+ }
+
return bdev;
}
diff --git a/block/partitions/core.c b/block/partitions/core.c
index 815ed33caa1b8..c71a5d34339d7 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -203,6 +203,41 @@ static ssize_t part_discard_alignment_show(struct device *dev,
return sprintf(buf, "%u\n", bdev_discard_alignment(dev_to_bdev(dev)));
}
+static ssize_t part_write_hint_mask_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct block_device *bdev = dev_to_bdev(dev);
+ unsigned short max_write_hints = bdev_max_write_hints(bdev);
+
+ if (!max_write_hints)
+ return sprintf(buf, "0");
+ return sprintf(buf, "%*pb\n", max_write_hints, bdev->write_hint_mask);
+}
+
+static ssize_t part_write_hint_mask_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct block_device *bdev = dev_to_bdev(dev);
+ unsigned short max_write_hints = bdev_max_write_hints(bdev);
+ unsigned long *new_mask;
+
+ if (!max_write_hints)
+ return count;
+
+ new_mask = bitmap_alloc(max_write_hints, GFP_KERNEL);
+ if (!new_mask)
+ return -ENOMEM;
+
+ bitmap_parse(buf, count, new_mask, max_write_hints);
+ bitmap_copy(bdev->write_hint_mask, new_mask, max_write_hints);
+ smp_wmb();
+ bitmap_free(new_mask);
+
+ return count;
+}
+
static DEVICE_ATTR(partition, 0444, part_partition_show, NULL);
static DEVICE_ATTR(start, 0444, part_start_show, NULL);
static DEVICE_ATTR(size, 0444, part_size_show, NULL);
@@ -211,6 +246,8 @@ static DEVICE_ATTR(alignment_offset, 0444, part_alignment_offset_show, NULL);
static DEVICE_ATTR(discard_alignment, 0444, part_discard_alignment_show, NULL);
static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
+static DEVICE_ATTR(write_hint_mask, 0644, part_write_hint_mask_show,
+ part_write_hint_mask_store);
#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
__ATTR(make-it-fail, 0644, part_fail_show, part_fail_store);
@@ -225,6 +262,7 @@ static struct attribute *part_attrs[] = {
&dev_attr_discard_alignment.attr,
&dev_attr_stat.attr,
&dev_attr_inflight.attr,
+ &dev_attr_write_hint_mask.attr,
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
#endif
@@ -245,8 +283,11 @@ static const struct attribute_group *part_attr_groups[] = {
static void part_release(struct device *dev)
{
- put_disk(dev_to_bdev(dev)->bd_disk);
- bdev_drop(dev_to_bdev(dev));
+ struct block_device *part = dev_to_bdev(dev);
+
+ bitmap_free(part->write_hint_mask);
+ put_disk(part->bd_disk);
+ bdev_drop(part);
}
static int part_uevent(const struct device *dev, struct kobj_uevent_env *env)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 6737795220e18..af430e543f7f7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -73,6 +73,7 @@ struct block_device {
#ifdef CONFIG_SECURITY
void *bd_security;
#endif
+ unsigned long *write_hint_mask;
/*
* keep this out-of-line as it's both big and not needed in the fast
* path
--
2.43.5
* [PATCHv11 5/9] block, fs: add write hint to kiocb
2024-11-08 19:36 [PATCHv11 0/9] write hints with nvme fdp and scsi streams Keith Busch
` (3 preceding siblings ...)
2024-11-08 19:36 ` [PATCHv11 4/9] block: allow ability to limit partition write hints Keith Busch
@ 2024-11-08 19:36 ` Keith Busch
2024-11-08 19:36 ` [PATCHv11 6/9] io_uring: enable per-io hinting capability Keith Busch
` (2 subsequent siblings)
7 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2024-11-08 19:36 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, linux-fsdevel, io-uring,
axboe
Cc: hch, martin.petersen, asml.silence, javier.gonz, joshi.k,
Keith Busch
From: Keith Busch <[email protected]>
This prepares for sources other than the inode to provide a write hint.
The block layer will use it for direct IO if the requested hint is
within the block device's allowed hints. The hint field in the kiocb
structure fits in an existing 2-byte hole, so its size is not changed.
Signed-off-by: Keith Busch <[email protected]>
---
block/fops.c | 31 ++++++++++++++++++++++++++++---
include/linux/fs.h | 1 +
2 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/block/fops.c b/block/fops.c
index 2d01c90076813..bb3855ee044f0 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -71,7 +71,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
}
bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
- bio.bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+ bio.bi_write_hint = iocb->ki_write_hint;
bio.bi_ioprio = iocb->ki_ioprio;
if (iocb->ki_flags & IOCB_ATOMIC)
bio.bi_opf |= REQ_ATOMIC;
@@ -200,7 +200,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
for (;;) {
bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
- bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+ bio->bi_write_hint = iocb->ki_write_hint;
bio->bi_private = dio;
bio->bi_end_io = blkdev_bio_end_io;
bio->bi_ioprio = iocb->ki_ioprio;
@@ -316,7 +316,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
dio->flags = 0;
dio->iocb = iocb;
bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
- bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+ bio->bi_write_hint = iocb->ki_write_hint;
bio->bi_end_io = blkdev_bio_end_io_async;
bio->bi_ioprio = iocb->ki_ioprio;
@@ -362,6 +362,23 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
return -EIOCBQUEUED;
}
+static int blkdev_write_hint(struct kiocb *iocb, struct block_device *bdev)
+{
+ u16 hint = iocb->ki_write_hint;
+
+ if (!hint)
+ return file_inode(iocb->ki_filp)->i_write_hint;
+
+ if (hint > bdev_max_write_hints(bdev))
+ return -EINVAL;
+
+ if (bdev_is_partition(bdev) &&
+ !test_bit(hint - 1, bdev->write_hint_mask))
+ return -EINVAL;
+
+ return hint;
+}
+
static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
@@ -373,6 +390,14 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
if (blkdev_dio_invalid(bdev, iocb, iter))
return -EINVAL;
+ if (iov_iter_rw(iter) == WRITE) {
+ int hint = blkdev_write_hint(iocb, bdev);
+
+ if (hint < 0)
+ return hint;
+ iocb->ki_write_hint = hint;
+ }
+
nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1);
if (likely(nr_pages <= BIO_MAX_VECS)) {
if (is_sync_kiocb(iocb))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4b5cad44a1268..1a00accf412e5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -370,6 +370,7 @@ struct kiocb {
void *private;
int ki_flags;
u16 ki_ioprio; /* See linux/ioprio.h */
+ u16 ki_write_hint;
union {
/*
* Only used for async buffered reads, where it denotes the
--
2.43.5
* [PATCHv11 6/9] io_uring: enable per-io hinting capability
2024-11-08 19:36 [PATCHv11 0/9] write hints with nvme fdp and scsi streams Keith Busch
` (4 preceding siblings ...)
2024-11-08 19:36 ` [PATCHv11 5/9] block, fs: add write hint to kiocb Keith Busch
@ 2024-11-08 19:36 ` Keith Busch
2024-11-08 19:36 ` [PATCHv11 7/9] block: export placement hint feature Keith Busch
2024-11-11 10:29 ` [PATCHv11 0/9] write hints with nvme fdp and scsi streams Christoph Hellwig
7 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2024-11-08 19:36 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, linux-fsdevel, io-uring,
axboe
Cc: hch, martin.petersen, asml.silence, javier.gonz, joshi.k,
Nitesh Shetty, Keith Busch
From: Kanchan Joshi <[email protected]>
With the F_SET_RW_HINT fcntl, a user can set a hint on the file inode,
and all subsequent writes on the file pass that hint value down. This
can be limiting for a block device, as all writes will be tagged with
only one lifetime hint value, and concurrent writes (with different hint
values) are hard to manage. Per-IO hinting solves that problem.
Allow userspace to pass additional metadata in the SQE.
__u16 write_hint;
If the hint is provided, filesystems may optionally use it. A filesystem
may ignore this field if it does not support per-IO hints, or if the
value is invalid for its backing storage. Just like the inode hints,
requesting a value that is not supported by the hardware is not an
error.
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
include/uapi/linux/io_uring.h | 4 ++++
io_uring/io_uring.c | 2 ++
io_uring/rw.c | 2 +-
3 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 56cf30b49ef5f..4a6c95c923eb4 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -98,6 +98,10 @@ struct io_uring_sqe {
__u64 addr3;
__u64 __pad2[1];
};
+ struct {
+ __u64 __pad4[1];
+ __u16 write_hint;
+ };
__u64 optval;
/*
* If the ring is initialized with IORING_SETUP_SQE128, then
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 076171977d5e3..115af82b9151f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -4169,6 +4169,8 @@ static int __init io_uring_init(void)
BUILD_BUG_SQE_ELEM(46, __u16, __pad3[0]);
BUILD_BUG_SQE_ELEM(48, __u64, addr3);
BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
+ BUILD_BUG_SQE_ELEM(48, __u64, __pad4);
+ BUILD_BUG_SQE_ELEM(56, __u16, write_hint);
BUILD_BUG_SQE_ELEM(56, __u64, __pad2);
BUILD_BUG_ON(sizeof(struct io_uring_files_update) !=
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 93526a64ccd60..fdab23424f386 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -279,7 +279,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
rw->kiocb.ki_ioprio = get_current_ioprio();
}
rw->kiocb.dio_complete = NULL;
-
+ rw->kiocb.ki_write_hint = READ_ONCE(sqe->write_hint);
rw->addr = READ_ONCE(sqe->addr);
rw->len = READ_ONCE(sqe->len);
rw->flags = READ_ONCE(sqe->rw_flags);
--
2.43.5
* [PATCHv11 7/9] block: export placement hint feature
2024-11-08 19:36 [PATCHv11 0/9] write hints with nvme fdp and scsi streams Keith Busch
` (5 preceding siblings ...)
2024-11-08 19:36 ` [PATCHv11 6/9] io_uring: enable per-io hinting capability Keith Busch
@ 2024-11-08 19:36 ` Keith Busch
2024-11-11 10:29 ` [PATCHv11 0/9] write hints with nvme fdp and scsi streams Christoph Hellwig
7 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2024-11-08 19:36 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, linux-fsdevel, io-uring,
axboe
Cc: hch, martin.petersen, asml.silence, javier.gonz, joshi.k,
Keith Busch
From: Keith Busch <[email protected]>
Add a feature flag for devices that support generic placement hints in
write commands. This is in contrast to data lifetime hints.
Signed-off-by: Keith Busch <[email protected]>
---
block/blk-settings.c | 2 ++
block/blk-sysfs.c | 3 +++
include/linux/blkdev.h | 3 +++
3 files changed, 8 insertions(+)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index f9f831f104615..b809f31ad84f2 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -518,6 +518,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->features &= ~BLK_FEAT_NOWAIT;
if (!(b->features & BLK_FEAT_POLL))
t->features &= ~BLK_FEAT_POLL;
+ if (!(b->features & BLK_FEAT_PLACEMENT_HINTS))
+ t->features &= ~BLK_FEAT_PLACEMENT_HINTS;
t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1925ea23bd290..6280c5f89b8b7 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -260,6 +260,7 @@ static ssize_t queue_##_name##_show(struct gendisk *disk, char *page) \
QUEUE_SYSFS_FEATURE_SHOW(poll, BLK_FEAT_POLL);
QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA);
QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX);
+QUEUE_SYSFS_FEATURE_SHOW(placement_hints, BLK_FEAT_PLACEMENT_HINTS);
static ssize_t queue_zoned_show(struct gendisk *disk, char *page)
{
@@ -497,6 +498,7 @@ QUEUE_RW_ENTRY(queue_poll_delay, "io_poll_delay");
QUEUE_RW_ENTRY(queue_wc, "write_cache");
QUEUE_RO_ENTRY(queue_fua, "fua");
QUEUE_RO_ENTRY(queue_dax, "dax");
+QUEUE_RO_ENTRY(queue_placement_hints, "placement_hints");
QUEUE_RW_ENTRY(queue_io_timeout, "io_timeout");
QUEUE_RO_ENTRY(queue_virt_boundary_mask, "virt_boundary_mask");
QUEUE_RO_ENTRY(queue_dma_alignment, "dma_alignment");
@@ -626,6 +628,7 @@ static struct attribute *queue_attrs[] = {
&queue_wc_entry.attr,
&queue_fua_entry.attr,
&queue_dax_entry.attr,
+ &queue_placement_hints_entry.attr,
&queue_poll_delay_entry.attr,
&queue_virt_boundary_mask_entry.attr,
&queue_dma_alignment_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1477f751ad8bd..2ffe9a3b9dbff 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -333,6 +333,9 @@ typedef unsigned int __bitwise blk_features_t;
#define BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE \
((__force blk_features_t)(1u << 15))
+/* supports generic write placement hints */
+#define BLK_FEAT_PLACEMENT_HINTS ((__force blk_features_t)(1u << 16))
+
/*
* Flags automatically inherited when stacking limits.
*/
--
2.43.5
* Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-08 19:36 [PATCHv11 0/9] write hints with nvme fdp and scsi streams Keith Busch
` (6 preceding siblings ...)
2024-11-08 19:36 ` [PATCHv11 7/9] block: export placement hint feature Keith Busch
@ 2024-11-11 10:29 ` Christoph Hellwig
2024-11-11 16:27 ` Keith Busch
2024-11-12 13:26 ` Kanchan Joshi
7 siblings, 2 replies; 28+ messages in thread
From: Christoph Hellwig @ 2024-11-11 10:29 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, linux-nvme, linux-scsi, linux-fsdevel, io-uring,
axboe, hch, martin.petersen, asml.silence, javier.gonz, joshi.k,
Keith Busch
On Fri, Nov 08, 2024 at 11:36:20AM -0800, Keith Busch wrote:
> Default partition split so partition one gets all the write hints
> exclusively
I still don't think this actually works as expected, as the user
interface says the write streams are contiguous, and with the bitmap
they aren't.
As I seem to have a really hard time getting my point across, I instead
spent this morning doing a POC of what I mean, and pushed it here:
http://git.infradead.org/?p=users/hch/misc.git;a=shortlog;h=refs/heads/block-write-streams
The big differences are:
- there is a separate write_stream value now instead of overloading
the write hint. For now it is an 8-bit field for the internal
data structures so that we don't have to grow the bio, but all the
user interfaces are kept at 16 bits (or in case of statx reduced to
it). If this becomes now enough because we need to support devices
with multiple reclaim groups we'll have to find some space by using
unions or growing structures
- block/fops.c is the place to map the existing write hints into
the write streams instead of the driver
- the stream granularity is added, because adding it to statx at a
later time would be nasty. Getting it in nvme is actually amazingly
cumbersome so I gave up on that and just fed a dummy value for
testing, though
- the partition remapping is now done using an offset into the global
write stream space so that there is a contiguous number space.
The interface for this is rather hacky, so only treat it as a start
for interface and use case discussions.
- the generic stack limits code stopped stacking the max write
streams. While it does the right thing for simple things like
multipath and mirroring/striping, it is wrong for anything non-trivial
like parity raid. I've left this as a separate fold patch for the
discussion.
* Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-11 10:29 ` [PATCHv11 0/9] write hints with nvme fdp and scsi streams Christoph Hellwig
@ 2024-11-11 16:27 ` Keith Busch
2024-11-11 16:34 ` Christoph Hellwig
2024-11-12 13:26 ` Kanchan Joshi
1 sibling, 1 reply; 28+ messages in thread
From: Keith Busch @ 2024-11-11 16:27 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, linux-fsdevel,
io-uring, axboe, martin.petersen, asml.silence, javier.gonz,
joshi.k
On Mon, Nov 11, 2024 at 11:29:14AM +0100, Christoph Hellwig wrote:
> On Fri, Nov 08, 2024 at 11:36:20AM -0800, Keith Busch wrote:
> > Default partition split so partition one gets all the write hints
> > exclusively
>
> I still don't think this actually works as expected, as the user
> interface says the write streams are contiguous, and with the bitmap
> they aren't.
>
> As I seem to have a really hard time getting my point across, I instead
> spent this morning doing a POC of what I mean, and pushed it here:
>
> http://git.infradead.org/?p=users/hch/misc.git;a=shortlog;h=refs/heads/block-write-streams
Just purely for backward compatibility, I don't think you can have the
nvme driver error out if a stream is too large. The fcntl lifetime hint,
which gets set unconditionally from the file_inode without considering
the block device's max write streams, never errored out before.
> The big differences are:
>
> - there is a separate write_stream value now instead of overloading
> the write hint. For now it is an 8-bit field for the internal
> data structures so that we don't have to grow the bio, but all the
> user interfaces are kept at 16 bits (or in case of statx reduced to
> it). If this becomes not enough because we need to support devices
> with multiple reclaim groups we'll have to find some space by using
> unions or growing structures.
As far as I know, 255 possible streams exceeds any use case I know
about.
> - block/fops.c is the place to map the existing write hints into
> the write streams instead of the driver
I might be missing something here, but that part sure looks the same as
what's in this series.
> - the stream granularity is added, because adding it to statx at a
> later time would be nasty. Getting it in nvme is actually amazingly
> cumbersome so I gave up on that and just fed a dummy value for
> testing, though
Just regarding the documentation on the write_stream_granularity, you
don't need to discard the entire RU in a single command. You can
invalidate the RU simply by overwriting the LBAs without ever issuing
any discard commands.
If you really want to treat it this way, you need to ensure the first
LBA written to an RU is always aligned to NPDA/NPDAL.
If this is really what you require to move this forward, though, that's
fine with me.
> - the partitions remapping is now done using an offset into the global
> write stream space so that there is a contiguous number space.
> The interface for this is rather hacky, so only treat it as a start
> for interface and use case discussions.
> - the generic stack limits code stopped stacking the max write
> streams. While it does the right thing for simple cases like
> multipath and mirroring/striping, it is wrong for anything non-trivial
> like parity raid. I've left this as a separate fold patch for the
> discussion.
* Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-11 16:27 ` Keith Busch
@ 2024-11-11 16:34 ` Christoph Hellwig
0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2024-11-11 16:34 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, linux-fsdevel, io-uring, axboe, martin.petersen,
asml.silence, javier.gonz, joshi.k
On Mon, Nov 11, 2024 at 09:27:33AM -0700, Keith Busch wrote:
> Just purely for backward compatibility, I don't think you can have the
> nvme driver error out if a stream is too large. The fcntl lifetime hint,
> which gets set unconditionally from the file_inode without considering
> the block device's max write streams, never errored out before.
True. But block/fops.c should simply not set the write hint in that
case (or even do a bit of folding if we care enough).
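The "folding" mentioned here could look something like the sketch below; the function name and exact policy are purely illustrative, not taken from either series:

```c
/* Illustrative only: fold hints that exceed the device's advertised
 * limit back into the valid range instead of erroring out. */
#include <assert.h>

static unsigned short fold_write_hint(unsigned short hint,
				      unsigned short max_write_streams)
{
	if (!max_write_streams || !hint)
		return 0;		/* no streams, or no hint set */
	if (hint <= max_write_streams)
		return hint;		/* fits as-is */
	/* wrap out-of-range hints onto streams 1..max_write_streams */
	return (hint - 1) % max_write_streams + 1;
}
```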
> > - block/fops.c is the place to map the existing write hints into
> > the write streams instead of the driver
>
> I might be missing something here, but that part sure looks the same as
> what's in this series.
Your series simply mixes up the existing write (temperature) hint and
the write stream, including for file system use. This version does
something very similar, but only for block devices.
>
> > - the stream granularity is added, because adding it to statx at a
> > later time would be nasty. Getting it in nvme is actually amazingly
> > cumbersome so I gave up on that and just fed a dummy value for
> > testing, though
>
> Just regarding the documentation on the write_stream_granularity, you
> don't need to discard the entire RU in a single command. You can
> invalidate the RU simply by overwriting the LBAs without ever issuing
> any discard commands.
True. Did I mention this was a quick hack job?
> If you really want to treat it this way, you need to ensure the first
> LBA written to an RU is always aligned to NPDA/NPDAL.
Those are just hints as well, but I agree you probably get much
better results if they do.
> If this is really what you require to move this forward, though, that's
> fine with me.
I could move it forward, but right now I'm more than oversubscribed.
If someone actually pushing for this work could put more effort into it,
it would surely go faster.
* Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-11 10:29 ` [PATCHv11 0/9] write hints with nvme fdp and scsi streams Christoph Hellwig
2024-11-11 16:27 ` Keith Busch
@ 2024-11-12 13:26 ` Kanchan Joshi
2024-11-12 13:34 ` Christoph Hellwig
1 sibling, 1 reply; 28+ messages in thread
From: Kanchan Joshi @ 2024-11-12 13:26 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch
Cc: linux-block, linux-nvme, linux-scsi, linux-fsdevel, io-uring,
axboe, martin.petersen, asml.silence, javier.gonz, Keith Busch
On 11/11/2024 3:59 PM, Christoph Hellwig wrote:
> - there is a separate write_stream value now instead of overloading
> the write hint. For now it is an 8-bit field for the internal
> data structures so that we don't have to grow the bio, but all the
> user interfaces are kept at 16 bits (or in case of statx reduced to
> it). If this becomes not enough because we need to support devices
> with multiple reclaim groups we'll have to find some space by using
> unions or growing structures.
> - block/fops.c is the place to map the existing write hints into
> the write streams instead of the driver
Last time, when I attempted this separation between temperature and
placement hints, it required adding a new fcntl[*] too, because
per-inode/file hints continue to be useful even when they are treated
as passthrough by the FS.
Applications are able to use temperature hints to group multiple files
on the device regardless of the logical placement made by the FS.
The same ability is useful for write-streams/placement-hints too. But
these patches reduce the scope to the block device only.
IMO, passthrough propagation of hints/streams should remain the
default behavior, as it applies to multiple filesystems. And more
active placement by the FS should rather be enabled by some opt-in
(e.g., a mount option). Such an opt-in will anyway be needed for other
reasons (like regression avoidance on a broken device).
[*]
https://lore.kernel.org/linux-nvme/[email protected]/
* Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-12 13:26 ` Kanchan Joshi
@ 2024-11-12 13:34 ` Christoph Hellwig
2024-11-12 14:25 ` Keith Busch
0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2024-11-12 13:34 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, linux-fsdevel, io-uring, axboe, martin.petersen,
asml.silence, javier.gonz, Keith Busch
On Tue, Nov 12, 2024 at 06:56:25PM +0530, Kanchan Joshi wrote:
> IMO, passthrough propagation of hints/streams should continue to remain
> the default behavior as it applies on multiple filesystems. And more
> active placement by FS should rather be enabled by some opt in (e.g.,
> mount option). Such opt in will anyway be needed for other reasons (like
> regression avoidance on a broken device).
I feel like banging my head against the wall. No, passing through write
streams is simply not acceptable without the file system being in
control. I've said and explained this in detail about a dozen times,
and the file system actually needing to do data separation for its own
purposes doesn't go away by ignoring it.
* Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-12 13:34 ` Christoph Hellwig
@ 2024-11-12 14:25 ` Keith Busch
2024-11-12 16:50 ` Christoph Hellwig
2024-11-12 18:18 ` [EXT] " Pierre Labat
0 siblings, 2 replies; 28+ messages in thread
From: Keith Busch @ 2024-11-12 14:25 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, Keith Busch, linux-block, linux-nvme, linux-scsi,
linux-fsdevel, io-uring, axboe, martin.petersen, asml.silence,
javier.gonz
On Tue, Nov 12, 2024 at 02:34:39PM +0100, Christoph Hellwig wrote:
> On Tue, Nov 12, 2024 at 06:56:25PM +0530, Kanchan Joshi wrote:
> > IMO, passthrough propagation of hints/streams should continue to remain
> > the default behavior as it applies on multiple filesystems. And more
> > active placement by FS should rather be enabled by some opt in (e.g.,
> > mount option). Such opt in will anyway be needed for other reasons (like
> > regression avoidance on a broken device).
>
> I feel like banging my head against the wall. No, passing through write
> streams is simply not acceptable without the file system being in
> control. I've said and explained this in detail about a dozend times
> and the file system actually needing to do data separation for it's own
> purpose doesn't go away by ignoring it.
But that's just an ideological decision that doesn't jibe with how
people use these. The applications know how they use their data better
than the filesystem, so putting the filesystem in the way to force
streams to look like zones is just an unnecessary layer of indirection
getting in the way.
* Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-12 14:25 ` Keith Busch
@ 2024-11-12 16:50 ` Christoph Hellwig
2024-11-12 17:19 ` Christoph Hellwig
2024-11-12 18:18 ` [EXT] " Pierre Labat
1 sibling, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2024-11-12 16:50 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Kanchan Joshi, Keith Busch, linux-block,
linux-nvme, linux-scsi, linux-fsdevel, io-uring, axboe,
martin.petersen, asml.silence, javier.gonz
On Tue, Nov 12, 2024 at 07:25:45AM -0700, Keith Busch wrote:
> > I feel like banging my head against the wall. No, passing through write
> > streams is simply not acceptable without the file system being in
> > control. I've said and explained this in detail about a dozen times,
> > and the file system actually needing to do data separation for its own
> > purposes doesn't go away by ignoring it.
>
> But that's just an ideological decision that doesn't jibe with how
> people use these.
Sorry, but no it is not. The file system is the entity that owns the
block device, and it is the layer that manages the block device.
Bypassing it is a layering violation that creates a lot of problems
and solves none at all.
> The applications know how they use their data better
> than the filesystem,
That is a very bold assumption, and a clear indication that you are
actually approaching this with a rather ideological hat on. If your
specific application actually thinks it knows the storage better than
the file system that you are using, you probably should not be using
that file system. Use a raw block device, or even better passthrough
or SPDK, if you really know what you are doing (or at least think so).
Otherwise you need to agree that the file system is the final arbiter
of the underlying device resources. Hint: if you have an application
that knows what it is doing (there actually are a few of those), it's
usually not hard to actually work with file system people to create
abstractions that don't poke holes into the layering but still give the
applications what they want. There's also the third option of doing
something like what Damien did with zonefs and actually creating an
abstraction for what you are doing.
> so putting the filesystem in the way to force
> streams to look like zones is just an unnecessary layer of indirection
> getting in the way.
Can you please stop this BS? Even if a file system doesn't treat
write streams like zones and keeps LBA space and physical allocation
units entirely separate (for which I see no good reason, but others
might disagree), you still need the file system in control of the
hardware resources.
* Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-12 16:50 ` Christoph Hellwig
@ 2024-11-12 17:19 ` Christoph Hellwig
0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2024-11-12 17:19 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Kanchan Joshi, Keith Busch, linux-block,
linux-nvme, linux-scsi, linux-fsdevel, io-uring, axboe,
martin.petersen, asml.silence, javier.gonz
On Tue, Nov 12, 2024 at 05:50:54PM +0100, Christoph Hellwig wrote:
> > so putting the filesystem in the way to force
> > streams to look like zones is just an unnecessary layer of indirection
> > getting in the way.
>
> Can you please stop this BS? Even if a file system doesn't treat
> write streams like zones and keeps LBA space and physical allocation
> units entirely separate (for which I see no good reason, but others
> might disagree), you still need the file system in control of the
> hardware resources.
And in case this wasn't clear enough: let's assume you want to write
a low write-amp, flash-optimized file system similar to, say, the storage
layers of the all-flash arrays of the last 10-15 years.
You really want to avoid device GC. You'd better group your data to
the reclaim unit / erase block / insert name here. So you need file
system control of the write streams, you need to know their size, and
you need to be able to query how much you wrote after a power fail.
Totally independent of how you organize your LBA space. Mapping
it linearly might be the easier option without major downsides, but
you could also allocate them randomly for that matter.
* RE: [EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-12 14:25 ` Keith Busch
2024-11-12 16:50 ` Christoph Hellwig
@ 2024-11-12 18:18 ` Pierre Labat
2024-11-13 4:47 ` Christoph Hellwig
1 sibling, 1 reply; 28+ messages in thread
From: Pierre Labat @ 2024-11-12 18:18 UTC (permalink / raw)
To: Keith Busch, Christoph Hellwig
Cc: Kanchan Joshi, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
My 2 cents.
Overall, it seems to me that the difficulty here comes from 2 things:
1) The write hints may have different semantics (temperature, FDP placement, and whatever will come next).
2) Different software layers may want to use the hints, and if several do that at the same time on the same storage that may result in a mess.
About 1)
Seems to me that having a different interface for each semantic is overkill: extra code to maintain, and extra work when a new semantic comes along.
To keep things simple, keep one set of interfaces (per-IO interface, per-file interface) for all write hint semantics, and carry the difference in semantics in the hint itself.
For example, with 32-bit hints, store the semantic in 8 bits and use the rest in the context of that semantic.
The storage transport driver (the nvme driver, for example), based on the 8-bit semantic in the write hint, translates the write hint appropriately for the storage device.
The storage driver can support several translations, one for each semantic supported. Linux doesn't need to yank out a translation to replace it with another/new one.
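This encoding could be sketched as follows; the semantic identifiers and helpers are hypothetical, made up here for illustration and do not exist in the kernel:

```c
/* Hypothetical packing of the proposal above: 8 bits of semantic
 * selector plus 24 bits interpreted in that semantic's context. */
#include <assert.h>
#include <stdint.h>

enum hint_semantic {
	HINT_SEM_NONE     = 0,
	HINT_SEM_LIFETIME = 1,	/* classic temperature hints */
	HINT_SEM_FDP      = 2,	/* FDP placement identifiers */
};

static inline uint32_t hint_pack(enum hint_semantic sem, uint32_t value)
{
	return (uint32_t)sem << 24 | (value & 0xffffffu);
}

static inline enum hint_semantic hint_sem(uint32_t hint)
{
	return hint >> 24;
}

static inline uint32_t hint_value(uint32_t hint)
{
	return hint & 0xffffffu;
}
```

A driver would then switch on hint_sem() and apply the matching translation, e.g. mapping HINT_SEM_FDP values to placement identifiers while passing HINT_SEM_LIFETIME values through the existing temperature path.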
About 2)
Provide a simple way for the user to decide which layer generates write hints.
As an example, as some of you pointed out, what if the filesystem wants to generate write hints to optimize its [own] data handling by the storage, and at the same time the application using the FS understands the storage and also wants to optimize using write hints?
Both use cases are legit, I think.
To handle that in a simple way, why not have a filesystem mount parameter enabling/disabling the use of write hints by the FS?
In the case of an application not needing/wanting to use write hints on its own, the user would mount the filesystem enabling generation of write hints. That could be the default.
On the contrary, if the user decides it is best for one application to directly generate write hints to get the best performance, then mount the filesystem disabling the generation of write hints by the FS. The FS acts as a passthrough regarding write hints.
Regards,
Pierre
> -----Original Message-----
> From: Keith Busch <[email protected]>
> Sent: Tuesday, November 12, 2024 6:26 AM
> To: Christoph Hellwig <[email protected]>
> Cc: Kanchan Joshi <[email protected]>; Keith Busch
> <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]
> Subject: [EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
>
> On Tue, Nov 12, 2024 at 02:34:39PM +0100, Christoph Hellwig wrote:
> > On Tue, Nov 12, 2024 at 06:56:25PM +0530, Kanchan Joshi wrote:
> > > IMO, passthrough propagation of hints/streams should continue to
> > > remain the default behavior as it applies on multiple filesystems.
> > > And more active placement by FS should rather be enabled by some opt
> > > in (e.g., mount option). Such opt in will anyway be needed for other
> > > reasons (like regression avoidance on a broken device).
> >
> > I feel like banging my head against the wall. No, passing through
> > write streams is simply not acceptable without the file system being
> > in control. I've said and explained this in detail about a dozen
> > times, and the file system actually needing to do data separation for
> > its own purposes doesn't go away by ignoring it.
>
> But that's just an ideological decision that doesn't jibe with how people use
> these. The applications know how they use their data better than the
> filesystem, so putting the filesystem in the way to force streams to look like
> zones is just an unnecessary layer of indirection getting in the way.
* Re: [EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-12 18:18 ` [EXT] " Pierre Labat
@ 2024-11-13 4:47 ` Christoph Hellwig
2024-11-13 23:51 ` Dave Chinner
0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2024-11-13 4:47 UTC (permalink / raw)
To: Pierre Labat
Cc: Keith Busch, Christoph Hellwig, Kanchan Joshi, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On Tue, Nov 12, 2024 at 06:18:21PM +0000, Pierre Labat wrote:
> Overall, it seems to me that the difficulty here comes from 2 things:
> 1) The write hints may have different semantics (temperature, FDP placement, and whatever will come next).
Or rather trying to claim all these are "write hints" is simply the wrong
approach :)
> 2) Different software layers may want to use the hints, and if several do that at the same time on the same storage that may result in a mess.
That's a very nice but almost too harmless way to phrase it.
> About 1)
> Seems to me that having a different interface for each semantic is overkill: extra code to maintain, and extra work when a new semantic comes along.
> To keep things simple, keep one set of interfaces (per-IO interface, per-file interface) for all write hint semantics, and carry the difference in semantics in the hint itself.
> For example, with 32-bit hints, store the semantic in 8 bits and use the rest in the context of that semantic.
This is very similar to what Kanchan's never-followed-up-on version did.
I think this is a lot better than blindly overloading a generic
"write hint", but it still suffers from problems:
a) the code is a lot more complex and harder to maintain than just two
different values
b) it still keeps the idea that a simple temperature hint and write
stream or placement identifiers are something comparable, which they
really aren't.
> About 2)
> Provide a simple way for the user to decide which layer generates write hints.
> As an example, as some of you pointed out, what if the filesystem wants to generate write hints to optimize its [own] data handling by the storage, and at the same time the application using the FS understands the storage and also wants to optimize using write hints?
> Both use cases are legit, I think.
> To handle that in a simple way, why not have a filesystem mount parameter enabling/disabling the use of write hints by the FS?
The file system is, and always has been, the entity in charge of
resource allocation of the underlying device. Bypassing it will get
you in trouble, and a simple mount option isn't really changing that
(it's also not exactly a scalable interface).
If an application wants to micro-manage placement decisions it should not
use a file system, or at least not a normal one with Posix semantics.
That being said, we've demonstrated that applications using proper grouping
of data by file and the simple temperature hints can get very good results
from file systems that can interpret them, without a lot of work in the
file system. I suspect that for most applications that actually want files
this is going to give better results than the micro-management that
tries to bypass the file system.
I'm not sure if Keith was just ranting last night, but IFF the assumption
here is that file systems are just used as dumb containers and applications
manage device-level placement inside them, we have a much deeper problem
than just interface semantics.
* Re: [EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-13 4:47 ` Christoph Hellwig
@ 2024-11-13 23:51 ` Dave Chinner
2024-11-14 3:09 ` Martin K. Petersen
2024-11-14 6:07 ` Christoph Hellwig
0 siblings, 2 replies; 28+ messages in thread
From: Dave Chinner @ 2024-11-13 23:51 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Pierre Labat, Keith Busch, Kanchan Joshi, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On Wed, Nov 13, 2024 at 05:47:36AM +0100, Christoph Hellwig wrote:
> On Tue, Nov 12, 2024 at 06:18:21PM +0000, Pierre Labat wrote:
> > About 2)
> > Provide a simple way for the user to decide which layer generates write hints.
> > As an example, as some of you pointed out, what if the filesystem wants to generate write hints to optimize its [own] data handling by the storage, and at the same time the application using the FS understands the storage and also wants to optimize using write hints?
> > Both use cases are legit, I think.
> > To handle that in a simple way, why not have a filesystem mount parameter enabling/disabling the use of write hints by the FS?
>
> The file system is, and always has been, the entity in charge of
> resource allocation of the underlying device. Bypassing it will get
> you in trouble, and a simple mount option isn't really changing that
> (it's also not exactly a scalable interface).
>
> If an application wants to micro-manage placement decisions it should not
> use a file system, or at least not a normal one with Posix semantics.
> That being said, we've demonstrated that applications using proper grouping
> of data by file and the simple temperature hints can get very good results
> from file systems that can interpret them, without a lot of work in the
> file system. I suspect that for most applications that actually want files
> this is going to give better results than the micro-management that
> tries to bypass the file system.
This.
The most important thing that filesystems do behind the scenes is
manage -data locality-. XFS has thousands of lines of code to manage
and control data locality - the allocation policy API itself has
dozens of control parameters. We have 2 separate allocation
architectures (one btree based, one bitmap based) and multiple
locality policy algorithms. These juggle physical alignment, size
granularity, size limits, the data type being allocated for, desired
locality targets, different search algorithms (e.g. first fit, best
fit, exact fit by size or location, etc.), multiple fallback
strategies when the initial target cannot be met, etc.
Allocation policy management is the core of every block based
filesystem that has ever been written.
Specifically to this "stream hint" discussion: go look at the XFS
filestreams allocator.
SGI wrote an entirely new allocator for XFS whose only purpose in
life is to automatically separate individual streams of user data
into physically separate regions of LBA space.
This was written to optimise realtime ingest and playback of
multiple uncompressed 4k and 8k video data streams from big
isochronous SAN storage arrays back in ~2005. Each stream could be
up to 1.2GB/s of data. If the data for each IO was not exactly
placed in alignment with the storage array stripe cache granularity
(2MB, IIRC), then a cache miss would occur and the IO latency would
be too high and frames of data would be missed/dropped.
IOWs, we have an allocator in XFS that is specifically designed to
separate independent streams of data into independent regions of the
filesystem LBA space to efficiently support data IO rates on the order
of tens of GB/s.
What are we talking about now? Storage hardware that might be able
to do 10-15GB/s of IO that needs stream separation for efficient
management of the internal storage resources.
The fact we have previously solved this class of stream separation
problem at the filesystem level *without needing a user-controlled
API at all* is probably the most relevant fact missing from this
discussion.
As to the concern about stream/temp/hint translation consistency
across different hardware: the filesystem is the perfect place to
provide this abstraction to users. The block device can expose what
it supports, the user API can be fixed, and the filesystem can
provide the mapping between the two that won't change for the life
of the filesystem...
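The fixed-API/device-capability split described above could be sketched like this, with all names hypothetical: the filesystem builds the mapping once at mount time from whatever the block device advertises, and it then stays stable for the life of the filesystem:

```c
/* Hypothetical sketch: a fixed five-bucket user API mapped at mount
 * time onto however many write streams the device actually exposes. */
#include <assert.h>

#define NR_HINT_BUCKETS 5	/* none/short/medium/long/extreme */

struct stream_map {
	unsigned short map[NR_HINT_BUCKETS];
};

/* Built once at mount; dev_streams would come from the queue limits. */
static void stream_map_init(struct stream_map *m,
			    unsigned short dev_streams)
{
	m->map[0] = 0;		/* "none" never gets a stream */
	for (int i = 1; i < NR_HINT_BUCKETS; i++)
		m->map[i] = dev_streams ? (i - 1) % dev_streams + 1 : 0;
}

static unsigned short hint_to_stream(const struct stream_map *m,
				     unsigned int bucket)
{
	return bucket < NR_HINT_BUCKETS ? m->map[bucket] : 0;
}
```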
Long story short: Christoph is right.
The OS hints/streams API needs to be aligned to the capabilities
that filesystems already provide *as a primary design goal*. What
the new hardware might support is a secondary concern. i.e. hardware
driven software design is almost always a mistake: define the user
API and abstractions first, then the OS can reduce it sanely down to
what the specific hardware present is capable of supporting.
-Dave.
--
Dave Chinner
[email protected]
* Re: [EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-13 23:51 ` Dave Chinner
@ 2024-11-14 3:09 ` Martin K. Petersen
2024-11-14 6:07 ` Christoph Hellwig
1 sibling, 0 replies; 28+ messages in thread
From: Martin K. Petersen @ 2024-11-14 3:09 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Pierre Labat, Keith Busch, Kanchan Joshi,
Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
Dave,
> The OS hints/streams API needs to be aligned to the capabilities that
> filesystems already provide *as a primary design goal*. What the new
> hardware might support is a secondary concern. i.e. hardware driven
> software design is almost always a mistake: define the user API and
> abstractions first, then the OS can reduce it sanely down to what the
> specific hardware present is capable of supporting.
I agree completely.
--
Martin K. Petersen Oracle Linux Engineering
* Re: [EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-13 23:51 ` Dave Chinner
2024-11-14 3:09 ` Martin K. Petersen
@ 2024-11-14 6:07 ` Christoph Hellwig
2024-11-15 16:28 ` Keith Busch
1 sibling, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2024-11-14 6:07 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Pierre Labat, Keith Busch, Kanchan Joshi,
Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On Thu, Nov 14, 2024 at 10:51:09AM +1100, Dave Chinner wrote:
> Specifically to this "stream hint" discussion: go look at the XFS
> filestreams allocator.
Those are really two different things - filestreams is about how to
locate data into a single hardware write stream. SCSI/NVMe streams,
including streams 2.0 (FDP) that this thread is about, are about telling
the hardware about these streams, and also about allowing the file
systems (or other users of the storage) to pack data into the actual
underlying hardware boundaries that matter for deleting/overwriting
this data.
Funnily enough Hans and I were just recently brainstorming on how to
tie both together for the zoned xfs work.
> SGI wrote an entirely new allocator for XFS whose only purpose in
> life is to automatically separate individual streams of user data
> into physically separate regions of LBA space.
One of the interesting quirks of streams/FDP is that they operate
on (semi-)physical data placement that is completely decoupled from LBA
space. So if a file system or application really wants to track LBA
vs physical allocations separately it gives them a way to do that.
I don't really know what the advantage of having to track both is, but
people smarter than me might have good uses for it.
* Re: [EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-14 6:07 ` Christoph Hellwig
@ 2024-11-15 16:28 ` Keith Busch
2024-11-15 16:53 ` Christoph Hellwig
0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2024-11-15 16:28 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, Pierre Labat, Kanchan Joshi, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On Thu, Nov 14, 2024 at 07:07:10AM +0100, Christoph Hellwig wrote:
> On Thu, Nov 14, 2024 at 10:51:09AM +1100, Dave Chinner wrote:
> > SGI wrote an entirely new allocator for XFS whose only purpose in
> > life is to automatically separate individual streams of user data
> > into physically separate regions of LBA space.
>
> One of the interesting quirks of streams/FDP is that they operate
> on (semi-)physical data placement that is completely decoupled from LBA
> space.
That's not particularly interesting about FDP. All conventional NAND
SSDs operate that way. FDP just reports more stuff because that's what
people kept asking for. But it doesn't require you to fundamentally change
how you access it. You've singled out FDP to force a sequential write
requirement that requires unique support from every filesystem
despite the feature not needing that.
We have demonstrated a 40% reduction in write amplification from doing the
simplest possible thing that doesn't require any filesystem or
kernel-user ABI changes at all. You might think that's not significant
enough to let people realize those gains without more invasive block
stack changes, but you may not be buying NAND in bulk if that's the case.
* Re: [EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-15 16:28 ` Keith Busch
@ 2024-11-15 16:53 ` Christoph Hellwig
2024-11-18 23:37 ` Keith Busch
0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2024-11-15 16:53 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Dave Chinner, Pierre Labat, Kanchan Joshi,
Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On Fri, Nov 15, 2024 at 09:28:05AM -0700, Keith Busch wrote:
> SSDs operate that way. FDP just reports more stuff because that's what
> people kept asking for. But it doesn't require you to fundamentally change
> how you access it. You've singled out FDP to force a sequential write
> requirement that requires unique support from every filesystem
> despite the feature not needing that.
No I haven't. If you think so you are fundamentally misunderstanding
what I'm saying.
> We have demonstrated a 40% reduction in write amplification from doing the
> simplest possible thing that doesn't require any filesystem or
> kernel-user ABI changes at all. You might think that's not significant
> enough to let people realize those gains without more invasive block
> stack changes, but you may not be buying NAND in bulk if that's the case.
And as iterated multiple times you are doing that by bypassing the
file system layer in a forceful way that breaks all abstractions and
makes your feature unavailable for file systems.
I've also thrown you a nugget by first explaining and then even writing
prototype code to show how you get what you want while using the proper
abstractions. But instead of picking up on that you just whine like
this. Either spend a little bit of effort to actually get the interface
right or just shut up.
* Re: [EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-15 16:53 ` Christoph Hellwig
@ 2024-11-18 23:37 ` Keith Busch
2024-11-19 7:15 ` Christoph Hellwig
0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2024-11-18 23:37 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, Pierre Labat, Kanchan Joshi, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On Fri, Nov 15, 2024 at 05:53:48PM +0100, Christoph Hellwig wrote:
> On Fri, Nov 15, 2024 at 09:28:05AM -0700, Keith Busch wrote:
> > SSDs operate that way. FDP just reports more stuff because that's what
> > people kept asking for. But it doesn't require you to fundamentally change
> > how you access it. You've singled out FDP to force a sequential write
> > requirement that requires unique support from every filesystem
> > despite the feature not needing that.
>
> No I haven't. If you think so you are fundamentally misunderstanding
> what I'm saying.
We have an API that has existed for 10+ years. You are gatekeeping that
interface by declaring NVMe's FDP is not allowed to use it. Do I have
that wrong? You initially blocked this because you didn't like how the
spec committee worked. Now you've shifted to trying to pretend FDP
devices require explicit filesystem handholding that was explicitly NOT
part of that protocol.
> > We have demonstrated a 40% reduction in write amplification from doing the
> > simplest possible thing that doesn't require any filesystem or
> > kernel-user ABI changes at all. You might think that's not significant
> > enough to let people realize those gains without more invasive block
> > stack changes, but you may not be buying NAND in bulk if that's the case.
>
> And as iterated multiple times you are doing that by bypassing the
> file system layer in a forceful way that breaks all abstractions and
> makes your feature unavailable for file systems.
Your filesystem layering breaks the abstraction and capabilities the
drives are providing. You're doing more harm than good trying to game
how the media works here.
> I've also thrown you a nugget by first explaining and then even writing
> prototype code to show how you get what you want while using the proper
> abstractions.
Oh, the untested prototype that wasn't posted to any mailing list for
a serious review? The one that forces FDP to subscribe to the zoned
interface only for XFS, despite these devices being squarely in the
"conventional" SSD category and absolutely NOT zoned devices? Despite
my having other users on other filesystems successfully using the
existing interfaces that your prototype doesn't do a thing for? Yah, thanks...
I appreciate you put the time into getting your thoughts into actual
code and it does look very valuable for ACTUAL ZONE block devices. But
it seems to have missed the entire point of what this hardware feature
does. If you're doing low level media garbage collection with FDP and
tracking fake media write pointers, then you're doing it wrong. Please
use Open Channel and ZNS SSDs if you want that interface and stop
gatekeeping the EXISTING interface that has proven value in production
software today.
> But instead of picking up on that you just whine like
> this. Either spend a little bit of effort to actually get the interface
> right or just shut up.
Why the fuck should I make an effort to improve your pet project that
I don't have a customer for? They want to use the interface that was
created 10 years ago, exactly for the reason it was created, and no one
wants to introduce the risks of an untested and unproven, major and
invasive filesystem and block stack change in the kernel in the near
term!
* Re: [EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-18 23:37 ` Keith Busch
@ 2024-11-19 7:15 ` Christoph Hellwig
2024-11-20 17:21 ` Darrick J. Wong
0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2024-11-19 7:15 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Dave Chinner, Pierre Labat, Kanchan Joshi,
Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On Mon, Nov 18, 2024 at 04:37:08PM -0700, Keith Busch wrote:
> We have an API that has existed for 10+ years. You are gatekeeping that
> interface by declaring NVMe's FDP is not allowed to use it. Do I have
> that wrong? You initially blocked this because you didn't like how the
> spec committee worked. Now you've shifted to trying to pretend FDP
> devices require explicit filesystem handholding that was explicitly NOT
> part of that protocol.
I'm not fucking gatekeeping anything, and I'm really tired of this claim
with absolutely no facts backing it up.
> > And as iterated multiple times you are doing that by bypassing the
> > file system layer in a forceful way that breaks all abstractions and
> > makes your feature unavailable for file systems.
>
> Your filesystem layering breaks the abstraction and capabilities the
> drives are providing. You're doing more harm than good trying to game
> how the media works here.
How so?
> > I've also thrown you a nugget by first explaining and then even writing
> > prototype code to show how you get what you want while using the proper
> > abstractions.
>
> Oh, the untested prototype that wasn't posted to any mailing list for
> a serious review? The one that forces FDP to subscribe to the zoned
> interface only for XFS, despite these devices being squarely in the
> "conventional" SSD category and absolutely NOT zoned devices? Despite
> my having other users on other filesystems successfully using the
> existing interfaces that your prototype doesn't do a thing for? Yah, thanks...
What zoned interface to FDP?
The exposed interface is to:
a) pick a write stream
b) expose the size of the reclaim unit
not done yet, but needed for good operation:
c) expose how much capacity in a reclaim unit has been written
This is about as good as it gets to map the FDP (and to a lesser extent
streams) interface to an abstract block layer API. If you have a better
suggestion to actually expose these capabilities I'm all ears.
Now _my_ preferred use of that interface is a write-out-of-place,
map-LBA-regions-to-physical-reclaim-blocks file system. On the one
hand because it actually fits the file system I'm writing, on the other
hand because industry experience has shown that this is a very good
fit for flash storage (even without any explicit placement). If you
think that's all wrong, that's fine; despite claims to the contrary from
you, absolutely nothing in the interface forced you to do that.
You can roll the dice for your LBA allocations and write them using
a secure random number generator. The interface allows for all of that,
but I doubt your results will be all that great. Not my business.
> I appreciate you put the time into getting your thoughts into actual
> code and it does look very valuable for ACTUAL ZONE block devices. But
> it seems to have missed the entire point of what this hardware feature
> does. If you're doing low level media garbage collection with FDP and
> tracking fake media write pointers, then you're doing it wrong. Please
> use Open Channel and ZNS SSDs if you want that interface and stop
> gatekeeping the EXISTING interface that has proven value in production
> software today.
Hey, feel free to come up with a better design. The whole point of a
proper block layer design is that you actually can do that!
> > But instead of picking up on that you just whine like
> > this. Either spend a little bit of effort to actually get the interface
> > right or just shut up.
>
> Why the fuck should I make an effort to improve your pet project that
> I don't have a customer for? They want to use the interface that was
> created 10 years ago, exactly for the reason it was created, and no one
> wants to introduce the risks of an untested and unproven, major and
> invasive filesystem and block stack change in the kernel in the near
> term!
Because you apparently want an interface to FDP in the block layer. And
if you want that you need to stop bypassing the file systems, as pointed
out not just by me but also by at least one other file system maintainer
and the maintainer of the most used block subsystem. I've thrown you
some bones for how that can be done while doing everything else you did
before (at least assuming you get the fs side past the fs maintainers),
but the only thanks for that is bullshit attacks at a personal level.
* Re: [EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-19 7:15 ` Christoph Hellwig
@ 2024-11-20 17:21 ` Darrick J. Wong
2024-11-20 18:11 ` Keith Busch
0 siblings, 1 reply; 28+ messages in thread
From: Darrick J. Wong @ 2024-11-20 17:21 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Dave Chinner, Pierre Labat, Kanchan Joshi,
Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On Tue, Nov 19, 2024 at 08:15:56AM +0100, Christoph Hellwig wrote:
> On Mon, Nov 18, 2024 at 04:37:08PM -0700, Keith Busch wrote:
> > We have an API that has existed for 10+ years. You are gatekeeping that
> > interface by declaring NVMe's FDP is not allowed to use it. Do I have
> > that wrong? You initially blocked this because you didn't like how the
> > spec committee worked. Now you've shifted to trying to pretend FDP
> > devices require explicit filesystem handholding that was explicitly NOT
> > part of that protocol.
>
> I'm not fucking gatekeeping anything, and I'm really tired of this claim
> with absolutely no facts backing it up.
>
> > > And as iterated multiple times you are doing that by bypassing the
> > > file system layer in a forceful way that breaks all abstractions and
> > > makes your feature unavailable for file systems.
> >
> > Your filesystem layering breaks the abstraction and capabilities the
> > drives are providing. You're doing more harm than good trying to game
> > how the media works here.
>
> How so?
>
> > > I've also thrown you a nugget by first explaining and then even writing
> > > prototype code to show how you get what you want while using the proper
> > > abstractions.
> >
> > Oh, the untested prototype that wasn't posted to any mailing list for
> > a serious review? The one that forces FDP to subscribe to the zoned
> > interface only for XFS, despite these devices being squarely in the
> > "conventional" SSD category and absolutely NOT zoned devices? Despite
> > my having other users on other filesystems successfully using the
> > existing interfaces that your prototype doesn't do a thing for? Yah, thanks...
>
> What zoned interface to FDP?
>
> The exposed interface is to:
>
> a) pick a write stream
How do filesystem users pick a write stream? I get a pretty strong
sense that you're aiming to provide the ability for application software
to group together a bunch of (potentially arbitrary) files in a cohort.
Then (maybe?) you can say "This cohort of files are all expected to have
data blocks related to each other in some fashion, so put them together
so that the storage doesn't have to work so hard".
Part of my comprehension problem here (and probably why few fs people
commented on this thread) is that I have no idea what FDP is, or what
the write lifetime hints in scsi were/are, or what the current "hinting"
scheme is.
Is this what we're arguing about?
enum rw_hint {
WRITE_LIFE_NOT_SET = RWH_WRITE_LIFE_NOT_SET,
WRITE_LIFE_NONE = RWH_WRITE_LIFE_NONE,
WRITE_LIFE_SHORT = RWH_WRITE_LIFE_SHORT,
WRITE_LIFE_MEDIUM = RWH_WRITE_LIFE_MEDIUM,
WRITE_LIFE_LONG = RWH_WRITE_LIFE_LONG,
WRITE_LIFE_EXTREME = RWH_WRITE_LIFE_EXTREME,
} __packed;
(What happens if you have two disjoint sets of files, both of which are
MEDIUM, but they shouldn't be intertwined?)
Or are these new fdp hint things an overload of the existing write hint
fields in the iocb/inode/bio? With a totally different meaning from
anticipated lifetime of the data blocks?
If Christoph is going where I think he's going with xfs, then maybe the
idea here is to map allocation groups on the realtime volume (i.e.
rtgroups) to a zone (or a flash channel?), along with the ability for
user programs to say "allocations to this cohort of files should all
come from the same rtgroup".
But. The cohort associations exist entirely in the filesystem, so the
filesystem gets to figure out the association between rtgroup and
whatever scsi/nvme/fdp/whatever hints it sends the hardware. The
mappings will all be to the same LBA space, but that's the filesystem's
quirk, not necessarily a signal to the storage.
Writes within a cohort will target a single rtgroup until it fills up,
at which point we pick another rtgroup and start filling that.
Eventually something comes along to gc the surviving allocations in the
old rtgroups. (Aside: the xfs filestreams thing that Dave mentioned
upthread does something kind of like this, except that user programs
cannot convey cohort associations to the filesystem.)
AFAICT the cohort associations are a file property, which the filesystem
can persist in whatever way it wants -- e.g. (ab)using the u32 xfs
project id. Is there any use in using a different cohort id for a
specific IO? I don't think there is.
I suppose if your application wants to talk directly to a block device
then I could see why you'd want to be able to send write hints or fdp
placement whatevers directly to the storage, but I don't think that's
the majority of application programs. Or to put it another way -- you
can plumb in a blockdev-only direct interface to the storage hints, but
us filesystem folks will design whatever makes the most sense for us,
and will probably not provide direct access to those hints. After
all, the whole point of xfs is to virtualize a block device.
<shrug> Just my ignorant 2c. Did that help, or did I just pour gasoline
on a flamewar?
--D
> b) expose the size of the reclaim unit
>
> not done yet, but needed for good operation:
>
> c) expose how much capacity in a reclaim unit has been written
>
> This is about as good as it gets to map the FDP (and to a lesser extent
> streams) interface to an abstract block layer API. If you have a better
> suggestion to actually expose these capabilities I'm all ears.
>
> Now _my_ preferred use of that interface is a write-out-of-place,
> map-LBA-regions-to-physical-reclaim-blocks file system. On the one
> hand because it actually fits the file system I'm writing, on the other
> hand because industry experience has shown that this is a very good
> fit for flash storage (even without any explicit placement). If you
> think that's all wrong, that's fine; despite claims to the contrary from
> you, absolutely nothing in the interface forced you to do that.
>
> You can roll the dice for your LBA allocations and write them using
> a secure random number generator. The interface allows for all of that,
> but I doubt your results will be all that great. Not my business.
>
> > I appreciate you put the time into getting your thoughts into actual
> > code and it does look very valuable for ACTUAL ZONE block devices. But
> > it seems to have missed the entire point of what this hardware feature
> > does. If you're doing low level media garbage collection with FDP and
> > tracking fake media write pointers, then you're doing it wrong. Please
> > use Open Channel and ZNS SSDs if you want that interface and stop
> > gatekeeping the EXISTING interface that has proven value in production
> > software today.
>
> Hey, feel free to come up with a better design. The whole point of a
> proper block layer design is that you actually can do that!
>
> > > But instead of picking up on that you just whine like
> > > this. Either spend a little bit of effort to actually get the interface
> > > right or just shut up.
> >
> > Why the fuck should I make an effort to improve your pet project that
> > I don't have a customer for? They want to use the interface that was
> > created 10 years ago, exactly for the reason it was created, and no one
> > wants to introduce the risks of an untested and unproven, major and
> > invasive filesystem and block stack change in the kernel in the near
> > term!
>
> Because you apparently want an interface to FDP in the block layer. And
> if you want that you need to stop bypassing the file systems, as pointed
> out not just by me but also by at least one other file system maintainer
> and the maintainer of the most used block subsystem. I've thrown you
> some bones for how that can be done while doing everything else you did
> before (at least assuming you get the fs side past the fs maintainers),
> but the only thanks for that is bullshit attacks at a personal level.
>
* Re: [EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-20 17:21 ` Darrick J. Wong
@ 2024-11-20 18:11 ` Keith Busch
2024-11-21 7:17 ` Christoph Hellwig
0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2024-11-20 18:11 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Dave Chinner, Pierre Labat, Kanchan Joshi,
Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On Wed, Nov 20, 2024 at 09:21:58AM -0800, Darrick J. Wong wrote:
>
> How do filesystem users pick a write stream? I get a pretty strong
> sense that you're aiming to provide the ability for application software
> to group together a bunch of (potentially arbitrary) files in a cohort.
> Then (maybe?) you can say "This cohort of files are all expected to have
> data blocks related to each other in some fashion, so put them together
> so that the storage doesn't have to work so hard".
>
> Part of my comprehension problem here (and probably why few fs people
> commented on this thread) is that I have no idea what FDP is, or what
> the write lifetime hints in scsi were/are, or what the current "hinting"
> scheme is.
FDP is just the "new" version of NVMe's streams. Support for its
predecessor was added in commit f5d118406247acf ("nvme: add support for
streams")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f5d118406247acfc4fc481e441e01ea4d6318fdc
Various applications were written to that interface and showed initial
promise, but production quality hardware never materialized. Some of
these applications are still setting the write hints today, and the
filesystems are all passing through the block stack, but there's just
currently no nvme driver listening on the other side.
In contrast to the older NVMe streams, capable hardware implementing
this newer FDP scheme has been developed, and so people want to use
those same applications with those same hints in the exact same way it
was originally designed. Enabling them could just be a simple driver
patch like the above, without bothering the filesystem people :)
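[Editor's note: for readers without the history handy, the decade-old interface referenced here is the fcntl() write-hint API. A minimal sketch of how an application uses it, in Python on Linux; the F_*_RW_HINT command numbers and RWH_* lifetime values are copied from include/uapi/linux/fcntl.h since they may not be exposed by libc or the Python stdlib:]

```python
import fcntl
import struct
import tempfile

# Linux-specific fcntl commands (F_LINUX_SPECIFIC_BASE = 1024);
# values taken from include/uapi/linux/fcntl.h.
F_GET_RW_HINT = 1035
F_SET_RW_HINT = 1036
RWH_WRITE_LIFE_SHORT = 2  # one of the RWH_WRITE_LIFE_* lifetime hints

def set_write_hint(fd, hint):
    """Attach a per-inode write lifetime hint; the kernel reads a u64."""
    fcntl.fcntl(fd, F_SET_RW_HINT, struct.pack("Q", hint))

def get_write_hint(fd):
    """Read back the current hint as a u64."""
    buf = fcntl.fcntl(fd, F_GET_RW_HINT, struct.pack("Q", 0))
    return struct.unpack("Q", buf)[0]

with tempfile.NamedTemporaryFile() as f:
    set_write_hint(f.fileno(), RWH_WRITE_LIFE_SHORT)
    hint = get_write_hint(f.fileno())
```

The hint is stored on the inode and flows down the block stack with subsequent writes; whether anything acts on it depends on the driver underneath, which is exactly the point being made above.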
> Is this what we're arguing about?
>
> enum rw_hint {
> WRITE_LIFE_NOT_SET = RWH_WRITE_LIFE_NOT_SET,
> WRITE_LIFE_NONE = RWH_WRITE_LIFE_NONE,
> WRITE_LIFE_SHORT = RWH_WRITE_LIFE_SHORT,
> WRITE_LIFE_MEDIUM = RWH_WRITE_LIFE_MEDIUM,
> WRITE_LIFE_LONG = RWH_WRITE_LIFE_LONG,
> WRITE_LIFE_EXTREME = RWH_WRITE_LIFE_EXTREME,
> } __packed;
>
> (What happens if you have two disjoint sets of files, both of which are
> MEDIUM, but they shouldn't be intertwined?)
It's not going to perform as well. You'd be advised against
oversubscribing the hint value among applications with different
relative expectations, but it generally (though not always) should be no
worse than if you hadn't given any hints at all.
> Or are these new fdp hint things an overload of the existing write hint
> fields in the iocb/inode/bio? With a totally different meaning from
> anticipated lifetime of the data blocks?
The meaning assigned to an FDP stream is whatever the user wants it to
mean. It's not strictly a lifetime hint, but that is certainly a valid
way to use them. The contract on the device's side is that writes to
one stream won't create media interference or contention with writes to
other streams. This is the same as nvme's original streams, which for
some reason did not carry any of this controversy.
* Re: [EXT] Re: [PATCHv11 0/9] write hints with nvme fdp and scsi streams
2024-11-20 18:11 ` Keith Busch
@ 2024-11-21 7:17 ` Christoph Hellwig
0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2024-11-21 7:17 UTC (permalink / raw)
To: Keith Busch
Cc: Darrick J. Wong, Christoph Hellwig, Dave Chinner, Pierre Labat,
Kanchan Joshi, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On Wed, Nov 20, 2024 at 11:11:12AM -0700, Keith Busch wrote:
> Various applications were written to that interface and showed initial
> promise, but production quality hardware never materialized.
FYI, production grade NVMe streams hardware did materialize and is
still shipped and used by multiple storage OEMs. Like most things in
enterprise storage you're unlikely to see it outside the firmware builds
for those few customers that actually require and QAed it.
> Some of
> these applications are still setting the write hints today, and the
> filesystems are all passing through the block stack, but there's just
> currently no nvme driver listening on the other side.
The only source-available application we could find that is using these
hints is RocksDB, which got the fcntl interface wrong so that it did not
have a chance to actually work until Hans fixed it recently. Once he
fixed it, it shows great results when using file system based hinting,
although it still needs tuning to align its internal file size to
the hardware reclaim unit size, i.e. it either needs behind-the-scenes
knowledge or an improved interface to be properly optimized.
> The meaning assigned to an FDP stream is whatever the user wants it to
> mean. It's not strictly a lifetime hint, but that is certainly a valid
> way to use them. The contract on the device's side is that writes to
> one stream won't create media interference or contention with writes to
> other streams. This is the same as nvme's original streams, which for
> some reason did not carry any of this controversy.
Maybe people realized how badly that works outside a few very special
purpose uses?
I've said it before, but if you really do want to bypass the file
systems (and there are very good reasons for that for some workloads),
bypass them entirely. Don't try to micro-manage the layer below the
file system from the application without any chance for the file system
to even be in the know.
The entire discussion also seems to treat file systems as simple
containers for blocks that are static. While that is roughly true
for a lot of older file system designs, once you implement things
like snapshots, data checksums, data journalling or in general
flash friendly metadata write patterns that is very wrong, and
the file systems will want to be able to separate write streams
independently of the pure application write streams.
end of thread, other threads:[~2024-11-21 7:17 UTC | newest]
Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-08 19:36 [PATCHv11 0/9] write hints with nvme fdp and scsi streams Keith Busch
2024-11-08 19:36 ` [PATCHv11 1/9] block: use generic u16 for write hints Keith Busch
2024-11-08 19:36 ` [PATCHv11 2/9] block: introduce max_write_hints queue limit Keith Busch
2024-11-08 19:36 ` [PATCHv11 3/9] statx: add write hint information Keith Busch
2024-11-08 19:36 ` [PATCHv11 4/9] block: allow ability to limit partition write hints Keith Busch
2024-11-08 19:36 ` [PATCHv11 5/9] block, fs: add write hint to kiocb Keith Busch
2024-11-08 19:36 ` [PATCHv11 6/9] io_uring: enable per-io hinting capability Keith Busch
2024-11-08 19:36 ` [PATCHv11 7/9] block: export placement hint feature Keith Busch
2024-11-11 10:29 ` [PATCHv11 0/9] write hints with nvme fdp and scsi streams Christoph Hellwig
2024-11-11 16:27 ` Keith Busch
2024-11-11 16:34 ` Christoph Hellwig
2024-11-12 13:26 ` Kanchan Joshi
2024-11-12 13:34 ` Christoph Hellwig
2024-11-12 14:25 ` Keith Busch
2024-11-12 16:50 ` Christoph Hellwig
2024-11-12 17:19 ` Christoph Hellwig
2024-11-12 18:18 ` [EXT] " Pierre Labat
2024-11-13 4:47 ` Christoph Hellwig
2024-11-13 23:51 ` Dave Chinner
2024-11-14 3:09 ` Martin K. Petersen
2024-11-14 6:07 ` Christoph Hellwig
2024-11-15 16:28 ` Keith Busch
2024-11-15 16:53 ` Christoph Hellwig
2024-11-18 23:37 ` Keith Busch
2024-11-19 7:15 ` Christoph Hellwig
2024-11-20 17:21 ` Darrick J. Wong
2024-11-20 18:11 ` Keith Busch
2024-11-21 7:17 ` Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox