public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCHv10 0/9] write hints with nvme fdp, scsi streams
@ 2024-10-29 15:19 Keith Busch
  2024-10-29 15:19 ` [PATCHv10 1/9] block: use generic u16 for write hints Keith Busch
                   ` (9 more replies)
  0 siblings, 10 replies; 42+ messages in thread
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

From: Keith Busch <[email protected]>

Changes from v9:

  Document the partition hint mask

  Use bitmap_alloc API

  Fixup bitmap memory leak

  Return invalid value if user requests an invalid write hint

  Added and exported a block device feature flag for indicating generic
  placement hint support

  Added statx write hint max field

  Added BUILD_BUG_ON check for new io_uring SQE fields.

  Added reviews

Kanchan Joshi (2):
  io_uring: enable per-io hinting capability
  nvme: enable FDP support

Keith Busch (7):
  block: use generic u16 for write hints
  block: introduce max_write_hints queue limit
  statx: add write hint information
  block: allow ability to limit partition write hints
  block, fs: add write hint to kiocb
  block: export placement hint feature
  scsi: set permanent stream count in block limits

 Documentation/ABI/stable/sysfs-block | 13 +++++
 block/bdev.c                         | 18 ++++++
 block/blk-settings.c                 |  5 ++
 block/blk-sysfs.c                    |  6 ++
 block/fops.c                         | 31 +++++++++-
 block/partitions/core.c              | 44 ++++++++++++++-
 drivers/nvme/host/core.c             | 84 ++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h             |  5 ++
 drivers/scsi/sd.c                    |  2 +
 fs/stat.c                            |  1 +
 include/linux/blk-mq.h               |  3 +-
 include/linux/blk_types.h            |  4 +-
 include/linux/blkdev.h               | 15 +++++
 include/linux/fs.h                   |  1 +
 include/linux/nvme.h                 | 19 +++++++
 include/linux/stat.h                 |  1 +
 include/uapi/linux/io_uring.h        |  4 ++
 include/uapi/linux/stat.h            |  3 +-
 io_uring/io_uring.c                  |  2 +
 io_uring/rw.c                        |  3 +-
 20 files changed, 253 insertions(+), 11 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCHv10 1/9] block: use generic u16 for write hints
  2024-10-29 15:19 [PATCHv10 0/9] write hints with nvme fdp, scsi streams Keith Busch
@ 2024-10-29 15:19 ` Keith Busch
  2024-10-29 17:21   ` Bart Van Assche
  2024-10-29 15:19 ` [PATCHv10 2/9] block: introduce max_write_hints queue limit Keith Busch
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

From: Keith Busch <[email protected]>

This is still backwards compatible with lifetime hints. It just doesn't
constrain the hints to that definition. Using this type doesn't change
the size of either bio or request.

Signed-off-by: Keith Busch <[email protected]>
---
 include/linux/blk-mq.h    | 3 +--
 include/linux/blk_types.h | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2035fad3131fb..08ed7b5c4dbbf 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -8,7 +8,6 @@
 #include <linux/scatterlist.h>
 #include <linux/prefetch.h>
 #include <linux/srcu.h>
-#include <linux/rw_hint.h>
 
 struct blk_mq_tags;
 struct blk_flush_queue;
@@ -156,7 +155,7 @@ struct request {
 	struct blk_crypto_keyslot *crypt_keyslot;
 #endif
 
-	enum rw_hint write_hint;
+	unsigned short write_hint;
 	unsigned short ioprio;
 
 	enum mq_rq_state state;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index dce7615c35e7e..6737795220e18 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -10,7 +10,6 @@
 #include <linux/bvec.h>
 #include <linux/device.h>
 #include <linux/ktime.h>
-#include <linux/rw_hint.h>
 
 struct bio_set;
 struct bio;
@@ -219,7 +218,7 @@ struct bio {
 						 */
 	unsigned short		bi_flags;	/* BIO_* below */
 	unsigned short		bi_ioprio;
-	enum rw_hint		bi_write_hint;
+	unsigned short		bi_write_hint;
 	blk_status_t		bi_status;
 	atomic_t		__bi_remaining;
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCHv10 2/9] block: introduce max_write_hints queue limit
  2024-10-29 15:19 [PATCHv10 0/9] write hints with nvme fdp, scsi streams Keith Busch
  2024-10-29 15:19 ` [PATCHv10 1/9] block: use generic u16 for write hints Keith Busch
@ 2024-10-29 15:19 ` Keith Busch
  2024-10-29 15:19 ` [PATCHv10 3/9] statx: add write hint information Keith Busch
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 42+ messages in thread
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

From: Keith Busch <[email protected]>

Drivers with hardware that support write streams need a way to export how
many are available so applications can generically query this.

Signed-off-by: Keith Busch <[email protected]>
---
 Documentation/ABI/stable/sysfs-block |  7 +++++++
 block/blk-settings.c                 |  3 +++
 block/blk-sysfs.c                    |  3 +++
 include/linux/blkdev.h               | 12 ++++++++++++
 4 files changed, 25 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 8353611107154..f2db2cabb8e75 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -506,6 +506,13 @@ Description:
 		[RO] Maximum size in bytes of a single element in a DMA
 		scatter/gather list.
 
+What:		/sys/block/<disk>/queue/max_write_hints
+Date:		October 2024
+Contact:	[email protected]
+Description:
+		[RO] Maximum number of write hints supported, 0 if not
+		supported. If supported, valid values are 1 through
+		max_write_hints, inclusive.
 
 What:		/sys/block/<disk>/queue/max_segments
 Date:		March 2010
diff --git a/block/blk-settings.c b/block/blk-settings.c
index a446654ddee5e..921fb4d334fa4 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -43,6 +43,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
 
 	/* Inherit limits from component devices */
+	lim->max_write_hints = USHRT_MAX;
 	lim->max_segments = USHRT_MAX;
 	lim->max_discard_segments = USHRT_MAX;
 	lim->max_hw_sectors = UINT_MAX;
@@ -544,6 +545,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_segment_size = min_not_zero(t->max_segment_size,
 					   b->max_segment_size);
 
+	t->max_write_hints = min(t->max_write_hints, b->max_write_hints);
+
 	alignment = queue_limit_alignment_offset(b, start);
 
 	/* Bottom device has different alignment.  Check that it is
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 741b95dfdbf6f..85f48ca461049 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -104,6 +104,7 @@ QUEUE_SYSFS_LIMIT_SHOW(max_segments)
 QUEUE_SYSFS_LIMIT_SHOW(max_discard_segments)
 QUEUE_SYSFS_LIMIT_SHOW(max_integrity_segments)
 QUEUE_SYSFS_LIMIT_SHOW(max_segment_size)
+QUEUE_SYSFS_LIMIT_SHOW(max_write_hints)
 QUEUE_SYSFS_LIMIT_SHOW(logical_block_size)
 QUEUE_SYSFS_LIMIT_SHOW(physical_block_size)
 QUEUE_SYSFS_LIMIT_SHOW(chunk_sectors)
@@ -457,6 +458,7 @@ QUEUE_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb");
 QUEUE_RO_ENTRY(queue_max_segments, "max_segments");
 QUEUE_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
 QUEUE_RO_ENTRY(queue_max_segment_size, "max_segment_size");
+QUEUE_RO_ENTRY(queue_max_write_hints, "max_write_hints");
 QUEUE_RW_LOAD_MODULE_ENTRY(elv_iosched, "scheduler");
 
 QUEUE_RO_ENTRY(queue_logical_block_size, "logical_block_size");
@@ -591,6 +593,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_max_discard_segments_entry.attr,
 	&queue_max_integrity_segments_entry.attr,
 	&queue_max_segment_size_entry.attr,
+	&queue_max_write_hints_entry.attr,
 	&queue_hw_sector_size_entry.attr,
 	&queue_logical_block_size_entry.attr,
 	&queue_physical_block_size_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d0a52ed05e60c..90a19ea889276 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -394,6 +394,8 @@ struct queue_limits {
 	unsigned short		max_integrity_segments;
 	unsigned short		max_discard_segments;
 
+	unsigned short		max_write_hints;
+
 	unsigned int		max_open_zones;
 	unsigned int		max_active_zones;
 
@@ -1189,6 +1191,11 @@ static inline unsigned short queue_max_segments(const struct request_queue *q)
 	return q->limits.max_segments;
 }
 
+static inline unsigned short queue_max_write_hints(struct request_queue *q)
+{
+	return q->limits.max_write_hints;
+}
+
 static inline unsigned short queue_max_discard_segments(const struct request_queue *q)
 {
 	return q->limits.max_discard_segments;
@@ -1236,6 +1243,11 @@ static inline unsigned int bdev_max_segments(struct block_device *bdev)
 	return queue_max_segments(bdev_get_queue(bdev));
 }
 
+static inline unsigned short bdev_max_write_hints(struct block_device *bdev)
+{
+	return queue_max_write_hints(bdev_get_queue(bdev));
+}
+
 static inline unsigned queue_logical_block_size(const struct request_queue *q)
 {
 	return q->limits.logical_block_size;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCHv10 3/9] statx: add write hint information
  2024-10-29 15:19 [PATCHv10 0/9] write hints with nvme fdp, scsi streams Keith Busch
  2024-10-29 15:19 ` [PATCHv10 1/9] block: use generic u16 for write hints Keith Busch
  2024-10-29 15:19 ` [PATCHv10 2/9] block: introduce max_write_hints queue limit Keith Busch
@ 2024-10-29 15:19 ` Keith Busch
  2024-10-29 15:19 ` [PATCHv10 4/9] block: allow ability to limit partition write hints Keith Busch
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 42+ messages in thread
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

From: Keith Busch <[email protected]>

If requested on a raw block device, report the maximum write hint the
block device supports.

Signed-off-by: Keith Busch <[email protected]>
---
 block/bdev.c              | 5 +++++
 fs/stat.c                 | 1 +
 include/linux/stat.h      | 1 +
 include/uapi/linux/stat.h | 3 ++-
 4 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/block/bdev.c b/block/bdev.c
index 738e3c8457e7f..9a59f0c882170 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1296,6 +1296,11 @@ void bdev_statx(struct path *path, struct kstat *stat,
 		stat->result_mask |= STATX_DIOALIGN;
 	}
 
+	if (request_mask & STATX_WRITE_HINT) {
+		stat->write_hint_max = bdev_max_write_hints(bdev);
+		stat->result_mask |= STATX_WRITE_HINT;
+	}
+
 	if (request_mask & STATX_WRITE_ATOMIC && bdev_can_atomic_write(bdev)) {
 		struct request_queue *bd_queue = bdev->bd_queue;
 
diff --git a/fs/stat.c b/fs/stat.c
index 41e598376d7e3..60bcd5c2e2a1d 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -704,6 +704,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 	tmp.stx_atomic_write_unit_min = stat->atomic_write_unit_min;
 	tmp.stx_atomic_write_unit_max = stat->atomic_write_unit_max;
 	tmp.stx_atomic_write_segments_max = stat->atomic_write_segments_max;
+	tmp.stx_write_hint_max = stat->write_hint_max;
 
 	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
 }
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 3d900c86981c5..48f0f64846a02 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -57,6 +57,7 @@ struct kstat {
 	u32		atomic_write_unit_min;
 	u32		atomic_write_unit_max;
 	u32		atomic_write_segments_max;
+	u32		write_hint_max;
 };
 
 /* These definitions are internal to the kernel for now. Mainly used by nfsd. */
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 887a252864416..10f5622c21113 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -132,7 +132,7 @@ struct statx {
 	__u32	stx_atomic_write_unit_max;	/* Max atomic write unit in bytes */
 	/* 0xb0 */
 	__u32   stx_atomic_write_segments_max;	/* Max atomic write segment count */
-	__u32   __spare1[1];
+	__u32   stx_write_hint_max;
 	/* 0xb8 */
 	__u64	__spare3[9];	/* Spare space for future expansion */
 	/* 0x100 */
@@ -164,6 +164,7 @@ struct statx {
 #define STATX_MNT_ID_UNIQUE	0x00004000U	/* Want/got extended stx_mount_id */
 #define STATX_SUBVOL		0x00008000U	/* Want/got stx_subvol */
 #define STATX_WRITE_ATOMIC	0x00010000U	/* Want/got atomic_write_* fields */
+#define STATX_WRITE_HINT	0x00020000U	/* Want/got write_hint_max */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCHv10 4/9] block: allow ability to limit partition write hints
  2024-10-29 15:19 [PATCHv10 0/9] write hints with nvme fdp, scsi streams Keith Busch
                   ` (2 preceding siblings ...)
  2024-10-29 15:19 ` [PATCHv10 3/9] statx: add write hint information Keith Busch
@ 2024-10-29 15:19 ` Keith Busch
  2024-10-29 15:23   ` Christoph Hellwig
  2024-10-29 17:25   ` Bart Van Assche
  2024-10-29 15:19 ` [PATCHv10 5/9] block, fs: add write hint to kiocb Keith Busch
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 42+ messages in thread
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

From: Keith Busch <[email protected]>

When multiple partitions are used, you may want to enforce different
subsets of the available write hints for each partition. Provide a
bitmap attribute of the available write hints, and allow an admin to
write a different mask to set the partition's allowed write hints.

Signed-off-by: Keith Busch <[email protected]>
---
 Documentation/ABI/stable/sysfs-block |  6 ++++
 block/bdev.c                         | 13 ++++++++
 block/partitions/core.c              | 44 ++++++++++++++++++++++++++--
 include/linux/blk_types.h            |  1 +
 4 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index f2db2cabb8e75..8ab9f030e61d1 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -187,6 +187,12 @@ Description:
 		partition is offset from the internal allocation unit's
 		natural alignment.
 
+What:		/sys/block/<disk>/<partition>/write_hint_mask
+Date:		October 2024
+Contact:	[email protected]
+Description:
+		The mask of allowed write hints. You can limit which hints the
+		block layer will use by writing a new mask.
 
 What:		/sys/block/<disk>/<partition>/stat
 Date:		February 2008
diff --git a/block/bdev.c b/block/bdev.c
index 9a59f0c882170..0c5e87b111aed 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -415,6 +415,7 @@ void __init bdev_cache_init(void)
 struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 {
 	struct block_device *bdev;
+	unsigned short write_hint;
 	struct inode *inode;
 
 	inode = new_inode(blockdev_superblock);
@@ -440,6 +441,18 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 		return NULL;
 	}
 	bdev->bd_disk = disk;
+
+	write_hint = bdev_max_write_hints(bdev);
+	if (write_hint) {
+		bdev->write_hint_mask = bitmap_alloc(write_hint, GFP_KERNEL);
+		if (!bdev->write_hint_mask) {
+			free_percpu(bdev->bd_stats);
+			iput(inode);
+			return NULL;
+		}
+		bitmap_set(bdev->write_hint_mask, 0, write_hint);
+	}
+
 	return bdev;
 }
 
diff --git a/block/partitions/core.c b/block/partitions/core.c
index 815ed33caa1b8..24e1a79972f75 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -203,6 +203,40 @@ static ssize_t part_discard_alignment_show(struct device *dev,
 	return sprintf(buf, "%u\n", bdev_discard_alignment(dev_to_bdev(dev)));
 }
 
+static ssize_t part_write_hint_mask_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	struct block_device *bdev = dev_to_bdev(dev);
+	unsigned short max_write_hints = bdev_max_write_hints(bdev);
+
+	if (!max_write_hints)
+		return sprintf(buf, "0");
+	return sprintf(buf, "%*pb\n", max_write_hints, bdev->write_hint_mask);
+}
+
+static ssize_t part_write_hint_mask_store(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct block_device *bdev = dev_to_bdev(dev);
+	unsigned short max_write_hints = bdev_max_write_hints(bdev);
+	unsigned long *new_mask;
+
+	if (!max_write_hints)
+		return count;
+
+	new_mask = bitmap_alloc(max_write_hints, GFP_KERNEL);
+	if (!new_mask)
+		return -ENOMEM;
+
+	bitmap_parse(buf, count, new_mask, max_write_hints);
+	bitmap_copy(bdev->write_hint_mask, new_mask, max_write_hints);
+	bitmap_free(new_mask);
+
+	return count;
+}
+
 static DEVICE_ATTR(partition, 0444, part_partition_show, NULL);
 static DEVICE_ATTR(start, 0444, part_start_show, NULL);
 static DEVICE_ATTR(size, 0444, part_size_show, NULL);
@@ -211,6 +245,8 @@ static DEVICE_ATTR(alignment_offset, 0444, part_alignment_offset_show, NULL);
 static DEVICE_ATTR(discard_alignment, 0444, part_discard_alignment_show, NULL);
 static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
 static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
+static DEVICE_ATTR(write_hint_mask, 0644, part_write_hint_mask_show,
+		   part_write_hint_mask_store);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 static struct device_attribute dev_attr_fail =
 	__ATTR(make-it-fail, 0644, part_fail_show, part_fail_store);
@@ -225,6 +261,7 @@ static struct attribute *part_attrs[] = {
 	&dev_attr_discard_alignment.attr,
 	&dev_attr_stat.attr,
 	&dev_attr_inflight.attr,
+	&dev_attr_write_hint_mask.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 	&dev_attr_fail.attr,
 #endif
@@ -245,8 +282,11 @@ static const struct attribute_group *part_attr_groups[] = {
 
 static void part_release(struct device *dev)
 {
-	put_disk(dev_to_bdev(dev)->bd_disk);
-	bdev_drop(dev_to_bdev(dev));
+	struct block_device *part = dev_to_bdev(dev);
+
+	bitmap_free(part->write_hint_mask);
+	put_disk(part->bd_disk);
+	bdev_drop(part);
 }
 
 static int part_uevent(const struct device *dev, struct kobj_uevent_env *env)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 6737795220e18..af430e543f7f7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -73,6 +73,7 @@ struct block_device {
 #ifdef CONFIG_SECURITY
 	void			*bd_security;
 #endif
+	unsigned long		*write_hint_mask;
 	/*
 	 * keep this out-of-line as it's both big and not needed in the fast
 	 * path
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCHv10 5/9] block, fs: add write hint to kiocb
  2024-10-29 15:19 [PATCHv10 0/9] write hints with nvme fdp, scsi streams Keith Busch
                   ` (3 preceding siblings ...)
  2024-10-29 15:19 ` [PATCHv10 4/9] block: allow ability to limit partition write hints Keith Busch
@ 2024-10-29 15:19 ` Keith Busch
  2024-10-29 15:19 ` [PATCHv10 6/9] io_uring: enable per-io hinting capability Keith Busch
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 42+ messages in thread
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

From: Keith Busch <[email protected]>

This prepares for sources other than the inode to provide a write hint.
The block layer will use it for direct IO if the requested hint is
within the block device's allowed hints.

Signed-off-by: Keith Busch <[email protected]>
---
 block/fops.c       | 31 ++++++++++++++++++++++++++++---
 include/linux/fs.h |  1 +
 2 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 2d01c90076813..bb3855ee044f0 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -71,7 +71,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 		bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
 	}
 	bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio.bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio.bi_write_hint = iocb->ki_write_hint;
 	bio.bi_ioprio = iocb->ki_ioprio;
 	if (iocb->ki_flags & IOCB_ATOMIC)
 		bio.bi_opf |= REQ_ATOMIC;
@@ -200,7 +200,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 
 	for (;;) {
 		bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-		bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+		bio->bi_write_hint = iocb->ki_write_hint;
 		bio->bi_private = dio;
 		bio->bi_end_io = blkdev_bio_end_io;
 		bio->bi_ioprio = iocb->ki_ioprio;
@@ -316,7 +316,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	dio->flags = 0;
 	dio->iocb = iocb;
 	bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio->bi_write_hint = iocb->ki_write_hint;
 	bio->bi_end_io = blkdev_bio_end_io_async;
 	bio->bi_ioprio = iocb->ki_ioprio;
 
@@ -362,6 +362,23 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	return -EIOCBQUEUED;
 }
 
+static int blkdev_write_hint(struct kiocb *iocb, struct block_device *bdev)
+{
+	u16 hint = iocb->ki_write_hint;
+
+	if (!hint)
+		return file_inode(iocb->ki_filp)->i_write_hint;
+
+	if (hint > bdev_max_write_hints(bdev))
+		return -EINVAL;
+
+	if (bdev_is_partition(bdev) &&
+	    !test_bit(hint - 1, bdev->write_hint_mask))
+		return -EINVAL;
+
+	return hint;
+}
+
 static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
@@ -373,6 +390,14 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	if (blkdev_dio_invalid(bdev, iocb, iter))
 		return -EINVAL;
 
+	if (iov_iter_rw(iter) == WRITE) {
+		int hint = blkdev_write_hint(iocb, bdev);
+
+		if (hint < 0)
+			return hint;
+		iocb->ki_write_hint = hint;
+	}
+
 	nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1);
 	if (likely(nr_pages <= BIO_MAX_VECS)) {
 		if (is_sync_kiocb(iocb))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4b5cad44a1268..1a00accf412e5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -370,6 +370,7 @@ struct kiocb {
 	void			*private;
 	int			ki_flags;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	u16			ki_write_hint;
 	union {
 		/*
 		 * Only used for async buffered reads, where it denotes the
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCHv10 6/9] io_uring: enable per-io hinting capability
  2024-10-29 15:19 [PATCHv10 0/9] write hints with nvme fdp, scsi streams Keith Busch
                   ` (4 preceding siblings ...)
  2024-10-29 15:19 ` [PATCHv10 5/9] block, fs: add write hint to kiocb Keith Busch
@ 2024-10-29 15:19 ` Keith Busch
  2024-10-29 15:19 ` [PATCHv10 7/9] block: export placement hint feature Keith Busch
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 42+ messages in thread
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche,
	Nitesh Shetty, Keith Busch

From: Kanchan Joshi <[email protected]>

With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and
all the subsequent writes on the file pass that hint value down. This
can be limiting for block device as all the writes will be tagged with
only one lifetime hint value. Concurrent writes (with different hint
values) are hard to manage. Per-IO hinting solves that problem.

Allow userspace to pass additional metadata in the SQE.

	__u16 write_hint;

If the hint is provided, filesystems may optionally use it. A filesytem
may ignore this field if it does not support per-io hints, or if the
value is invalid for its backing storage. Just like the inode hints,
requesting values that are not supported by the hardware are not an
error.

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
 include/uapi/linux/io_uring.h | 4 ++++
 io_uring/io_uring.c           | 2 ++
 io_uring/rw.c                 | 3 ++-
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 0247452837830..6e1985d3b306c 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -92,6 +92,10 @@ struct io_uring_sqe {
 			__u16	addr_len;
 			__u16	__pad3[1];
 		};
+		struct {
+			__u16	write_hint;
+			__u16	__pad4[1];
+		};
 	};
 	union {
 		struct {
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4514644fdf52e..65b8e29b67ec7 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3875,7 +3875,9 @@ static int __init io_uring_init(void)
 	BUILD_BUG_SQE_ELEM(44, __s32,  splice_fd_in);
 	BUILD_BUG_SQE_ELEM(44, __u32,  file_index);
 	BUILD_BUG_SQE_ELEM(44, __u16,  addr_len);
+	BUILD_BUG_SQE_ELEM(44, __u16,  write_hint);
 	BUILD_BUG_SQE_ELEM(46, __u16,  __pad3[0]);
+	BUILD_BUG_SQE_ELEM(46, __u16,  __pad4[0]);
 	BUILD_BUG_SQE_ELEM(48, __u64,  addr3);
 	BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
 	BUILD_BUG_SQE_ELEM(56, __u64,  __pad2);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 7ce1cbc048faf..b5dea58356d93 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -279,7 +279,8 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		rw->kiocb.ki_ioprio = get_current_ioprio();
 	}
 	rw->kiocb.dio_complete = NULL;
-
+	if (ddir == ITER_SOURCE)
+		rw->kiocb.ki_write_hint = READ_ONCE(sqe->write_hint);
 	rw->addr = READ_ONCE(sqe->addr);
 	rw->len = READ_ONCE(sqe->len);
 	rw->flags = READ_ONCE(sqe->rw_flags);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCHv10 7/9] block: export placement hint feature
  2024-10-29 15:19 [PATCHv10 0/9] write hints with nvme fdp, scsi streams Keith Busch
                   ` (5 preceding siblings ...)
  2024-10-29 15:19 ` [PATCHv10 6/9] io_uring: enable per-io hinting capability Keith Busch
@ 2024-10-29 15:19 ` Keith Busch
  2024-10-29 15:19 ` [PATCHv10 8/9] nvme: enable FDP support Keith Busch
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 42+ messages in thread
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

From: Keith Busch <[email protected]>

Add a feature flag for devices that support generic placement hints in
write commands. This is in contrast to data lifetime hints.

Signed-off-by: Keith Busch <[email protected]>
---
 block/blk-settings.c   | 2 ++
 block/blk-sysfs.c      | 3 +++
 include/linux/blkdev.h | 3 +++
 3 files changed, 8 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 921fb4d334fa4..7e3bb3cec7032 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -518,6 +518,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 		t->features &= ~BLK_FEAT_NOWAIT;
 	if (!(b->features & BLK_FEAT_POLL))
 		t->features &= ~BLK_FEAT_POLL;
+	if (!(b->features & BLK_FEAT_PLACEMENT_HINTS))
+		t->features &= ~BLK_FEAT_PLACEMENT_HINTS;
 
 	t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 85f48ca461049..51e0b99c2210a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -260,6 +260,7 @@ static ssize_t queue_##_name##_show(struct gendisk *disk, char *page)	\
 QUEUE_SYSFS_FEATURE_SHOW(poll, BLK_FEAT_POLL);
 QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA);
 QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX);
+QUEUE_SYSFS_FEATURE_SHOW(placement_hints, BLK_FEAT_PLACEMENT_HINTS);
 
 static ssize_t queue_zoned_show(struct gendisk *disk, char *page)
 {
@@ -497,6 +498,7 @@ QUEUE_RW_ENTRY(queue_poll_delay, "io_poll_delay");
 QUEUE_RW_ENTRY(queue_wc, "write_cache");
 QUEUE_RO_ENTRY(queue_fua, "fua");
 QUEUE_RO_ENTRY(queue_dax, "dax");
+QUEUE_RO_ENTRY(queue_placement_hints, "placement_hints");
 QUEUE_RW_ENTRY(queue_io_timeout, "io_timeout");
 QUEUE_RO_ENTRY(queue_virt_boundary_mask, "virt_boundary_mask");
 QUEUE_RO_ENTRY(queue_dma_alignment, "dma_alignment");
@@ -626,6 +628,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_wc_entry.attr,
 	&queue_fua_entry.attr,
 	&queue_dax_entry.attr,
+	&queue_placement_hints_entry.attr,
 	&queue_poll_delay_entry.attr,
 	&queue_virt_boundary_mask_entry.attr,
 	&queue_dma_alignment_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 90a19ea889276..c577a607591de 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -333,6 +333,9 @@ typedef unsigned int __bitwise blk_features_t;
 #define BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE \
 	((__force blk_features_t)(1u << 15))
 
+/* supports generic write placement hints */
+#define BLK_FEAT_PLACEMENT_HINTS	((__force blk_features_t)(1u << 16))
+
 /*
  * Flags automatically inherited when stacking limits.
  */
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCHv10 8/9] nvme: enable FDP support
  2024-10-29 15:19 [PATCHv10 0/9] write hints with nvme fdp, scsi streams Keith Busch
                   ` (6 preceding siblings ...)
  2024-10-29 15:19 ` [PATCHv10 7/9] block: export placement hint feature Keith Busch
@ 2024-10-29 15:19 ` Keith Busch
  2024-10-30  0:24   ` Chaitanya Kulkarni
  2024-10-29 15:19 ` [PATCHv10 9/9] scsi: set permanent stream count in block limits Keith Busch
  2024-10-29 15:24 ` [PATCHv10 0/9] write hints with nvme fdp, scsi streams Christoph Hellwig
  9 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Hui Qi,
	Nitesh Shetty, Hannes Reinecke, Keith Busch

From: Kanchan Joshi <[email protected]>

Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
to control the placement of logical blocks so as to reduce the SSD WAF.
Userspace can send the write hint information using io_uring or fcntl.

Fetch the placement-identifiers if the device supports FDP. The incoming
write-hint is mapped to a placement-identifier, which in turn is set in
the DSPEC field of the write command.

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Hui Qi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
 drivers/nvme/host/core.c | 84 ++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  5 +++
 include/linux/nvme.h     | 19 +++++++++
 3 files changed, 108 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3de7555a7de74..bd7b89912ddb9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -44,6 +44,20 @@ struct nvme_ns_info {
 	bool is_removed;
 };
 
+struct nvme_fdp_ruh_status_desc {
+	u16 pid;
+	u16 ruhid;
+	u32 earutr;
+	u64 ruamw;
+	u8  rsvd16[16];
+};
+
+struct nvme_fdp_ruh_status {
+	u8  rsvd0[14];
+	__le16 nruhsd;
+	struct nvme_fdp_ruh_status_desc ruhsd[];
+};
+
 unsigned int admin_timeout = 60;
 module_param(admin_timeout, uint, 0644);
 MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
@@ -657,6 +671,7 @@ static void nvme_free_ns_head(struct kref *ref)
 	ida_free(&head->subsys->ns_ida, head->instance);
 	cleanup_srcu_struct(&head->srcu);
 	nvme_put_subsystem(head->subsys);
+	kfree(head->plids);
 	kfree(head);
 }
 
@@ -974,6 +989,13 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 	if (req->cmd_flags & REQ_RAHEAD)
 		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
 
+	if (req->write_hint && ns->head->nr_plids) {
+		u16 hint = max(req->write_hint, ns->head->nr_plids);
+
+		dsmgmt |= ns->head->plids[hint - 1] << 16;
+		control |= NVME_RW_DTYPE_DPLCMT;
+	}
+
 	if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req))
 		return BLK_STS_INVAL;
 
@@ -2105,6 +2127,52 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
 	return ret;
 }
 
+static int nvme_fetch_fdp_plids(struct nvme_ns *ns, u32 nsid)
+{
+	struct nvme_fdp_ruh_status_desc *ruhsd;
+	struct nvme_ns_head *head = ns->head;
+	struct nvme_fdp_ruh_status *ruhs;
+	struct nvme_command c = {};
+	int size, ret, i;
+
+	if (head->plids)
+		return 0;
+
+	size = struct_size(ruhs, ruhsd, NVME_MAX_PLIDS);
+	ruhs = kzalloc(size, GFP_KERNEL);
+	if (!ruhs)
+		return -ENOMEM;
+
+	c.imr.opcode = nvme_cmd_io_mgmt_recv;
+	c.imr.nsid = cpu_to_le32(nsid);
+	c.imr.mo = 0x1;
+	c.imr.numd =  cpu_to_le32((size >> 2) - 1);
+
+	ret = nvme_submit_sync_cmd(ns->queue, &c, ruhs, size);
+	if (ret)
+		goto out;
+
+	i = le16_to_cpu(ruhs->nruhsd);
+	if (!i)
+		goto out;
+
+	ns->head->nr_plids = min_t(u16, i, NVME_MAX_PLIDS);
+	head->plids = kcalloc(ns->head->nr_plids, sizeof(head->plids),
+			      GFP_KERNEL);
+	if (!head->plids) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < ns->head->nr_plids; i++) {
+		ruhsd = &ruhs->ruhsd[i];
+		head->plids[i] = le16_to_cpu(ruhsd->pid);
+	}
+out:
+	kfree(ruhs);
+	return ret;
+}
+
 static int nvme_update_ns_info_block(struct nvme_ns *ns,
 		struct nvme_ns_info *info)
 {
@@ -2141,6 +2209,19 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 			goto out;
 	}
 
+	if (ns->ctrl->ctratt & NVME_CTRL_ATTR_FDPS) {
+		ret = nvme_fetch_fdp_plids(ns, info->nsid);
+		if (ret)
+			dev_warn(ns->ctrl->device,
+				"FDP failure status:0x%x\n", ret);
+		if (ret < 0)
+			goto out;
+	} else {
+		ns->head->nr_plids = 0;
+		kfree(ns->head->plids);
+		ns->head->plids = NULL;
+	}
+
 	blk_mq_freeze_queue(ns->disk->queue);
 	ns->head->lba_shift = id->lbaf[lbaf].ds;
 	ns->head->nuse = le64_to_cpu(id->nuse);
@@ -2171,6 +2252,9 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 	if (!nvme_init_integrity(ns->head, &lim, info))
 		capacity = 0;
 
+	lim.max_write_hints = ns->head->nr_plids;
+	if (lim.max_write_hints)
+		lim.features |= BLK_FEAT_PLACEMENT_HINTS;
 	ret = queue_limits_commit_update(ns->disk->queue, &lim);
 	if (ret) {
 		blk_mq_unfreeze_queue(ns->disk->queue);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 093cb423f536b..cec8e5d96377b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -454,6 +454,8 @@ struct nvme_ns_ids {
 	u8	csi;
 };
 
+#define NVME_MAX_PLIDS   (NVME_CTRL_PAGE_SIZE / sizeof(16))
+
 /*
  * Anchor structure for namespaces.  There is one for each namespace in a
  * NVMe subsystem that any of our controllers can see, and the namespace
@@ -490,6 +492,9 @@ struct nvme_ns_head {
 	struct device		cdev_device;
 
 	struct gendisk		*disk;
+
+	u16                     nr_plids;
+	u16			*plids;
 #ifdef CONFIG_NVME_MULTIPATH
 	struct bio_list		requeue_list;
 	spinlock_t		requeue_lock;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b58d9405d65e0..a954eaee5b0f3 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -275,6 +275,7 @@ enum nvme_ctrl_attr {
 	NVME_CTRL_ATTR_HID_128_BIT	= (1 << 0),
 	NVME_CTRL_ATTR_TBKAS		= (1 << 6),
 	NVME_CTRL_ATTR_ELBAS		= (1 << 15),
+	NVME_CTRL_ATTR_FDPS		= (1 << 19),
 };
 
 struct nvme_id_ctrl {
@@ -843,6 +844,7 @@ enum nvme_opcode {
 	nvme_cmd_resv_register	= 0x0d,
 	nvme_cmd_resv_report	= 0x0e,
 	nvme_cmd_resv_acquire	= 0x11,
+	nvme_cmd_io_mgmt_recv	= 0x12,
 	nvme_cmd_resv_release	= 0x15,
 	nvme_cmd_zone_mgmt_send	= 0x79,
 	nvme_cmd_zone_mgmt_recv	= 0x7a,
@@ -864,6 +866,7 @@ enum nvme_opcode {
 		nvme_opcode_name(nvme_cmd_resv_register),	\
 		nvme_opcode_name(nvme_cmd_resv_report),		\
 		nvme_opcode_name(nvme_cmd_resv_acquire),	\
+		nvme_opcode_name(nvme_cmd_io_mgmt_recv),	\
 		nvme_opcode_name(nvme_cmd_resv_release),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_send),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_recv),	\
@@ -1015,6 +1018,7 @@ enum {
 	NVME_RW_PRINFO_PRCHK_GUARD	= 1 << 12,
 	NVME_RW_PRINFO_PRACT		= 1 << 13,
 	NVME_RW_DTYPE_STREAMS		= 1 << 4,
+	NVME_RW_DTYPE_DPLCMT		= 2 << 4,
 	NVME_WZ_DEAC			= 1 << 9,
 };
 
@@ -1102,6 +1106,20 @@ struct nvme_zone_mgmt_recv_cmd {
 	__le32			cdw14[2];
 };
 
+struct nvme_io_mgmt_recv_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__le64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__u8			mo;
+	__u8			rsvd11;
+	__u16			mos;
+	__le32			numd;
+	__le32			cdw12[4];
+};
+
 enum {
 	NVME_ZRA_ZONE_REPORT		= 0,
 	NVME_ZRASF_ZONE_REPORT_ALL	= 0,
@@ -1822,6 +1840,7 @@ struct nvme_command {
 		struct nvmf_auth_receive_command auth_receive;
 		struct nvme_dbbuf dbbuf;
 		struct nvme_directive_cmd directive;
+		struct nvme_io_mgmt_recv_cmd imr;
 	};
 };
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-29 15:19 [PATCHv10 0/9] write hints with nvme fdp, scsi streams Keith Busch
                   ` (7 preceding siblings ...)
  2024-10-29 15:19 ` [PATCHv10 8/9] nvme: enable FDP support Keith Busch
@ 2024-10-29 15:19 ` Keith Busch
  2024-10-29 15:26   ` Christoph Hellwig
  2024-10-29 15:24 ` [PATCHv10 0/9] write hints with nvme fdp, scsi streams Christoph Hellwig
  9 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch,
	Hannes Reinecke

From: Keith Busch <[email protected]>

The block limits exports the number of write hints, so set this limit if
the device reports support for the lifetime hints. Not only does this
inform the user of which hints are possible, it also allows scsi devices
supporting the feature to utilize the full range through raw block
device direct-io.

Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
 drivers/scsi/sd.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index ca4bc0ac76adc..235dd6e5b6688 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3768,6 +3768,8 @@ static int sd_revalidate_disk(struct gendisk *disk)
 		sd_config_protection(sdkp, &lim);
 	}
 
+	lim.max_write_hints = sdkp->permanent_stream_count;
+
 	/*
 	 * We now have all cache related info, determine how we deal
 	 * with flush requests.
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 4/9] block: allow ability to limit partition write hints
  2024-10-29 15:19 ` [PATCHv10 4/9] block: allow ability to limit partition write hints Keith Busch
@ 2024-10-29 15:23   ` Christoph Hellwig
  2024-10-29 17:25   ` Bart Van Assche
  1 sibling, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2024-10-29 15:23 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
	joshi.k, javier.gonz, bvanassche, Keith Busch

On Tue, Oct 29, 2024 at 08:19:17AM -0700, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> When multiple partitions are used, you may want to enforce different
> subsets of the available write hints for each partition. Provide a
> bitmap attribute of the available write hints, and allow an admin to
> write a different mask to set the partition's allowed write hints.

This still offers all hints to all partitions by default, which still
breaks all use cases where you have actual users of the stream separation
on multiple paritions.

Please assign either all resources to the first partition and none to
the others (probably the easiest and useful for the most common use
case) or split it evenly.

Note that the bdev_max_streams value also needs to be adjusted to only
return the number of streams actually available to a partition.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
  2024-10-29 15:19 [PATCHv10 0/9] write hints with nvme fdp, scsi streams Keith Busch
                   ` (8 preceding siblings ...)
  2024-10-29 15:19 ` [PATCHv10 9/9] scsi: set permanent stream count in block limits Keith Busch
@ 2024-10-29 15:24 ` Christoph Hellwig
  9 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2024-10-29 15:24 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
	joshi.k, javier.gonz, bvanassche, Keith Busch

On Tue, Oct 29, 2024 at 08:19:13AM -0700, Keith Busch wrote:
>   Return invalid value if user requests an invalid write hint
> 
>   Added and exported a block device feature flag for indicating generic
>   placement hint support

But it still talks of write hints everywhere and conflates the write
streams with the temperature hints which are completely different
beasts.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-29 15:19 ` [PATCHv10 9/9] scsi: set permanent stream count in block limits Keith Busch
@ 2024-10-29 15:26   ` Christoph Hellwig
  2024-10-29 15:34     ` Keith Busch
  2024-10-29 17:18     ` Bart Van Assche
  0 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2024-10-29 15:26 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
	joshi.k, javier.gonz, bvanassche, Keith Busch, Hannes Reinecke

On Tue, Oct 29, 2024 at 08:19:22AM -0700, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> The block limits exports the number of write hints, so set this limit if
> the device reports support for the lifetime hints. Not only does this
> inform the user of which hints are possible, it also allows scsi devices
> supporting the feature to utilize the full range through raw block
> device direct-io.
> 
> Reviewed-by: Bart Van Assche <[email protected]>
> Reviewed-by: Hannes Reinecke <[email protected]>
> Signed-off-by: Keith Busch <[email protected]>

Despite the reviews this is still incorrect.  The permanent streams have
a relative data temperature associated with them as pointed out last
round and are not arbitrary write stream contexts despite (ab)using
the SBC streams facilities.

Bart, btw: I think the current sd implementation is buggy as well, as
it assumes the permanent streams are ordered by their data temperature
in the IO Advise hints mode page, but I can't find anything in the
spec that requires a particular ordering.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-29 15:26   ` Christoph Hellwig
@ 2024-10-29 15:34     ` Keith Busch
  2024-10-29 15:37       ` Christoph Hellwig
  2024-10-29 17:18     ` Bart Van Assche
  1 sibling, 1 reply; 42+ messages in thread
From: Keith Busch @ 2024-10-29 15:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke

On Tue, Oct 29, 2024 at 04:26:54PM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 08:19:22AM -0700, Keith Busch wrote:
> > From: Keith Busch <[email protected]>
> > 
> > The block limits exports the number of write hints, so set this limit if
> > the device reports support for the lifetime hints. Not only does this
> > inform the user of which hints are possible, it also allows scsi devices
> > supporting the feature to utilize the full range through raw block
> > device direct-io.
> > 
> > Reviewed-by: Bart Van Assche <[email protected]>
> > Reviewed-by: Hannes Reinecke <[email protected]>
> > Signed-off-by: Keith Busch <[email protected]>
> 
> Despite the reviews this is still incorrect.  The permanent streams have
> a relative data temperature associated with them as pointed out last
> round and are not arbitrary write stream contexts despite (ab)using
> the SBC streams facilities.

So then don't use it that way? I still don't know what change you're
expecting to happen with this feedback. What do you want the kernel to
do differently here?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-29 15:34     ` Keith Busch
@ 2024-10-29 15:37       ` Christoph Hellwig
  2024-10-29 15:38         ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2024-10-29 15:37 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
	linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
	bvanassche, Hannes Reinecke

On Tue, Oct 29, 2024 at 09:34:07AM -0600, Keith Busch wrote:
> So then don't use it that way? I still don't know what change you're
> expecting to happen with this feedback. What do you want the kernel to
> do differently here?

Same as before:  don't expose them as write streams, because they
aren't.  A big mess in this series going back to the versions before
your involvement is that they somehow want to tie up the temperature
hints with the stream separation, which just ends up very messy.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-29 15:37       ` Christoph Hellwig
@ 2024-10-29 15:38         ` Keith Busch
  2024-10-29 15:53           ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2024-10-29 15:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke

On Tue, Oct 29, 2024 at 04:37:02PM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 09:34:07AM -0600, Keith Busch wrote:
> > So then don't use it that way? I still don't know what change you're
> > expecting to happen with this feedback. What do you want the kernel to
> > do differently here?
> 
> Same as before:  don't expose them as write streams, because they
> aren't.  A big mess in this series going back to the versions before
> your involvement is that they somehow want to tie up the temperature
> hints with the stream separation, which just ends up very messy.

They're not exposed as write streams. Patch 7/9 sets the feature if it
is a placement id or not, and only nvme sets it, so scsi's attributes
are not claiming to be a write stream.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-29 15:38         ` Keith Busch
@ 2024-10-29 15:53           ` Christoph Hellwig
  2024-10-29 16:22             ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2024-10-29 15:53 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
	linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
	bvanassche, Hannes Reinecke

On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> They're not exposed as write streams. Patch 7/9 sets the feature if it
> is a placement id or not, and only nvme sets it, so scsi's attributes
> are not claiming to be a write stream.

So it shows up in sysfs, but:

 - queue_max_write_hints (which really should be queue_max_write_streams)
   still picks it up, and from there the statx interface

 - per-inode fcntl hint that encode a temperature still magically
   get dumpted into the write streams if they are set.

In other words it's a really leaky half-backed abstraction.

Let's brainstorm how it could be done better:

 - the max_write_streams values only set by block devices that actually
   do support write streams, and not the fire and forget temperature
   hints.  They way this is queried is by having a non-zero value
   there, not need for an extra flag.
 - but the struct file (or maybe inode) gets a supported flag, as stream
   separation needs to be supported by the file system 
 - a separate fcntl is used to set per-inode streams (if you care about
   that, seem like the bdev use case focusses on per-I/O).  In that case
   we'd probably also need a separate inode field for them, or a somewhat
   complicated scheme to decide what is stored in the inode field if there
   is only one.
 - for block devices bdev/fops.c maps the temperature hints into write
   streams if write streams are supported, any user that mixes and
   matches write streams and temperature hints gets what they deserve
 - this could also be a helper for file systems that want to do the
   same.

Just a quick writeup while I'm on the run, there's probably a hole or
two that could be poked into it.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-29 15:53           ` Christoph Hellwig
@ 2024-10-29 16:22             ` Keith Busch
  2024-10-30  4:55               ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2024-10-29 16:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke

On Tue, Oct 29, 2024 at 04:53:30PM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> > They're not exposed as write streams. Patch 7/9 sets the feature if it
> > is a placement id or not, and only nvme sets it, so scsi's attributes
> > are not claiming to be a write stream.
> 
> So it shows up in sysfs, but:
> 
>  - queue_max_write_hints (which really should be queue_max_write_streams)
>    still picks it up, and from there the statx interface
> 
>  - per-inode fcntl hint that encode a temperature still magically
>    get dumpted into the write streams if they are set.
> 
> In other words it's a really leaky half-backed abstraction.

Exactly why I asked last time: "who uses it and how do you want them to
use it" :)
 
> Let's brainstorm how it could be done better:
> 
>  - the max_write_streams values only set by block devices that actually
>    do support write streams, and not the fire and forget temperature
>    hints.  They way this is queried is by having a non-zero value
>    there, not need for an extra flag.

So we need a completely different attribute for SCSI's permanent write
streams? You'd mentioned earlier you were okay with having SCSI be able
to utilized per-io raw block write hints. Having multiple things to
check for what are all just write classifiers seems unnecessarily
complicated.

>  - but the struct file (or maybe inode) gets a supported flag, as stream
>    separation needs to be supported by the file system 
>  - a separate fcntl is used to set per-inode streams (if you care about
>    that, seem like the bdev use case focusses on per-I/O).  In that case
>    we'd probably also need a separate inode field for them, or a somewhat
>    complicated scheme to decide what is stored in the inode field if there
>    is only one.

No need to create a new fcntl. The people already testing this are
successfully using FDP with the existing fcntl hints. Their applications
leverage FDP as way to separate files based on expected lifetime. It is
how they want to use it and it is working above expectations. 

>  - for block devices bdev/fops.c maps the temperature hints into write
>    streams if write streams are supported, any user that mixes and
>    matches write streams and temperature hints gets what they deserve

That's fine. This patch series pretty much accomplishes that part.

>  - this could also be a helper for file systems that want to do the
>    same.
> 
> Just a quick writeup while I'm on the run, there's probably a hole or
> two that could be poked into it.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-29 15:26   ` Christoph Hellwig
  2024-10-29 15:34     ` Keith Busch
@ 2024-10-29 17:18     ` Bart Van Assche
  2024-10-30  5:42       ` Christoph Hellwig
  1 sibling, 1 reply; 42+ messages in thread
From: Bart Van Assche @ 2024-10-29 17:18 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel,
	joshi.k, javier.gonz, Keith Busch, Hannes Reinecke

On 10/29/24 8:26 AM, Christoph Hellwig wrote:
> Bart, btw: I think the current sd implementation is buggy as well, as
> it assumes the permanent streams are ordered by their data temperature
> in the IO Advise hints mode page, but I can't find anything in the
> spec that requires a particular ordering.

How about modifying sd_read_io_hints() such that permanent stream
information is ignored if the order of the RELATIVE LIFETIME information
reported by the GET STREAM STATUS command does not match the permanent
stream order?

Thanks,

Bart.

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 41e2dfa2d67d..277035febd82 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3192,7 +3192,12 @@ sd_read_cache_type(struct scsi_disk *sdkp, 
unsigned char *buffer)
  	sdkp->DPOFUA = 0;
  }

-static bool sd_is_perm_stream(struct scsi_disk *sdkp, unsigned int 
stream_id)
+/*
+ * Returns the relative lifetime of a permanent stream. Returns -1 if the
+ * GET STREAM STATUS command fails or if the stream is not a permanent 
stream.
+ */
+static int sd_perm_stream_rel_lifetime(struct scsi_disk *sdkp,
+				       unsigned int stream_id)
  {
  	u8 cdb[16] = { SERVICE_ACTION_IN_16, SAI_GET_STREAM_STATUS };
  	struct {
@@ -3212,14 +3217,16 @@ static bool sd_is_perm_stream(struct scsi_disk 
*sdkp, unsigned int stream_id)
  	res = scsi_execute_cmd(sdev, cdb, REQ_OP_DRV_IN, &buf, sizeof(buf),
  			       SD_TIMEOUT, sdkp->max_retries, &exec_args);
  	if (res < 0)
-		return false;
+		return -1;
  	if (scsi_status_is_check_condition(res) && scsi_sense_valid(&sshdr))
  		sd_print_sense_hdr(sdkp, &sshdr);
  	if (res)
-		return false;
+		return -1;
  	if (get_unaligned_be32(&buf.h.len) < sizeof(struct scsi_stream_status))
-		return false;
-	return buf.h.stream_status[0].perm;
+		return -1;
+	if (!buf.h.stream_status[0].perm)
+		return -1;
+	return buf.h.stream_status[0].rel_lifetime;
  }

  static void sd_read_io_hints(struct scsi_disk *sdkp, unsigned char 
*buffer)
@@ -3247,9 +3254,17 @@ static void sd_read_io_hints(struct scsi_disk 
*sdkp, unsigned char *buffer)
  	 * should assign the lowest numbered stream identifiers to permanent
  	 * streams.
  	 */
-	for (desc = start; desc < end; desc++)
-		if (!desc->st_enble || !sd_is_perm_stream(sdkp, desc - start))
+	int prev_rel_lifetime = -1;
+	for (desc = start; desc < end; desc++) {
+		int rel_lifetime;
+
+		if (!desc->st_enble)
  			break;
+		rel_lifetime = sd_perm_stream_rel_lifetime(sdkp, desc - start);
+		if (rel_lifetime < 0 || rel_lifetime < prev_rel_lifetime)
+			break;
+		prev_rel_lifetime = rel_lifetime;
+	}
  	permanent_stream_count_old = sdkp->permanent_stream_count;
  	sdkp->permanent_stream_count = desc - start;
  	if (sdkp->rscs && sdkp->permanent_stream_count < 2)


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 1/9] block: use generic u16 for write hints
  2024-10-29 15:19 ` [PATCHv10 1/9] block: use generic u16 for write hints Keith Busch
@ 2024-10-29 17:21   ` Bart Van Assche
  0 siblings, 0 replies; 42+ messages in thread
From: Bart Van Assche @ 2024-10-29 17:21 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, Keith Busch

On 10/29/24 8:19 AM, Keith Busch wrote:
> This is still backwards compatible with lifetime hints. It just doesn't
> constrain the hints to that definition. Using this type doesn't change
> the size of either bio or request.

Reviewed-by: Bart Van Assche <[email protected]>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 4/9] block: allow ability to limit partition write hints
  2024-10-29 15:19 ` [PATCHv10 4/9] block: allow ability to limit partition write hints Keith Busch
  2024-10-29 15:23   ` Christoph Hellwig
@ 2024-10-29 17:25   ` Bart Van Assche
  2024-10-30  4:46     ` Christoph Hellwig
  1 sibling, 1 reply; 42+ messages in thread
From: Bart Van Assche @ 2024-10-29 17:25 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, Keith Busch

On 10/29/24 8:19 AM, Keith Busch wrote:
> +static ssize_t part_write_hint_mask_store(struct device *dev,
> +					  struct device_attribute *attr,
> +					  const char *buf, size_t count)
> +{
> +	struct block_device *bdev = dev_to_bdev(dev);
> +	unsigned short max_write_hints = bdev_max_write_hints(bdev);
> +	unsigned long *new_mask;
> +
> +	if (!max_write_hints)
> +		return count;
> +
> +	new_mask = bitmap_alloc(max_write_hints, GFP_KERNEL);
> +	if (!new_mask)
> +		return -ENOMEM;
> +
> +	bitmap_parse(buf, count, new_mask, max_write_hints);
> +	bitmap_copy(bdev->write_hint_mask, new_mask, max_write_hints);
> +	bitmap_free(new_mask);
> +
> +	return count;
> +}

bitmap_copy() is not atomic. Shouldn't the bitmap_copy() call be
serialized against the code that tests bits in bdev->write_hint_mask?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 8/9] nvme: enable FDP support
  2024-10-29 15:19 ` [PATCHv10 8/9] nvme: enable FDP support Keith Busch
@ 2024-10-30  0:24   ` Chaitanya Kulkarni
  0 siblings, 0 replies; 42+ messages in thread
From: Chaitanya Kulkarni @ 2024-10-30  0:24 UTC (permalink / raw)
  To: Keith Busch
  Cc: [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected], Hui Qi,
	Nitesh Shetty, Hannes Reinecke, Keith Busch,
	[email protected]

On 10/29/24 08:19, Keith Busch wrote:
> From: Kanchan Joshi <[email protected]>
>
> Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
> to control the placement of logical blocks so as to reduce the SSD WAF.
> Userspace can send the write hint information using io_uring or fcntl.
>
> Fetch the placement-identifiers if the device supports FDP. The incoming
> write-hint is mapped to a placement-identifier, which in turn is set in
> the DSPEC field of the write command.
>
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Hui Qi <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
> Reviewed-by: Hannes Reinecke <[email protected]>
> Signed-off-by: Keith Busch <[email protected]>
> ---
>   drivers/nvme/host/core.c | 84 ++++++++++++++++++++++++++++++++++++++++
>   drivers/nvme/host/nvme.h |  5 +++
>   include/linux/nvme.h     | 19 +++++++++
>   3 files changed, 108 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 3de7555a7de74..bd7b89912ddb9 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -44,6 +44,20 @@ struct nvme_ns_info {
>   	bool is_removed;
>   };
>   
> +struct nvme_fdp_ruh_status_desc {
> +	u16 pid;
> +	u16 ruhid;
> +	u32 earutr;
> +	u64 ruamw;
> +	u8  rsvd16[16];
> +};
> +
> +struct nvme_fdp_ruh_status {
> +	u8  rsvd0[14];
> +	__le16 nruhsd;
> +	struct nvme_fdp_ruh_status_desc ruhsd[];
> +};
> +
>   unsigned int admin_timeout = 60;
>   module_param(admin_timeout, uint, 0644);
>   MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
> @@ -657,6 +671,7 @@ static void nvme_free_ns_head(struct kref *ref)
>   	ida_free(&head->subsys->ns_ida, head->instance);
>   	cleanup_srcu_struct(&head->srcu);
>   	nvme_put_subsystem(head->subsys);
> +	kfree(head->plids);
>   	kfree(head);
>   }
>   
> @@ -974,6 +989,13 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>   	if (req->cmd_flags & REQ_RAHEAD)
>   		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
>   
> +	if (req->write_hint && ns->head->nr_plids) {
> +		u16 hint = max(req->write_hint, ns->head->nr_plids);
> +
> +		dsmgmt |= ns->head->plids[hint - 1] << 16;
> +		control |= NVME_RW_DTYPE_DPLCMT;
> +	}
> +
>   	if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req))
>   		return BLK_STS_INVAL;
>   
> @@ -2105,6 +2127,52 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
>   	return ret;
>   }
>   
> +static int nvme_fetch_fdp_plids(struct nvme_ns *ns, u32 nsid)
> +{
> +	struct nvme_fdp_ruh_status_desc *ruhsd;
> +	struct nvme_ns_head *head = ns->head;
> +	struct nvme_fdp_ruh_status *ruhs;
> +	struct nvme_command c = {};
> +	int size, ret, i;
> +
> +	if (head->plids)
> +		return 0;
> +
> +	size = struct_size(ruhs, ruhsd, NVME_MAX_PLIDS);
> +	ruhs = kzalloc(size, GFP_KERNEL);
> +	if (!ruhs)
> +		return -ENOMEM;
> +
> +	c.imr.opcode = nvme_cmd_io_mgmt_recv;
> +	c.imr.nsid = cpu_to_le32(nsid);
> +	c.imr.mo = 0x1;

can we please add some comment where values are hardcoded ?

> +	c.imr.numd =  cpu_to_le32((size >> 2) - 1);
> +
> +	ret = nvme_submit_sync_cmd(ns->queue, &c, ruhs, size);
> +	if (ret)
> +		goto out;
> +
> +	i = le16_to_cpu(ruhs->nruhsd);

instead of i why can't we use local variable nr_plids ?



> +	if (!i)
> +		goto out;
> +
> +	ns->head->nr_plids = min_t(u16, i, NVME_MAX_PLIDS);
> +	head->plids = kcalloc(ns->head->nr_plids, sizeof(head->plids),
> +			      GFP_KERNEL);
> +	if (!head->plids) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	for (i = 0; i < ns->head->nr_plids; i++) {
> +		ruhsd = &ruhs->ruhsd[i];
> +		head->plids[i] = le16_to_cpu(ruhsd->pid);
> +	}
> +out:
> +	kfree(ruhs);
> +	return ret;
> +}
> +
>   static int nvme_update_ns_info_block(struct nvme_ns *ns,
>   		struct nvme_ns_info *info)
>   {
> @@ -2141,6 +2209,19 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
>   			goto out;
>   	}
>   
> +	if (ns->ctrl->ctratt & NVME_CTRL_ATTR_FDPS) {
> +		ret = nvme_fetch_fdp_plids(ns, info->nsid);
> +		if (ret)
> +			dev_warn(ns->ctrl->device,
> +				"FDP failure status:0x%x\n", ret);
> +		if (ret < 0)
> +			goto out;
> +	} else {
> +		ns->head->nr_plids = 0;
> +		kfree(ns->head->plids);
> +		ns->head->plids = NULL;
> +	}
> +
>   	blk_mq_freeze_queue(ns->disk->queue);
>   	ns->head->lba_shift = id->lbaf[lbaf].ds;
>   	ns->head->nuse = le64_to_cpu(id->nuse);
> @@ -2171,6 +2252,9 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
>   	if (!nvme_init_integrity(ns->head, &lim, info))
>   		capacity = 0;
>   
> +	lim.max_write_hints = ns->head->nr_plids;
> +	if (lim.max_write_hints)
> +		lim.features |= BLK_FEAT_PLACEMENT_HINTS;
>   	ret = queue_limits_commit_update(ns->disk->queue, &lim);
>   	if (ret) {
>   		blk_mq_unfreeze_queue(ns->disk->queue);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 093cb423f536b..cec8e5d96377b 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -454,6 +454,8 @@ struct nvme_ns_ids {
>   	u8	csi;
>   };
>   
> +#define NVME_MAX_PLIDS   (NVME_CTRL_PAGE_SIZE / sizeof(16))

this calculates how many plids can fit into the ctrl page size ?

sorry but I didn't understand sizeof(16) here, since plids are u16

nvme_ns_head -> u16 *plidsshould this be sizeof(u16) ? -ck


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 4/9] block: allow ability to limit partition write hints
  2024-10-29 17:25   ` Bart Van Assche
@ 2024-10-30  4:46     ` Christoph Hellwig
  2024-10-30 20:11       ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2024-10-30  4:46 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, hch, joshi.k, javier.gonz, Keith Busch

On Tue, Oct 29, 2024 at 10:25:11AM -0700, Bart Van Assche wrote:
>> +}
>
> bitmap_copy() is not atomic. Shouldn't the bitmap_copy() call be
> serialized against the code that tests bits in bdev->write_hint_mask?

It needs something.  I actually pointed that out last round, but forgot
about it again this time :)


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-29 16:22             ` Keith Busch
@ 2024-10-30  4:55               ` Christoph Hellwig
  2024-10-30 15:41                 ` Keith Busch
  2024-10-30 16:59                 ` Bart Van Assche
  0 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2024-10-30  4:55 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
	linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
	bvanassche, Hannes Reinecke

On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> On Tue, Oct 29, 2024 at 04:53:30PM +0100, Christoph Hellwig wrote:
> > On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> > > They're not exposed as write streams. Patch 7/9 sets the feature if it
> > > is a placement id or not, and only nvme sets it, so scsi's attributes
> > > are not claiming to be a write stream.
> > 
> > So it shows up in sysfs, but:
> > 
> >  - queue_max_write_hints (which really should be queue_max_write_streams)
> >    still picks it up, and from there the statx interface
> > 
> >  - per-inode fcntl hint that encode a temperature still magically
> >    get dumpted into the write streams if they are set.
> > 
> > In other words it's a really leaky half-backed abstraction.
> 
> Exactly why I asked last time: "who uses it and how do you want them to
> use it" :)

For the temperature hints the only public user I known is rocksdb, and
that only started working when Hans fixed a brown paperbag bug in the
rocksdb code a while ago.  Given that f2fs interprets the hints I suspect
something in the Android world does as well, maybe Bart knows more.

For the separate write streams the usage I want for them is poor mans
zones - e.g. write N LBAs sequentially into a separate write streams
and then eventually discard them together.  This will fit nicely into
f2fs and the pending xfs work as well as quite a few userspace storage
systems.  For that the file system or application needs to query
the number of available write streams (and in the bitmap world their
numbers of they are distontigous) and the size your can fit into the
"reclaim unit" in FDP terms.  I've not been bothering you much with
the latter as it is an easy retrofit once the I/O path bits lands.

> > Let's brainstorm how it could be done better:
> > 
> >  - the max_write_streams values only set by block devices that actually
> >    do support write streams, and not the fire and forget temperature
> >    hints.  They way this is queried is by having a non-zero value
> >    there, not need for an extra flag.
> 
> So we need a completely different attribute for SCSI's permanent write
> streams? You'd mentioned earlier you were okay with having SCSI be able
> to utilized per-io raw block write hints. Having multiple things to
> check for what are all just write classifiers seems unnecessarily
> complicated.

I don't think the multiple write streams interface applies to SCSIs
write streams, as they enforce a relative temperature, and they don't
have the concept of how much you can write into an "reclaim unit".

OTOH there isn't much you need to query for them anyway, as the
temperature hints have always been defined as pure hints with all
up and downsides of that.

> No need to create a new fcntl. The people already testing this are
> successfully using FDP with the existing fcntl hints. Their applications
> leverage FDP as way to separate files based on expected lifetime. It is
> how they want to use it and it is working above expectations. 

FYI, I think it's always fine and easy to map the temperature hits to
write streams if that's all the driver offers.  It loses a lot of the
capapilities, but as long as it doesn't enforce a lower level interface
that never exposes more that's fine.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-29 17:18     ` Bart Van Assche
@ 2024-10-30  5:42       ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2024-10-30  5:42 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
	linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
	Keith Busch, Hannes Reinecke

On Tue, Oct 29, 2024 at 10:18:31AM -0700, Bart Van Assche wrote:
> On 10/29/24 8:26 AM, Christoph Hellwig wrote:
>> Bart, btw: I think the current sd implementation is buggy as well, as
>> it assumes the permanent streams are ordered by their data temperature
>> in the IO Advise hints mode page, but I can't find anything in the
>> spec that requires a particular ordering.
>
> How about modifying sd_read_io_hints() such that permanent stream
> information is ignored if the order of the RELATIVE LIFETIME information
> reported by the GET STREAM STATUS command does not match the permanent
> stream order?

That seems odd as there is nothing implying that they should be ordered.
The logic thing to do would be to a little array mapping the linux
temperature levels to the streams ids.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-30  4:55               ` Christoph Hellwig
@ 2024-10-30 15:41                 ` Keith Busch
  2024-10-30 15:45                   ` Christoph Hellwig
  2024-10-30 16:59                 ` Bart Van Assche
  1 sibling, 1 reply; 42+ messages in thread
From: Keith Busch @ 2024-10-30 15:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke

On Wed, Oct 30, 2024 at 05:55:26AM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> 
> > No need to create a new fcntl. The people already testing this are
> > successfully using FDP with the existing fcntl hints. Their applications
> > leverage FDP as way to separate files based on expected lifetime. It is
> > how they want to use it and it is working above expectations. 
> 
> FYI, I think it's always fine and easy to map the temperature hits to
> write streams if that's all the driver offers.  It loses a lot of the
> capapilities, but as long as it doesn't enforce a lower level interface
> that never exposes more that's fine.

But that's just the v2 from this sequence:

https://lore.kernel.org/linux-nvme/[email protected]/

If you're okay with it now, then let's just go with that and I'm happy
continue iterating on the rest separately. 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-30 15:41                 ` Keith Busch
@ 2024-10-30 15:45                   ` Christoph Hellwig
  2024-10-30 15:48                     ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2024-10-30 15:45 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
	linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
	bvanassche, Hannes Reinecke

On Wed, Oct 30, 2024 at 09:41:39AM -0600, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 05:55:26AM +0100, Christoph Hellwig wrote:
> > On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> > 
> > > No need to create a new fcntl. The people already testing this are
> > > successfully using FDP with the existing fcntl hints. Their applications
> > > leverage FDP as way to separate files based on expected lifetime. It is
> > > how they want to use it and it is working above expectations. 
> > 
> > FYI, I think it's always fine and easy to map the temperature hits to
> > write streams if that's all the driver offers.  It loses a lot of the
> > capapilities, but as long as it doesn't enforce a lower level interface
> > that never exposes more that's fine.
> 
> But that's just the v2 from this sequence:
> 
> https://lore.kernel.org/linux-nvme/[email protected]/
> 
> If you're okay with it now, then let's just go with that and I'm happy
> continue iterating on the rest separately. 

That's exactly what I do not want - it takes the temperature hints
and force them into the write streams down in the driver with no
way to make actually useful use of the stream separation.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-30 15:45                   ` Christoph Hellwig
@ 2024-10-30 15:48                     ` Keith Busch
  2024-10-30 15:50                       ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2024-10-30 15:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke

On Wed, Oct 30, 2024 at 04:45:56PM +0100, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 09:41:39AM -0600, Keith Busch wrote:
> > On Wed, Oct 30, 2024 at 05:55:26AM +0100, Christoph Hellwig wrote:
> > > On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> > > 
> > > > No need to create a new fcntl. The people already testing this are
> > > > successfully using FDP with the existing fcntl hints. Their applications
> > > > leverage FDP as way to separate files based on expected lifetime. It is
> > > > how they want to use it and it is working above expectations. 
> > > 
> > > FYI, I think it's always fine and easy to map the temperature hits to
> > > write streams if that's all the driver offers.  It loses a lot of the
> > > capapilities, but as long as it doesn't enforce a lower level interface
> > > that never exposes more that's fine.
> > 
> > But that's just the v2 from this sequence:
> > 
> > https://lore.kernel.org/linux-nvme/[email protected]/
> > 
> > If you're okay with it now, then let's just go with that and I'm happy
> > continue iterating on the rest separately. 
> 
> That's exactly what I do not want - it takes the temperature hints
> and force them into the write streams down in the driver 

What??? You said to map the temperature hints to a write stream. The
driver offers that here. But you specifically don't want that? I'm so
confused.

> with no way to make actually useful use of the stream separation.

Have you tried it? The people who actually do easily demonstrate it is
in fact very useful.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-30 15:48                     ` Keith Busch
@ 2024-10-30 15:50                       ` Christoph Hellwig
  2024-10-30 16:42                         ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2024-10-30 15:50 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
	linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
	bvanassche, Hannes Reinecke

On Wed, Oct 30, 2024 at 09:48:39AM -0600, Keith Busch wrote:
> What??? You said to map the temperature hints to a write stream. The
> driver offers that here. But you specifically don't want that? I'm so
> confused.

In bdev/fops.c (or file systems if they want to do that) not down in the
driver forced down everyones throat.  Which was the whole point of the
discussion that we're running in circles here.

> > with no way to make actually useful use of the stream separation.
> 
> Have you tried it? The people who actually do easily demonstrate it is
> in fact very useful.

While I've read the claim multiple times, I've not actually seen any
numbers.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-30 15:50                       ` Christoph Hellwig
@ 2024-10-30 16:42                         ` Keith Busch
  2024-10-30 16:57                           ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2024-10-30 16:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke

On Wed, Oct 30, 2024 at 04:50:52PM +0100, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 09:48:39AM -0600, Keith Busch wrote:
> > What??? You said to map the temperature hints to a write stream. The
> > driver offers that here. But you specifically don't want that? I'm so
> > confused.
> 
> In bdev/fops.c (or file systems if they want to do that) not down in the
> driver forced down everyones throat.  Which was the whole point of the
> discussion that we're running in circles here.

That makes no sense. A change completely isolated to a driver isn't
forcing anything on anyone. It's the upper layers that's forcing this
down, whether the driver uses it or not: the hints are already getting
to the driver, but the driver currently doesn't use it. Finding a way to
use them is not some force to be demonized...

> > > with no way to make actually useful use of the stream separation.
> > 
> > Have you tried it? The people who actually do easily demonstrate it is
> > in fact very useful.
> 
> While I've read the claim multiple times, I've not actually seen any
> numbers.

Here's something recent from rocksdb developers running ycsb worklada
benchmark. The filesystem used is XFS.

It sets temperature hints for different SST levels, which already
happens today. The last data point made some minor changes with
level-to-hint mapping.

Without FDP:

WAF:        2.72
IOPS:       1465
READ LAT:   2681us
UPDATE LAT: 3115us

With FDP (rocksdb unmodified):

WAF:        2.26
IOPS:       1473
READ LAT:   2415us
UPDATE LAT: 2807us

With FDP (with some minor rocksdb changes):

WAF:        1.67
IOPS:       1547
READ LAT:   1978us
UPDATE LAT: 2267us

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-30 16:42                         ` Keith Busch
@ 2024-10-30 16:57                           ` Christoph Hellwig
  2024-10-30 17:05                             ` Keith Busch
                                               ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Christoph Hellwig @ 2024-10-30 16:57 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
	linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
	bvanassche, Hannes Reinecke

On Wed, Oct 30, 2024 at 10:42:59AM -0600, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 04:50:52PM +0100, Christoph Hellwig wrote:
> > On Wed, Oct 30, 2024 at 09:48:39AM -0600, Keith Busch wrote:
> > > What??? You said to map the temperature hints to a write stream. The
> > > driver offers that here. But you specifically don't want that? I'm so
> > > confused.
> > 
> > In bdev/fops.c (or file systems if they want to do that) not down in the
> > driver forced down everyones throat.  Which was the whole point of the
> > discussion that we're running in circles here.
> 
> That makes no sense. A change completely isolated to a driver isn't
> forcing anything on anyone. It's the upper layers that's forcing this
> down, whether the driver uses it or not: the hints are already getting
> to the driver, but the driver currently doesn't use it.

And once it uses by default, taking it away will have someone scream
regresion, because we're not taking it away form that super special
use case.

> Here's something recent from rocksdb developers running ycsb worklada
> benchmark. The filesystem used is XFS.

Thanks for finally putting something up.

> It sets temperature hints for different SST levels, which already
> happens today. The last data point made some minor changes with
> level-to-hint mapping.

Do you have a pointer to the changes?

> Without FDP:
> 
> WAF:        2.72
> IOPS:       1465
> READ LAT:   2681us
> UPDATE LAT: 3115us
> 
> With FDP (rocksdb unmodified):
> 
> WAF:        2.26
> IOPS:       1473
> READ LAT:   2415us
> UPDATE LAT: 2807us
> 
> With FDP (with some minor rocksdb changes):
> 
> WAF:        1.67
> IOPS:       1547
> READ LAT:   1978us
> UPDATE LAT: 2267us

Compared to the Numbers Hans presented at Plumbers for the Zoned XFS code,
which should work just fine with FDP IFF we exposed real write streams,
which roughly double read nad wirte IOPS and reduce the WAF to almost
1 this doesn't look too spectacular to be honest, but it sure it something.

I just wish we could get the real infraѕtructure instead of some band
aid, which makes it really hard to expose the real thing because now
it's been taken up and directly wired to a UAPI.
one

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-30  4:55               ` Christoph Hellwig
  2024-10-30 15:41                 ` Keith Busch
@ 2024-10-30 16:59                 ` Bart Van Assche
  2024-10-30 17:14                   ` Christoph Hellwig
  1 sibling, 1 reply; 42+ messages in thread
From: Bart Van Assche @ 2024-10-30 16:59 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, joshi.k, javier.gonz, Hannes Reinecke,
	Martin K . Petersen


On 10/29/24 9:55 PM, Christoph Hellwig wrote:
> For the temperature hints the only public user I known is rocksdb, and
> that only started working when Hans fixed a brown paperbag bug in the
> rocksdb code a while ago.  Given that f2fs interprets the hints I suspect
> something in the Android world does as well, maybe Bart knows more.

UFS devices typically have less internal memory (SRAM) than the size of 
a single zone. Hence, it helps UFS devices if it can be decided at the
time a write command is received where to send the data (SRAM, SLC NAND
or TLC NAND). This is why UFS vendors asked to provide data lifetime
information to zoned logical units. More information about UFS device
internals is available in this paper: Hwang, Joo-Young, Seokhwan Kim,
Daejun Park, Yong-Gil Song, Junyoung Han, Seunghyun Choi, Sangyeun Cho,
and Youjip Won. "{ZMS}: Zone Abstraction for Mobile Flash Storage." In
2024 USENIX Annual Technical Conference (USENIX ATC 24), pp. 173-189.
2024 (https://www.usenix.org/system/files/atc24-hwang.pdf).

Bart.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-30 16:57                           ` Christoph Hellwig
@ 2024-10-30 17:05                             ` Keith Busch
  2024-10-30 17:15                               ` Christoph Hellwig
  2024-10-30 17:23                             ` Keith Busch
  2024-10-30 22:32                             ` Keith Busch
  2 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2024-10-30 17:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke

On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> And once it uses by default, taking it away will have someone scream
> regresion, because we're not taking it away form that super special
> use case.

Refusing to allow something because someone might find it useful has got
to be the worst reasoning I've heard. :)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-30 16:59                 ` Bart Van Assche
@ 2024-10-30 17:14                   ` Christoph Hellwig
  2024-10-30 17:44                     ` Bart Van Assche
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2024-10-30 17:14 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Keith Busch, Keith Busch, linux-block,
	linux-nvme, linux-scsi, io-uring, linux-fsdevel, joshi.k,
	javier.gonz, Hannes Reinecke, Martin K . Petersen

On Wed, Oct 30, 2024 at 09:59:24AM -0700, Bart Van Assche wrote:
>
> On 10/29/24 9:55 PM, Christoph Hellwig wrote:
>> For the temperature hints the only public user I known is rocksdb, and
>> that only started working when Hans fixed a brown paperbag bug in the
>> rocksdb code a while ago.  Given that f2fs interprets the hints I suspect
>> something in the Android world does as well, maybe Bart knows more.
>
> UFS devices typically have less internal memory (SRAM) than the size of a 
> single zone.

That wasn't quite the question.  Do you know what application in android
set the fcntl temperature hints?


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-30 17:05                             ` Keith Busch
@ 2024-10-30 17:15                               ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2024-10-30 17:15 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
	linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
	bvanassche, Hannes Reinecke

On Wed, Oct 30, 2024 at 11:05:03AM -0600, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> > And once it uses by default, taking it away will have someone scream
> > regresion, because we're not taking it away form that super special
> > use case.
> 
> Refusing to allow something because someone might find it useful has got
> to be the worst reasoning I've heard. :)

That's not that point.  The point is by locking us in we can't actually
do the proper thing.  And that's what I'm really worried about.
Especially with the not all that great numbers in the success story.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-30 16:57                           ` Christoph Hellwig
  2024-10-30 17:05                             ` Keith Busch
@ 2024-10-30 17:23                             ` Keith Busch
  2024-10-30 22:32                             ` Keith Busch
  2 siblings, 0 replies; 42+ messages in thread
From: Keith Busch @ 2024-10-30 17:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke

On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> > On Wed, Oct 30, 2024 at 04:50:52PM +0100, Christoph Hellwig wrote:
> 
> > It sets temperature hints for different SST levels, which already
> > happens today. The last data point made some minor changes with
> > level-to-hint mapping.
> 
> Do you have a pointer to the changes?

The change moves levels 2 and 3 to "MEDIUM" (along with 0 and 1 already
there), 4 to "LONG", and >= 5 remain "EXTREME".

WAL continues to be "SHORT", as before.
 
> > Without FDP:
> > 
> > WAF:        2.72
> > IOPS:       1465
> > READ LAT:   2681us
> > UPDATE LAT: 3115us
> > 
> > With FDP (rocksdb unmodified):
> > 
> > WAF:        2.26
> > IOPS:       1473
> > READ LAT:   2415us
> > UPDATE LAT: 2807us
> > 
> > With FDP (with some minor rocksdb changes):
> > 
> > WAF:        1.67
> > IOPS:       1547
> > READ LAT:   1978us
> > UPDATE LAT: 2267us
> 
> Compared to the Numbers Hans presented at Plumbers for the Zoned XFS code,
> which should work just fine with FDP IFF we exposed real write streams,
> which roughly double read nad wirte IOPS and reduce the WAF to almost
> 1 this doesn't look too spectacular to be honest, but it sure it something.
> 
> I just wish we could get the real infraѕtructure instead of some band
> aid, which makes it really hard to expose the real thing because now
> it's been taken up and directly wired to a UAPI.
> one

This doesn't have to be the end placement streams development. I
fundamentally disagree that this locks anyone in to anything.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-30 17:14                   ` Christoph Hellwig
@ 2024-10-30 17:44                     ` Bart Van Assche
  0 siblings, 0 replies; 42+ messages in thread
From: Bart Van Assche @ 2024-10-30 17:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Keith Busch, linux-block, linux-nvme, linux-scsi,
	io-uring, linux-fsdevel, joshi.k, javier.gonz, Hannes Reinecke,
	Martin K . Petersen, Jaegeuk Kim

On 10/30/24 10:14 AM, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 09:59:24AM -0700, Bart Van Assche wrote:
>>
>> On 10/29/24 9:55 PM, Christoph Hellwig wrote:
>>> For the temperature hints the only public user I known is rocksdb, and
>>> that only started working when Hans fixed a brown paperbag bug in the
>>> rocksdb code a while ago.  Given that f2fs interprets the hints I suspect
>>> something in the Android world does as well, maybe Bart knows more.
>>
>> UFS devices typically have less internal memory (SRAM) than the size of a
>> single zone.
> 
> That wasn't quite the question.  Do you know what application in android
> set the fcntl temperature hints?

I do not know whether there are any Android apps that use the
F_SET_(FILE_|)RW_HINT fcntls.

The only use case in Android platform code I know of is this one: Daejun
Park, "f2fs-tools: add write hint support", f2fs-dev mailing list,
September 2024 
(https://lore.kernel.org/all/20240904011217epcms2p5a1b15db8e0ae50884429da7be4af4de4@epcms2p5/T/).
As you probably know f2fs-tools is a software package that includes
fsck.f2fs.

Jaegeuk, please correct me if necessary.

Bart.




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 4/9] block: allow ability to limit partition write hints
  2024-10-30  4:46     ` Christoph Hellwig
@ 2024-10-30 20:11       ` Keith Busch
  2024-10-30 20:26         ` Bart Van Assche
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2024-10-30 20:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Keith Busch, linux-block, linux-nvme, linux-scsi,
	io-uring, linux-fsdevel, joshi.k, javier.gonz

On Wed, Oct 30, 2024 at 05:46:58AM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 10:25:11AM -0700, Bart Van Assche wrote:
> >> +}
> >
> > bitmap_copy() is not atomic. Shouldn't the bitmap_copy() call be
> > serialized against the code that tests bits in bdev->write_hint_mask?
> 
> It needs something.  I actually pointed that out last round, but forgot
> about it again this time :)

I disagree. Whether we serialize it or not, writes in flight will either
think it can write or it won't. There's no point adding any overhead to
the IO path for this as you can't stop ending up with inflight writes
using the tag you're trying to turn off.

This is the same as the "ro" attribute. If you're changing it with
BLKROSET ioctl with writes in flight, some of those writes may get past
blkdev_write_iter's read-only check, others may not. No serialization
done there, and this pretty much the same thing.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 4/9] block: allow ability to limit partition write hints
  2024-10-30 20:11       ` Keith Busch
@ 2024-10-30 20:26         ` Bart Van Assche
  2024-10-30 20:37           ` Keith Busch
  0 siblings, 1 reply; 42+ messages in thread
From: Bart Van Assche @ 2024-10-30 20:26 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, joshi.k, javier.gonz

On 10/30/24 1:11 PM, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 05:46:58AM +0100, Christoph Hellwig wrote:
>> On Tue, Oct 29, 2024 at 10:25:11AM -0700, Bart Van Assche wrote:
>>>> +}
>>>
>>> bitmap_copy() is not atomic. Shouldn't the bitmap_copy() call be
>>> serialized against the code that tests bits in bdev->write_hint_mask?
>>
>> It needs something.  I actually pointed that out last round, but forgot
>> about it again this time :)
> 
> I disagree. Whether we serialize it or not, writes in flight will either
> think it can write or it won't. There's no point adding any overhead to
> the IO path for this as you can't stop ending up with inflight writes
> using the tag you're trying to turn off.

Shouldn't the request queue be frozen while this write_hint_mask bitmap
is modified, just like the request queue is frozen while queue limits
are updated? This change wouldn't add any additional overhead to the I/O
path.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 4/9] block: allow ability to limit partition write hints
  2024-10-30 20:26         ` Bart Van Assche
@ 2024-10-30 20:37           ` Keith Busch
  2024-10-30 21:15             ` Bart Van Assche
  0 siblings, 1 reply; 42+ messages in thread
From: Keith Busch @ 2024-10-30 20:37 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
	linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz

On Wed, Oct 30, 2024 at 01:26:38PM -0700, Bart Van Assche wrote:
> On 10/30/24 1:11 PM, Keith Busch wrote:
> > On Wed, Oct 30, 2024 at 05:46:58AM +0100, Christoph Hellwig wrote:
> > > On Tue, Oct 29, 2024 at 10:25:11AM -0700, Bart Van Assche wrote:
> > > > > +}
> > > > 
> > > > bitmap_copy() is not atomic. Shouldn't the bitmap_copy() call be
> > > > serialized against the code that tests bits in bdev->write_hint_mask?
> > > 
> > > It needs something.  I actually pointed that out last round, but forgot
> > > about it again this time :)
> > 
> > I disagree. Whether we serialize it or not, writes in flight will either
> > think it can write or it won't. There's no point adding any overhead to
> > the IO path for this as you can't stop ending up with inflight writes
> > using the tag you're trying to turn off.
> 
> Shouldn't the request queue be frozen while this write_hint_mask bitmap
> is modified, just like the request queue is frozen while queue limits
> are updated? This change wouldn't add any additional overhead to the I/O
> path.

The partitions don't have a queue. If we need to freeze, then changing
one partition's available hints harms IO to other partitions.

Also, block direct IO creates the bio before it freezes. Freezing would
only get writes using the hint you're trying to disable queue up after
all the checks have been done, so you still can't stop making inflight
writes with freeze.

But if by "not atomic", if you're just saying we need a barrier on the
bitmap_copy, like smp_mb__after_atomic(), then yeah, I see that's
probably appropriate here.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 4/9] block: allow ability to limit partition write hints
  2024-10-30 20:37           ` Keith Busch
@ 2024-10-30 21:15             ` Bart Van Assche
  0 siblings, 0 replies; 42+ messages in thread
From: Bart Van Assche @ 2024-10-30 21:15 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
	linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz

On 10/30/24 1:37 PM, Keith Busch wrote:
> But if by "not atomic", if you're just saying we need a barrier on the
> bitmap_copy, like smp_mb__after_atomic(), then yeah, I see that's
> probably appropriate here.

smp_mb__after_atomic() follows atomic operations. bitmap_copy() does not
use any kind of atomic operation.

I'm wondering whether introducing a variant of bitmap_copy() that uses
WRITE_ONCE() or smp_store_release() would be appropriate.

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
  2024-10-30 16:57                           ` Christoph Hellwig
  2024-10-30 17:05                             ` Keith Busch
  2024-10-30 17:23                             ` Keith Busch
@ 2024-10-30 22:32                             ` Keith Busch
  2 siblings, 0 replies; 42+ messages in thread
From: Keith Busch @ 2024-10-30 22:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke

On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 10:42:59AM -0600, Keith Busch wrote:
> > With FDP (with some minor rocksdb changes):
> > 
> > WAF:        1.67
> > IOPS:       1547
> > READ LAT:   1978us
> > UPDATE LAT: 2267us
> 
> Compared to the Numbers Hans presented at Plumbers for the Zoned XFS code,
> which should work just fine with FDP IFF we exposed real write streams,
> which roughly double read nad wirte IOPS and reduce the WAF to almost
> 1 this doesn't look too spectacular to be honest, but it sure it something.

Hold up... I absolutely appreciate the work Hans is and has done. But
are you talking about this talk?

https://lpc.events/event/18/contributions/1822/attachments/1464/3105/Zoned%20XFS%20LPC%20Zoned%20MC%202024%20V1.pdf

That is very much apples-to-oranges. The B+ isn't on the same device
being evaluated for WAF, where this has all that mixed in. I think the
results are pretty good, all things considered.
 
> I just wish we could get the real infraѕtructure instead of some band
> aid, which makes it really hard to expose the real thing because now
> it's been taken up and directly wired to a UAPI.
> one

I don't know what make of this. I think we're talking past each other.

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2024-10-30 22:33 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-29 15:19 [PATCHv10 0/9] write hints with nvme fdp, scsi streams Keith Busch
2024-10-29 15:19 ` [PATCHv10 1/9] block: use generic u16 for write hints Keith Busch
2024-10-29 17:21   ` Bart Van Assche
2024-10-29 15:19 ` [PATCHv10 2/9] block: introduce max_write_hints queue limit Keith Busch
2024-10-29 15:19 ` [PATCHv10 3/9] statx: add write hint information Keith Busch
2024-10-29 15:19 ` [PATCHv10 4/9] block: allow ability to limit partition write hints Keith Busch
2024-10-29 15:23   ` Christoph Hellwig
2024-10-29 17:25   ` Bart Van Assche
2024-10-30  4:46     ` Christoph Hellwig
2024-10-30 20:11       ` Keith Busch
2024-10-30 20:26         ` Bart Van Assche
2024-10-30 20:37           ` Keith Busch
2024-10-30 21:15             ` Bart Van Assche
2024-10-29 15:19 ` [PATCHv10 5/9] block, fs: add write hint to kiocb Keith Busch
2024-10-29 15:19 ` [PATCHv10 6/9] io_uring: enable per-io hinting capability Keith Busch
2024-10-29 15:19 ` [PATCHv10 7/9] block: export placement hint feature Keith Busch
2024-10-29 15:19 ` [PATCHv10 8/9] nvme: enable FDP support Keith Busch
2024-10-30  0:24   ` Chaitanya Kulkarni
2024-10-29 15:19 ` [PATCHv10 9/9] scsi: set permanent stream count in block limits Keith Busch
2024-10-29 15:26   ` Christoph Hellwig
2024-10-29 15:34     ` Keith Busch
2024-10-29 15:37       ` Christoph Hellwig
2024-10-29 15:38         ` Keith Busch
2024-10-29 15:53           ` Christoph Hellwig
2024-10-29 16:22             ` Keith Busch
2024-10-30  4:55               ` Christoph Hellwig
2024-10-30 15:41                 ` Keith Busch
2024-10-30 15:45                   ` Christoph Hellwig
2024-10-30 15:48                     ` Keith Busch
2024-10-30 15:50                       ` Christoph Hellwig
2024-10-30 16:42                         ` Keith Busch
2024-10-30 16:57                           ` Christoph Hellwig
2024-10-30 17:05                             ` Keith Busch
2024-10-30 17:15                               ` Christoph Hellwig
2024-10-30 17:23                             ` Keith Busch
2024-10-30 22:32                             ` Keith Busch
2024-10-30 16:59                 ` Bart Van Assche
2024-10-30 17:14                   ` Christoph Hellwig
2024-10-30 17:44                     ` Bart Van Assche
2024-10-29 17:18     ` Bart Van Assche
2024-10-30  5:42       ` Christoph Hellwig
2024-10-29 15:24 ` [PATCHv10 0/9] write hints with nvme fdp, scsi streams Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox