public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCHv9 0/7] write hints with nvme fdp, scsi streams
@ 2024-10-25 21:36 Keith Busch
  2024-10-25 21:36 ` [PATCHv9 1/7] block: use generic u16 for write hints Keith Busch
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Keith Busch @ 2024-10-25 21:36 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

From: Keith Busch <[email protected]>

A little something for everyone here.

Upfront, I really didn't get the feedback about setting different flags
for stream vs. temperature support. Who wants to use it, and where and
how is that information used?

Changes from v8:

  Added reviews.

  Removed an unused header.

  Changed "hint" to "streams" in the commit logs.

  Ability to split available hints that a partition can use.

  Dropped all the generic filesystem changes that were defaulting to the
  kiocb write_hint. They are unchanged, which having no functional
  change was really the intention anyway, so let's just not change them.

  The above means we don't need a special fop flag to indicate support
  for the kiocb write_hint. Those filesystems that don't support it
  simply don't read it.

  Added the SCSI support since I had to read the spec anyway, and it is
  just a one-line change.

Kanchan Joshi (2):
  io_uring: enable per-io hinting capability
  nvme: enable FDP support

Keith Busch (5):
  block: use generic u16 for write hints
  block: introduce max_write_hints queue limit
  block: allow ability to limit partition write hints
  block, fs: add write hint to kiocb
  scsi: set permanent stream count in block limits

 Documentation/ABI/stable/sysfs-block |  7 +++
 block/bdev.c                         | 15 +++++
 block/blk-settings.c                 |  3 +
 block/blk-sysfs.c                    |  3 +
 block/fops.c                         | 26 ++++++++-
 block/partitions/core.c              | 46 +++++++++++++++-
 drivers/nvme/host/core.c             | 82 ++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h             |  5 ++
 drivers/scsi/sd.c                    |  2 +
 include/linux/blk-mq.h               |  3 +-
 include/linux/blk_types.h            |  4 +-
 include/linux/blkdev.h               | 12 ++++
 include/linux/fs.h                   |  1 +
 include/linux/nvme.h                 | 19 +++++++
 include/uapi/linux/io_uring.h        |  4 ++
 io_uring/rw.c                        |  3 +-
 16 files changed, 225 insertions(+), 10 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCHv9 1/7] block: use generic u16 for write hints
  2024-10-25 21:36 [PATCHv9 0/7] write hints with nvme fdp, scsi streams Keith Busch
@ 2024-10-25 21:36 ` Keith Busch
  2024-10-28 18:19   ` Bart Van Assche
  2024-10-25 21:36 ` [PATCHv9 2/7] block: introduce max_write_hints queue limit Keith Busch
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2024-10-25 21:36 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch,
	Hannes Reinecke

From: Keith Busch <[email protected]>

This is still backwards compatible with lifetime hints. It just doesn't
constrain the hints to that definition.

Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
 include/linux/blk-mq.h    | 3 +--
 include/linux/blk_types.h | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 59e9adf815a49..bf007a4081d9b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -8,7 +8,6 @@
 #include <linux/scatterlist.h>
 #include <linux/prefetch.h>
 #include <linux/srcu.h>
-#include <linux/rw_hint.h>
 
 struct blk_mq_tags;
 struct blk_flush_queue;
@@ -156,7 +155,7 @@ struct request {
 	struct blk_crypto_keyslot *crypt_keyslot;
 #endif
 
-	enum rw_hint write_hint;
+	unsigned short write_hint;
 	unsigned short ioprio;
 
 	enum mq_rq_state state;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index dce7615c35e7e..6737795220e18 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -10,7 +10,6 @@
 #include <linux/bvec.h>
 #include <linux/device.h>
 #include <linux/ktime.h>
-#include <linux/rw_hint.h>
 
 struct bio_set;
 struct bio;
@@ -219,7 +218,7 @@ struct bio {
 						 */
 	unsigned short		bi_flags;	/* BIO_* below */
 	unsigned short		bi_ioprio;
-	enum rw_hint		bi_write_hint;
+	unsigned short		bi_write_hint;
 	blk_status_t		bi_status;
 	atomic_t		__bi_remaining;
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCHv9 2/7] block: introduce max_write_hints queue limit
  2024-10-25 21:36 [PATCHv9 0/7] write hints with nvme fdp, scsi streams Keith Busch
  2024-10-25 21:36 ` [PATCHv9 1/7] block: use generic u16 for write hints Keith Busch
@ 2024-10-25 21:36 ` Keith Busch
  2024-10-28 11:51   ` Christoph Hellwig
  2024-10-25 21:36 ` [PATCHv9 3/7] block: allow ability to limit partition write hints Keith Busch
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2024-10-25 21:36 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch,
	Hannes Reinecke

From: Keith Busch <[email protected]>

Drivers with hardware that support write streams need a way to export how
many are available so applications can generically query this.

Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
 Documentation/ABI/stable/sysfs-block |  7 +++++++
 block/blk-settings.c                 |  3 +++
 block/blk-sysfs.c                    |  3 +++
 include/linux/blkdev.h               | 12 ++++++++++++
 4 files changed, 25 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 8353611107154..f2db2cabb8e75 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -506,6 +506,13 @@ Description:
 		[RO] Maximum size in bytes of a single element in a DMA
 		scatter/gather list.
 
+What:		/sys/block/<disk>/queue/max_write_hints
+Date:		October 2024
+Contact:	[email protected]
+Description:
+		[RO] Maximum number of write hints supported, 0 if not
+		supported. If supported, valid values are 1 through
+		max_write_hints, inclusive.
 
 What:		/sys/block/<disk>/queue/max_segments
 Date:		March 2010
diff --git a/block/blk-settings.c b/block/blk-settings.c
index a446654ddee5e..921fb4d334fa4 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -43,6 +43,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
 
 	/* Inherit limits from component devices */
+	lim->max_write_hints = USHRT_MAX;
 	lim->max_segments = USHRT_MAX;
 	lim->max_discard_segments = USHRT_MAX;
 	lim->max_hw_sectors = UINT_MAX;
@@ -544,6 +545,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	t->max_segment_size = min_not_zero(t->max_segment_size,
 					   b->max_segment_size);
 
+	t->max_write_hints = min(t->max_write_hints, b->max_write_hints);
+
 	alignment = queue_limit_alignment_offset(b, start);
 
 	/* Bottom device has different alignment.  Check that it is
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 741b95dfdbf6f..85f48ca461049 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -104,6 +104,7 @@ QUEUE_SYSFS_LIMIT_SHOW(max_segments)
 QUEUE_SYSFS_LIMIT_SHOW(max_discard_segments)
 QUEUE_SYSFS_LIMIT_SHOW(max_integrity_segments)
 QUEUE_SYSFS_LIMIT_SHOW(max_segment_size)
+QUEUE_SYSFS_LIMIT_SHOW(max_write_hints)
 QUEUE_SYSFS_LIMIT_SHOW(logical_block_size)
 QUEUE_SYSFS_LIMIT_SHOW(physical_block_size)
 QUEUE_SYSFS_LIMIT_SHOW(chunk_sectors)
@@ -457,6 +458,7 @@ QUEUE_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb");
 QUEUE_RO_ENTRY(queue_max_segments, "max_segments");
 QUEUE_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
 QUEUE_RO_ENTRY(queue_max_segment_size, "max_segment_size");
+QUEUE_RO_ENTRY(queue_max_write_hints, "max_write_hints");
 QUEUE_RW_LOAD_MODULE_ENTRY(elv_iosched, "scheduler");
 
 QUEUE_RO_ENTRY(queue_logical_block_size, "logical_block_size");
@@ -591,6 +593,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_max_discard_segments_entry.attr,
 	&queue_max_integrity_segments_entry.attr,
 	&queue_max_segment_size_entry.attr,
+	&queue_max_write_hints_entry.attr,
 	&queue_hw_sector_size_entry.attr,
 	&queue_logical_block_size_entry.attr,
 	&queue_physical_block_size_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 55bec14fe55f9..a8ad41ee07234 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -393,6 +393,8 @@ struct queue_limits {
 	unsigned short		max_integrity_segments;
 	unsigned short		max_discard_segments;
 
+	unsigned short		max_write_hints;
+
 	unsigned int		max_open_zones;
 	unsigned int		max_active_zones;
 
@@ -1183,6 +1185,11 @@ static inline unsigned short queue_max_segments(const struct request_queue *q)
 	return q->limits.max_segments;
 }
 
+static inline unsigned short queue_max_write_hints(struct request_queue *q)
+{
+	return q->limits.max_write_hints;
+}
+
 static inline unsigned short queue_max_discard_segments(const struct request_queue *q)
 {
 	return q->limits.max_discard_segments;
@@ -1230,6 +1237,11 @@ static inline unsigned int bdev_max_segments(struct block_device *bdev)
 	return queue_max_segments(bdev_get_queue(bdev));
 }
 
+static inline unsigned short bdev_max_write_hints(struct block_device *bdev)
+{
+	return queue_max_write_hints(bdev_get_queue(bdev));
+}
+
 static inline unsigned queue_logical_block_size(const struct request_queue *q)
 {
 	return q->limits.logical_block_size;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCHv9 3/7] block: allow ability to limit partition write hints
  2024-10-25 21:36 [PATCHv9 0/7] write hints with nvme fdp, scsi streams Keith Busch
  2024-10-25 21:36 ` [PATCHv9 1/7] block: use generic u16 for write hints Keith Busch
  2024-10-25 21:36 ` [PATCHv9 2/7] block: introduce max_write_hints queue limit Keith Busch
@ 2024-10-25 21:36 ` Keith Busch
  2024-10-28 11:58   ` Christoph Hellwig
                     ` (2 more replies)
  2024-10-25 21:36 ` [PATCHv9 4/7] block, fs: add write hint to kiocb Keith Busch
                   ` (4 subsequent siblings)
  7 siblings, 3 replies; 24+ messages in thread
From: Keith Busch @ 2024-10-25 21:36 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

From: Keith Busch <[email protected]>

When multiple partitions are used, you may want to enforce different
subsets of the available write hints for each partition. Provide a
bitmap attribute of the available write hints, and allow an admin to
write a different mask to set the partition's allowed write hints.

Signed-off-by: Keith Busch <[email protected]>
---
 block/bdev.c              | 15 +++++++++++++
 block/partitions/core.c   | 46 +++++++++++++++++++++++++++++++++++++--
 include/linux/blk_types.h |  1 +
 3 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 738e3c8457e7f..5d23648db457b 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -414,6 +414,7 @@ void __init bdev_cache_init(void)
 
 struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 {
+	unsigned short max_write_hints;
 	struct block_device *bdev;
 	struct inode *inode;
 
@@ -440,6 +441,20 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 		return NULL;
 	}
 	bdev->bd_disk = disk;
+
+	max_write_hints = bdev_max_write_hints(bdev);
+	if (max_write_hints) {
+		int size = BITS_TO_LONGS(max_write_hints) * sizeof(long);
+
+		bdev->write_hint_mask = kmalloc(size, GFP_KERNEL);
+		if (!bdev->write_hint_mask) {
+			free_percpu(bdev->bd_stats);
+			iput(inode);
+			return NULL;
+		}
+		memset(bdev->write_hint_mask, 0xff, size);
+	}
+
 	return bdev;
 }
 
diff --git a/block/partitions/core.c b/block/partitions/core.c
index 815ed33caa1b8..c0ea0a7b6fa87 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -203,6 +203,42 @@ static ssize_t part_discard_alignment_show(struct device *dev,
 	return sprintf(buf, "%u\n", bdev_discard_alignment(dev_to_bdev(dev)));
 }
 
+static ssize_t part_write_hint_mask_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	struct block_device *bdev = dev_to_bdev(dev);
+	unsigned short max_write_hints = bdev_max_write_hints(bdev);
+
+	if (max_write_hints)
+		return sprintf(buf, "%*pb\n", max_write_hints, bdev->write_hint_mask);
+	else
+		return sprintf(buf, "0");
+}
+
+static ssize_t part_write_hint_mask_store(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct block_device *bdev = dev_to_bdev(dev);
+	unsigned short max_write_hints = bdev_max_write_hints(bdev);
+	unsigned long *new_mask;
+	int size;
+
+	if (!max_write_hints)
+		return count;
+
+	size = BITS_TO_LONGS(max_write_hints) * sizeof(long);
+	new_mask = kzalloc(size, GFP_KERNEL);
+	if (!new_mask)
+		return -ENOMEM;
+
+	bitmap_parse(buf, count, new_mask, max_write_hints);
+	bitmap_copy(bdev->write_hint_mask, new_mask, max_write_hints);
+
+	return count;
+}
+
 static DEVICE_ATTR(partition, 0444, part_partition_show, NULL);
 static DEVICE_ATTR(start, 0444, part_start_show, NULL);
 static DEVICE_ATTR(size, 0444, part_size_show, NULL);
@@ -211,6 +247,8 @@ static DEVICE_ATTR(alignment_offset, 0444, part_alignment_offset_show, NULL);
 static DEVICE_ATTR(discard_alignment, 0444, part_discard_alignment_show, NULL);
 static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
 static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
+static DEVICE_ATTR(write_hint_mask, 0644, part_write_hint_mask_show,
+		   part_write_hint_mask_store);
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 static struct device_attribute dev_attr_fail =
 	__ATTR(make-it-fail, 0644, part_fail_show, part_fail_store);
@@ -225,6 +263,7 @@ static struct attribute *part_attrs[] = {
 	&dev_attr_discard_alignment.attr,
 	&dev_attr_stat.attr,
 	&dev_attr_inflight.attr,
+	&dev_attr_write_hint_mask.attr,
 #ifdef CONFIG_FAIL_MAKE_REQUEST
 	&dev_attr_fail.attr,
 #endif
@@ -245,8 +284,11 @@ static const struct attribute_group *part_attr_groups[] = {
 
 static void part_release(struct device *dev)
 {
-	put_disk(dev_to_bdev(dev)->bd_disk);
-	bdev_drop(dev_to_bdev(dev));
+	struct block_device *part = dev_to_bdev(dev);
+
+	kfree(part->write_hint_mask);
+	put_disk(part->bd_disk);
+	bdev_drop(part);
 }
 
 static int part_uevent(const struct device *dev, struct kobj_uevent_env *env)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 6737795220e18..af430e543f7f7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -73,6 +73,7 @@ struct block_device {
 #ifdef CONFIG_SECURITY
 	void			*bd_security;
 #endif
+	unsigned long		*write_hint_mask;
 	/*
 	 * keep this out-of-line as it's both big and not needed in the fast
 	 * path
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCHv9 4/7] block, fs: add write hint to kiocb
  2024-10-25 21:36 [PATCHv9 0/7] write hints with nvme fdp, scsi streams Keith Busch
                   ` (2 preceding siblings ...)
  2024-10-25 21:36 ` [PATCHv9 3/7] block: allow ability to limit partition write hints Keith Busch
@ 2024-10-25 21:36 ` Keith Busch
  2024-10-28 11:59   ` Christoph Hellwig
  2024-10-25 21:36 ` [PATCHv9 5/7] io_uring: enable per-io hinting capability Keith Busch
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2024-10-25 21:36 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

From: Keith Busch <[email protected]>

This prepares for sources other than the inode to provide a write hint.
The block layer will use it for direct IO if the requested hint is
within the block device's capabilities.

Signed-off-by: Keith Busch <[email protected]>
---
 block/fops.c       | 26 +++++++++++++++++++++++---
 include/linux/fs.h |  1 +
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 2d01c90076813..e3f3f1957d86d 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -71,7 +71,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 		bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
 	}
 	bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio.bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio.bi_write_hint = iocb->ki_write_hint;
 	bio.bi_ioprio = iocb->ki_ioprio;
 	if (iocb->ki_flags & IOCB_ATOMIC)
 		bio.bi_opf |= REQ_ATOMIC;
@@ -200,7 +200,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 
 	for (;;) {
 		bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-		bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+		bio->bi_write_hint = iocb->ki_write_hint;
 		bio->bi_private = dio;
 		bio->bi_end_io = blkdev_bio_end_io;
 		bio->bi_ioprio = iocb->ki_ioprio;
@@ -316,7 +316,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	dio->flags = 0;
 	dio->iocb = iocb;
 	bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio->bi_write_hint = iocb->ki_write_hint;
 	bio->bi_end_io = blkdev_bio_end_io_async;
 	bio->bi_ioprio = iocb->ki_ioprio;
 
@@ -362,6 +362,23 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	return -EIOCBQUEUED;
 }
 
+static u16 blkdev_write_hint(struct kiocb *iocb, struct block_device *bdev)
+{
+	u16 hint = iocb->ki_write_hint;
+
+	if (!hint)
+		return file_inode(iocb->ki_filp)->i_write_hint;
+
+	if (hint > bdev_max_write_hints(bdev))
+		return file_inode(iocb->ki_filp)->i_write_hint;
+
+	if (bdev_is_partition(bdev) &&
+	    !test_bit(hint - 1, bdev->write_hint_mask))
+		return file_inode(iocb->ki_filp)->i_write_hint;
+
+	return hint;
+}
+
 static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
@@ -373,6 +390,9 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	if (blkdev_dio_invalid(bdev, iocb, iter))
 		return -EINVAL;
 
+	if (iov_iter_rw(iter) == WRITE)
+		iocb->ki_write_hint = blkdev_write_hint(iocb, bdev);
+
 	nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1);
 	if (likely(nr_pages <= BIO_MAX_VECS)) {
 		if (is_sync_kiocb(iocb))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4b5cad44a1268..1a00accf412e5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -370,6 +370,7 @@ struct kiocb {
 	void			*private;
 	int			ki_flags;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	u16			ki_write_hint;
 	union {
 		/*
 		 * Only used for async buffered reads, where it denotes the
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCHv9 5/7] io_uring: enable per-io hinting capability
  2024-10-25 21:36 [PATCHv9 0/7] write hints with nvme fdp, scsi streams Keith Busch
                   ` (3 preceding siblings ...)
  2024-10-25 21:36 ` [PATCHv9 4/7] block, fs: add write hint to kiocb Keith Busch
@ 2024-10-25 21:36 ` Keith Busch
  2024-10-29 12:46   ` Anuj gupta
  2024-10-25 21:36 ` [PATCHv9 6/7] nvme: enable FDP support Keith Busch
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2024-10-25 21:36 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche,
	Hannes Reinecke, Nitesh Shetty, Keith Busch

From: Kanchan Joshi <[email protected]>

With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and
all the subsequent writes on the file pass that hint value down. This
can be limiting for block device as all the writes will be tagged with
only one lifetime hint value. Concurrent writes (with different hint
values) are hard to manage. Per-IO hinting solves that problem.

Allow userspace to pass additional metadata in the SQE.

	__u16 write_hint;

If the hint is provided, filesystems may optionally use it. A filesytem
may ignore this field if it does not support per-io hints, or if the
value is invalid for its backing storage. Just like the inode hints,
requesting values that are not supported by the hardware are not an
error.

Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
 include/uapi/linux/io_uring.h | 4 ++++
 io_uring/rw.c                 | 3 ++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 60b9c98595faf..8cdcc461d464c 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -92,6 +92,10 @@ struct io_uring_sqe {
 			__u16	addr_len;
 			__u16	__pad3[1];
 		};
+		struct {
+			__u16	write_hint;
+			__u16	__pad4[1];
+		};
 	};
 	union {
 		struct {
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 8080ffd6d5712..5a1231bfecc3a 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -279,7 +279,8 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		rw->kiocb.ki_ioprio = get_current_ioprio();
 	}
 	rw->kiocb.dio_complete = NULL;
-
+	if (ddir == ITER_SOURCE)
+		rw->kiocb.ki_write_hint = READ_ONCE(sqe->write_hint);
 	rw->addr = READ_ONCE(sqe->addr);
 	rw->len = READ_ONCE(sqe->len);
 	rw->flags = READ_ONCE(sqe->rw_flags);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCHv9 6/7] nvme: enable FDP support
  2024-10-25 21:36 [PATCHv9 0/7] write hints with nvme fdp, scsi streams Keith Busch
                   ` (4 preceding siblings ...)
  2024-10-25 21:36 ` [PATCHv9 5/7] io_uring: enable per-io hinting capability Keith Busch
@ 2024-10-25 21:36 ` Keith Busch
  2024-10-25 21:36 ` [PATCHv9 7/7] scsi: set permanent stream count in block limits Keith Busch
  2024-10-28 11:49 ` [PATCHv9 0/7] write hints with nvme fdp, scsi streams Christoph Hellwig
  7 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2024-10-25 21:36 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Hui Qi,
	Nitesh Shetty, Hannes Reinecke, Keith Busch

From: Kanchan Joshi <[email protected]>

Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
to control the placement of logical blocks so as to reduce the SSD WAF.
Userspace can send the write hint information using io_uring or fcntl.

Fetch the placement-identifiers if the device supports FDP. The incoming
write-hint is mapped to a placement-identifier, which in turn is set in
the DSPEC field of the write command.

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Hui Qi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Nacked-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
 drivers/nvme/host/core.c | 82 ++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  5 +++
 include/linux/nvme.h     | 19 ++++++++++
 3 files changed, 106 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 84cb859a911d0..36c2b9be8eee7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -44,6 +44,20 @@ struct nvme_ns_info {
 	bool is_removed;
 };
 
+struct nvme_fdp_ruh_status_desc {
+	u16 pid;
+	u16 ruhid;
+	u32 earutr;
+	u64 ruamw;
+	u8  rsvd16[16];
+};
+
+struct nvme_fdp_ruh_status {
+	u8  rsvd0[14];
+	__le16 nruhsd;
+	struct nvme_fdp_ruh_status_desc ruhsd[];
+};
+
 unsigned int admin_timeout = 60;
 module_param(admin_timeout, uint, 0644);
 MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
@@ -657,6 +671,7 @@ static void nvme_free_ns_head(struct kref *ref)
 	ida_free(&head->subsys->ns_ida, head->instance);
 	cleanup_srcu_struct(&head->srcu);
 	nvme_put_subsystem(head->subsys);
+	kfree(head->plids);
 	kfree(head);
 }
 
@@ -974,6 +989,13 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 	if (req->cmd_flags & REQ_RAHEAD)
 		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
 
+	if (req->write_hint && ns->head->nr_plids) {
+		u16 hint = max(req->write_hint, ns->head->nr_plids);
+
+		dsmgmt |= ns->head->plids[hint - 1] << 16;
+		control |= NVME_RW_DTYPE_DPLCMT;
+	}
+
 	if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req))
 		return BLK_STS_INVAL;
 
@@ -2105,6 +2127,52 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
 	return ret;
 }
 
+static int nvme_fetch_fdp_plids(struct nvme_ns *ns, u32 nsid)
+{
+	struct nvme_fdp_ruh_status_desc *ruhsd;
+	struct nvme_ns_head *head = ns->head;
+	struct nvme_fdp_ruh_status *ruhs;
+	struct nvme_command c = {};
+	int size, ret, i;
+
+	if (head->plids)
+		return 0;
+
+	size = struct_size(ruhs, ruhsd, NVME_MAX_PLIDS);
+	ruhs = kzalloc(size, GFP_KERNEL);
+	if (!ruhs)
+		return -ENOMEM;
+
+	c.imr.opcode = nvme_cmd_io_mgmt_recv;
+	c.imr.nsid = cpu_to_le32(nsid);
+	c.imr.mo = 0x1;
+	c.imr.numd =  cpu_to_le32((size >> 2) - 1);
+
+	ret = nvme_submit_sync_cmd(ns->queue, &c, ruhs, size);
+	if (ret)
+		goto out;
+
+	i = le16_to_cpu(ruhs->nruhsd);
+	if (!i)
+		goto out;
+
+	ns->head->nr_plids = min_t(u16, i, NVME_MAX_PLIDS);
+	head->plids = kcalloc(ns->head->nr_plids, sizeof(head->plids),
+			      GFP_KERNEL);
+	if (!head->plids) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	for (i = 0; i < ns->head->nr_plids; i++) {
+		ruhsd = &ruhs->ruhsd[i];
+		head->plids[i] = le16_to_cpu(ruhsd->pid);
+	}
+out:
+	kfree(ruhs);
+	return ret;
+}
+
 static int nvme_update_ns_info_block(struct nvme_ns *ns,
 		struct nvme_ns_info *info)
 {
@@ -2141,6 +2209,19 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 			goto out;
 	}
 
+	if (ns->ctrl->ctratt & NVME_CTRL_ATTR_FDPS) {
+		ret = nvme_fetch_fdp_plids(ns, info->nsid);
+		if (ret)
+			dev_warn(ns->ctrl->device,
+				"FDP failure status:0x%x\n", ret);
+		if (ret < 0)
+			goto out;
+	} else {
+		ns->head->nr_plids = 0;
+		kfree(ns->head->plids);
+		ns->head->plids = NULL;
+	}
+
 	blk_mq_freeze_queue(ns->disk->queue);
 	ns->head->lba_shift = id->lbaf[lbaf].ds;
 	ns->head->nuse = le64_to_cpu(id->nuse);
@@ -2171,6 +2252,7 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 	if (!nvme_init_integrity(ns->head, &lim, info))
 		capacity = 0;
 
+	lim.max_write_hints = ns->head->nr_plids;
 	ret = queue_limits_commit_update(ns->disk->queue, &lim);
 	if (ret) {
 		blk_mq_unfreeze_queue(ns->disk->queue);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 093cb423f536b..cec8e5d96377b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -454,6 +454,8 @@ struct nvme_ns_ids {
 	u8	csi;
 };
 
+#define NVME_MAX_PLIDS   (NVME_CTRL_PAGE_SIZE / sizeof(16))
+
 /*
  * Anchor structure for namespaces.  There is one for each namespace in a
  * NVMe subsystem that any of our controllers can see, and the namespace
@@ -490,6 +492,9 @@ struct nvme_ns_head {
 	struct device		cdev_device;
 
 	struct gendisk		*disk;
+
+	u16                     nr_plids;
+	u16			*plids;
 #ifdef CONFIG_NVME_MULTIPATH
 	struct bio_list		requeue_list;
 	spinlock_t		requeue_lock;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b58d9405d65e0..a954eaee5b0f3 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -275,6 +275,7 @@ enum nvme_ctrl_attr {
 	NVME_CTRL_ATTR_HID_128_BIT	= (1 << 0),
 	NVME_CTRL_ATTR_TBKAS		= (1 << 6),
 	NVME_CTRL_ATTR_ELBAS		= (1 << 15),
+	NVME_CTRL_ATTR_FDPS		= (1 << 19),
 };
 
 struct nvme_id_ctrl {
@@ -843,6 +844,7 @@ enum nvme_opcode {
 	nvme_cmd_resv_register	= 0x0d,
 	nvme_cmd_resv_report	= 0x0e,
 	nvme_cmd_resv_acquire	= 0x11,
+	nvme_cmd_io_mgmt_recv	= 0x12,
 	nvme_cmd_resv_release	= 0x15,
 	nvme_cmd_zone_mgmt_send	= 0x79,
 	nvme_cmd_zone_mgmt_recv	= 0x7a,
@@ -864,6 +866,7 @@ enum nvme_opcode {
 		nvme_opcode_name(nvme_cmd_resv_register),	\
 		nvme_opcode_name(nvme_cmd_resv_report),		\
 		nvme_opcode_name(nvme_cmd_resv_acquire),	\
+		nvme_opcode_name(nvme_cmd_io_mgmt_recv),	\
 		nvme_opcode_name(nvme_cmd_resv_release),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_send),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_recv),	\
@@ -1015,6 +1018,7 @@ enum {
 	NVME_RW_PRINFO_PRCHK_GUARD	= 1 << 12,
 	NVME_RW_PRINFO_PRACT		= 1 << 13,
 	NVME_RW_DTYPE_STREAMS		= 1 << 4,
+	NVME_RW_DTYPE_DPLCMT		= 2 << 4,
 	NVME_WZ_DEAC			= 1 << 9,
 };
 
@@ -1102,6 +1106,20 @@ struct nvme_zone_mgmt_recv_cmd {
 	__le32			cdw14[2];
 };
 
+struct nvme_io_mgmt_recv_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__le64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__u8			mo;
+	__u8			rsvd11;
+	__u16			mos;
+	__le32			numd;
+	__le32			cdw12[4];
+};
+
 enum {
 	NVME_ZRA_ZONE_REPORT		= 0,
 	NVME_ZRASF_ZONE_REPORT_ALL	= 0,
@@ -1822,6 +1840,7 @@ struct nvme_command {
 		struct nvmf_auth_receive_command auth_receive;
 		struct nvme_dbbuf dbbuf;
 		struct nvme_directive_cmd directive;
+		struct nvme_io_mgmt_recv_cmd imr;
 	};
 };
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCHv9 7/7] scsi: set permanent stream count in block limits
  2024-10-25 21:36 [PATCHv9 0/7] write hints with nvme fdp, scsi streams Keith Busch
                   ` (5 preceding siblings ...)
  2024-10-25 21:36 ` [PATCHv9 6/7] nvme: enable FDP support Keith Busch
@ 2024-10-25 21:36 ` Keith Busch
  2024-10-28 16:13   ` Bart Van Assche
  2024-10-29  7:10   ` Hannes Reinecke
  2024-10-28 11:49 ` [PATCHv9 0/7] write hints with nvme fdp, scsi streams Christoph Hellwig
  7 siblings, 2 replies; 24+ messages in thread
From: Keith Busch @ 2024-10-25 21:36 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

From: Keith Busch <[email protected]>

The block limits exports the number of write hints, so set this limit if
the device reports support for the lifetime hints. Not only does this
inform the user of which hints are possible, it also allows scsi devices
supporting the feature to utilize the full range through raw block
device direct-io.

Signed-off-by: Keith Busch <[email protected]>
---
 drivers/scsi/sd.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index ca4bc0ac76adc..235dd6e5b6688 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3768,6 +3768,8 @@ static int sd_revalidate_disk(struct gendisk *disk)
 		sd_config_protection(sdkp, &lim);
 	}
 
+	lim.max_write_hints = sdkp->permanent_stream_count;
+
 	/*
 	 * We now have all cache related info, determine how we deal
 	 * with flush requests.
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 0/7] write hints with nvme fdp, scsi streams
  2024-10-25 21:36 [PATCHv9 0/7] write hints with nvme fdp, scsi streams Keith Busch
                   ` (6 preceding siblings ...)
  2024-10-25 21:36 ` [PATCHv9 7/7] scsi: set permanent stream count in block limits Keith Busch
@ 2024-10-28 11:49 ` Christoph Hellwig
  7 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-10-28 11:49 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
	joshi.k, javier.gonz, bvanassche, Keith Busch

On Fri, Oct 25, 2024 at 02:36:38PM -0700, Keith Busch wrote:
> Upfront, I really didn't get the feedback about setting different flags
> for stream vs. temperature support. Who wants to use it, and where and
> how is that information used?

The SBC permanent streams have a 'relative lifetime' associated with
them.  So if you expose them as plain write streams you violate the
contract with the device.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 2/7] block: introduce max_write_hints queue limit
  2024-10-25 21:36 ` [PATCHv9 2/7] block: introduce max_write_hints queue limit Keith Busch
@ 2024-10-28 11:51   ` Christoph Hellwig
  2024-10-28 11:52     ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-10-28 11:51 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
	joshi.k, javier.gonz, bvanassche, Keith Busch, Hannes Reinecke

On Fri, Oct 25, 2024 at 02:36:40PM -0700, Keith Busch wrote:
> +static inline unsigned short bdev_max_write_hints(struct block_device *bdev)
> +{
> +	return queue_max_write_hints(bdev_get_queue(bdev));
> +}

As pointed out by Bart last time, you can't simply give the write hints
to all block device.  Assume we'd want to wire up the write stream based
separate to f2fs (which btw would be a good demonstration), and you'd
have two different f2fs file systems on separate partitions that'd
now start sharing the write streams if they simply started from stream
1.  Same for our pending XFS data placement work.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 2/7] block: introduce max_write_hints queue limit
  2024-10-28 11:51   ` Christoph Hellwig
@ 2024-10-28 11:52     ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-10-28 11:52 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
	joshi.k, javier.gonz, bvanassche, Keith Busch, Hannes Reinecke

On Mon, Oct 28, 2024 at 12:51:32PM +0100, Christoph Hellwig wrote:
> As pointed out by Bart last time, you can't simply give the write hints
> to all block device.  Assume we'd want to wire up the write stream based
> separate to f2fs (which btw would be a good demonstration), and you'd
> have two different f2fs file systems on separate partitions that'd
> now start sharing the write streams if they simply started from stream
> 1.  Same for our pending XFS data placement work.

And I'm an idiot and should have looked at the next patch patch first.
Sorry for that.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 3/7] block: allow ability to limit partition write hints
  2024-10-25 21:36 ` [PATCHv9 3/7] block: allow ability to limit partition write hints Keith Busch
@ 2024-10-28 11:58   ` Christoph Hellwig
  2024-10-28 14:49     ` Keith Busch
  2024-10-28 14:40   ` Kanchan Joshi
  2024-10-28 18:27   ` Bart Van Assche
  2 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-10-28 11:58 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
	joshi.k, javier.gonz, bvanassche, Keith Busch

On Fri, Oct 25, 2024 at 02:36:41PM -0700, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> When multiple partitions are used, you may want to enforce different
> subsets of the available write hints for each partition. Provide a
> bitmap attribute of the available write hints, and allow an admin to
> write a different mask to set the partition's allowed write hints.

Trying my best Greg impersonator voice: This needs to be documented
in Documentation/ABI/stable/sysfs-block.

That would have also helped me understanding it.  AFAIK the split here
is an opt-in, which means the use case I explained in the previous
case would still not work out of the box, right?

> +	max_write_hints = bdev_max_write_hints(bdev);
> +	if (max_write_hints) {
> +		int size = BITS_TO_LONGS(max_write_hints) * sizeof(long);
> +
> +		bdev->write_hint_mask = kmalloc(size, GFP_KERNEL);
> +		if (!bdev->write_hint_mask) {
> +			free_percpu(bdev->bd_stats);
> +			iput(inode);
> +			return NULL;
> +		}
> +		memset(bdev->write_hint_mask, 0xff, size);
> +	}

This could simply use bitmap_alloc().  Similarly the other uses
would probably benefit from using the bitmap API.

> +	struct block_device *bdev = dev_to_bdev(dev);
> +	unsigned short max_write_hints = bdev_max_write_hints(bdev);
> +
> +	if (max_write_hints)
> +		return sprintf(buf, "%*pb\n", max_write_hints, bdev->write_hint_mask);
> +	else
> +		return sprintf(buf, "0");

No need for the else.  And if you write this as:

	if (!max_write_hints)
		return sprintf(buf, "0");
	return sprintf(buf, "%*pb\n", max_write_hints, bdev->write_hint_mask);

you'd also avoid the overly long line.

> +
> +static ssize_t part_write_hint_mask_store(struct device *dev,
> +					  struct device_attribute *attr,
> +					  const char *buf, size_t count)
> +{
> +	struct block_device *bdev = dev_to_bdev(dev);
> +	unsigned short max_write_hints = bdev_max_write_hints(bdev);
> +	unsigned long *new_mask;
> +	int size;
> +
> +	if (!max_write_hints)
> +		return count;
> +
> +	size = BITS_TO_LONGS(max_write_hints) * sizeof(long);
> +	new_mask = kzalloc(size, GFP_KERNEL);
> +	if (!new_mask)
> +		return -ENOMEM;
> +
> +	bitmap_parse(buf, count, new_mask, max_write_hints);
> +	bitmap_copy(bdev->write_hint_mask, new_mask, max_write_hints);

What protects access to bdev->write_hint_mask?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 4/7] block, fs: add write hint to kiocb
  2024-10-25 21:36 ` [PATCHv9 4/7] block, fs: add write hint to kiocb Keith Busch
@ 2024-10-28 11:59   ` Christoph Hellwig
  2024-10-28 14:38     ` Keith Busch
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2024-10-28 11:59 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
	joshi.k, javier.gonz, bvanassche, Keith Busch

On Fri, Oct 25, 2024 at 02:36:42PM -0700, Keith Busch wrote:
> +static u16 blkdev_write_hint(struct kiocb *iocb, struct block_device *bdev)
> +{
> +	u16 hint = iocb->ki_write_hint;
> +
> +	if (!hint)
> +		return file_inode(iocb->ki_filp)->i_write_hint;
> +
> +	if (hint > bdev_max_write_hints(bdev))
> +		return file_inode(iocb->ki_filp)->i_write_hint;
> +
> +	if (bdev_is_partition(bdev) &&
> +	    !test_bit(hint - 1, bdev->write_hint_mask))
> +		return file_inode(iocb->ki_filp)->i_write_hint;

I would have expected an error when using an invalid stream identifier.

That of course requires telling the application how many are available
through e.g. statx as requested last time.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 4/7] block, fs: add write hint to kiocb
  2024-10-28 11:59   ` Christoph Hellwig
@ 2024-10-28 14:38     ` Keith Busch
  2024-10-28 16:08       ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2024-10-28 14:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, joshi.k, javier.gonz, bvanassche

On Mon, Oct 28, 2024 at 12:59:32PM +0100, Christoph Hellwig wrote:
> On Fri, Oct 25, 2024 at 02:36:42PM -0700, Keith Busch wrote:
> > +static u16 blkdev_write_hint(struct kiocb *iocb, struct block_device *bdev)
> > +{
> > +	u16 hint = iocb->ki_write_hint;
> > +
> > +	if (!hint)
> > +		return file_inode(iocb->ki_filp)->i_write_hint;
> > +
> > +	if (hint > bdev_max_write_hints(bdev))
> > +		return file_inode(iocb->ki_filp)->i_write_hint;
> > +
> > +	if (bdev_is_partition(bdev) &&
> > +	    !test_bit(hint - 1, bdev->write_hint_mask))
> > +		return file_inode(iocb->ki_filp)->i_write_hint;
> 
> I would have expected an error when using an invalid stream identifier.

It's a hint. fcntl doesn't error if you give an unusable hint, so
neither should this. You get sane default behavior.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 3/7] block: allow ability to limit partition write hints
  2024-10-25 21:36 ` [PATCHv9 3/7] block: allow ability to limit partition write hints Keith Busch
  2024-10-28 11:58   ` Christoph Hellwig
@ 2024-10-28 14:40   ` Kanchan Joshi
  2024-10-28 18:27   ` Bart Van Assche
  2 siblings, 0 replies; 24+ messages in thread
From: Kanchan Joshi @ 2024-10-28 14:40 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, javier.gonz, bvanassche, Keith Busch

On 10/26/2024 3:06 AM, Keith Busch wrote:
> +static ssize_t part_write_hint_mask_store(struct device *dev,
> +					  struct device_attribute *attr,
> +					  const char *buf, size_t count)
> +{
> +	struct block_device *bdev = dev_to_bdev(dev);
> +	unsigned short max_write_hints = bdev_max_write_hints(bdev);
> +	unsigned long *new_mask;
> +	int size;
> +
> +	if (!max_write_hints)
> +		return count;
> +
> +	size = BITS_TO_LONGS(max_write_hints) * sizeof(long);
> +	new_mask = kzalloc(size, GFP_KERNEL);
> +	if (!new_mask)
> +		return -ENOMEM;
> +
> +	bitmap_parse(buf, count, new_mask, max_write_hints);
> +	bitmap_copy(bdev->write_hint_mask, new_mask, max_write_hints);

kfree(new_mask) here.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 3/7] block: allow ability to limit partition write hints
  2024-10-28 11:58   ` Christoph Hellwig
@ 2024-10-28 14:49     ` Keith Busch
  0 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2024-10-28 14:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, joshi.k, javier.gonz, bvanassche

On Mon, Oct 28, 2024 at 12:58:05PM +0100, Christoph Hellwig wrote:
> On Fri, Oct 25, 2024 at 02:36:41PM -0700, Keith Busch wrote:
> > From: Keith Busch <[email protected]>
> > 
> > When multiple partitions are used, you may want to enforce different
> > subsets of the available write hints for each partition. Provide a
> > bitmap attribute of the available write hints, and allow an admin to
> > write a different mask to set the partition's allowed write hints.
> 
> Trying my best Greg impersonator voice: This needs to be documented
> in Documentation/ABI/stable/sysfs-block.
> 
> That would have also helped me understanding it.  AFAIK the split here
> is an opt-in, which means the use case I explained in the previous
> case would still not work out of the box, right?

Right.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 4/7] block, fs: add write hint to kiocb
  2024-10-28 14:38     ` Keith Busch
@ 2024-10-28 16:08       ` Christoph Hellwig
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-10-28 16:08 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
	linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
	bvanassche

On Mon, Oct 28, 2024 at 08:38:05AM -0600, Keith Busch wrote:
> > > +	if (bdev_is_partition(bdev) &&
> > > +	    !test_bit(hint - 1, bdev->write_hint_mask))
> > > +		return file_inode(iocb->ki_filp)->i_write_hint;
> > 
> > I would have expected an error when using an invalid stream identifier.
> 
> It's a hint. fcntl doesn't error if you give an unusable hint, so
> neither should this. You get sane default behavior.

Well, why does it have to be a hint?

If I have a data placement aware application I really want to know
into how many buckets I can sort and adjust my algorithm for it.
And I'd rather have an error checked interface to pass that down
as far as I can.

Same for my file system use case.  I guess you have a use case where
a hint would be enough, at least with good enough knowledge of the
underlying implementation.  But would there be an actual downside
in not having a hint?  Because historically speaking everything we've
done as a not error checked vaguely defined hint has not been all
the useful, but it in hardware or software interfaces.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 7/7] scsi: set permanent stream count in block limits
  2024-10-25 21:36 ` [PATCHv9 7/7] scsi: set permanent stream count in block limits Keith Busch
@ 2024-10-28 16:13   ` Bart Van Assche
  2024-10-29  7:10   ` Hannes Reinecke
  1 sibling, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2024-10-28 16:13 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, Keith Busch

On 10/25/24 2:36 PM, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> The block limits exports the number of write hints, so set this limit if
> the device reports support for the lifetime hints. Not only does this
> inform the user of which hints are possible, it also allows scsi devices
> supporting the feature to utilize the full range through raw block
> device direct-io.
> 
> Signed-off-by: Keith Busch <[email protected]>
> ---
>   drivers/scsi/sd.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index ca4bc0ac76adc..235dd6e5b6688 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3768,6 +3768,8 @@ static int sd_revalidate_disk(struct gendisk *disk)
>   		sd_config_protection(sdkp, &lim);
>   	}
>   
> +	lim.max_write_hints = sdkp->permanent_stream_count;
> +
>   	/*
>   	 * We now have all cache related info, determine how we deal
>   	 * with flush requests.

Reviewed-by: Bart Van Assche <[email protected]>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 1/7] block: use generic u16 for write hints
  2024-10-25 21:36 ` [PATCHv9 1/7] block: use generic u16 for write hints Keith Busch
@ 2024-10-28 18:19   ` Bart Van Assche
  2024-10-28 18:38     ` Keith Busch
  0 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2024-10-28 18:19 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, Keith Busch,
	Hannes Reinecke

On 10/25/24 2:36 PM, Keith Busch wrote:
> This is still backwards compatible with lifetime hints. It just doesn't
> constrain the hints to that definition.

Since struct bio is modified, and since it is important not to increase
the size of struct bio, some comments about whether or not the size of
struct bio is affected would be welcome.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 3/7] block: allow ability to limit partition write hints
  2024-10-25 21:36 ` [PATCHv9 3/7] block: allow ability to limit partition write hints Keith Busch
  2024-10-28 11:58   ` Christoph Hellwig
  2024-10-28 14:40   ` Kanchan Joshi
@ 2024-10-28 18:27   ` Bart Van Assche
  2024-10-28 19:46     ` Keith Busch
  2 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2024-10-28 18:27 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, Keith Busch

On 10/25/24 2:36 PM, Keith Busch wrote:
> When multiple partitions are used, you may want to enforce different
> subsets of the available write hints for each partition. Provide a
> bitmap attribute of the available write hints, and allow an admin to
> write a different mask to set the partition's allowed write hints.

After /proc/irq/*/smp_affinity was introduced (a bitmask),
/proc/irq/*/smp_affinity_list (set of ranges) was introduced as a more
user-friendly alternative. Is the same expected to happen with the
write_hint_mask? If so, shouldn't we skip the bitmask user space
interface and directly introduce the more user friendly interface (set
of ranges)?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 1/7] block: use generic u16 for write hints
  2024-10-28 18:19   ` Bart Van Assche
@ 2024-10-28 18:38     ` Keith Busch
  0 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2024-10-28 18:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, hch, joshi.k, javier.gonz, Hannes Reinecke

On Mon, Oct 28, 2024 at 11:19:33AM -0700, Bart Van Assche wrote:
> On 10/25/24 2:36 PM, Keith Busch wrote:
> > This is still backwards compatible with lifetime hints. It just doesn't
> > constrain the hints to that definition.
> 
> Since struct bio is modified, and since it is important not to increase
> the size of struct bio, some comments about whether or not the size of
> struct bio is affected would be welcome.

Sure. The type just shrinks a hole from 2 bytes to 1. Total size remains
the same.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 3/7] block: allow ability to limit partition write hints
  2024-10-28 18:27   ` Bart Van Assche
@ 2024-10-28 19:46     ` Keith Busch
  0 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2024-10-28 19:46 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
	linux-fsdevel, hch, joshi.k, javier.gonz

On Mon, Oct 28, 2024 at 11:27:33AM -0700, Bart Van Assche wrote:
> On 10/25/24 2:36 PM, Keith Busch wrote:
> > When multiple partitions are used, you may want to enforce different
> > subsets of the available write hints for each partition. Provide a
> > bitmap attribute of the available write hints, and allow an admin to
> > write a different mask to set the partition's allowed write hints.
> 
> After /proc/irq/*/smp_affinity was introduced (a bitmask),
> /proc/irq/*/smp_affinity_list (set of ranges) was introduced as a more
> user-friendly alternative. Is the same expected to happen with the
> write_hint_mask? If so, shouldn't we skip the bitmask user space
> interface and directly introduce the more user friendly interface (set
> of ranges)?

I don't much of have an opinion either way. One thing I like for the
bitmask representation is you write 0 to turn it off vs. the list type
writes a null string. Writing 0 to disable just feels more natural to
me, but not a big deal.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 7/7] scsi: set permanent stream count in block limits
  2024-10-25 21:36 ` [PATCHv9 7/7] scsi: set permanent stream count in block limits Keith Busch
  2024-10-28 16:13   ` Bart Van Assche
@ 2024-10-29  7:10   ` Hannes Reinecke
  1 sibling, 0 replies; 24+ messages in thread
From: Hannes Reinecke @ 2024-10-29  7:10 UTC (permalink / raw)
  To: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring
  Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch

On 10/25/24 23:36, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> The block limits exports the number of write hints, so set this limit if
> the device reports support for the lifetime hints. Not only does this
> inform the user of which hints are possible, it also allows scsi devices
> supporting the feature to utilize the full range through raw block
> device direct-io.
> 
> Signed-off-by: Keith Busch <[email protected]>
> ---
>   drivers/scsi/sd.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index ca4bc0ac76adc..235dd6e5b6688 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3768,6 +3768,8 @@ static int sd_revalidate_disk(struct gendisk *disk)
>   		sd_config_protection(sdkp, &lim);
>   	}
>   
> +	lim.max_write_hints = sdkp->permanent_stream_count;
> +
>   	/*
>   	 * We now have all cache related info, determine how we deal
>   	 * with flush requests.
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
[email protected]                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHv9 5/7] io_uring: enable per-io hinting capability
  2024-10-25 21:36 ` [PATCHv9 5/7] io_uring: enable per-io hinting capability Keith Busch
@ 2024-10-29 12:46   ` Anuj gupta
  0 siblings, 0 replies; 24+ messages in thread
From: Anuj gupta @ 2024-10-29 12:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
	joshi.k, javier.gonz, bvanassche, Hannes Reinecke, Nitesh Shetty,
	Keith Busch

On Sat, Oct 26, 2024 at 3:13 AM Keith Busch <[email protected]> wrote:
>
> From: Kanchan Joshi <[email protected]>
>
> With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and
> all the subsequent writes on the file pass that hint value down. This
> can be limiting for block device as all the writes will be tagged with
> only one lifetime hint value. Concurrent writes (with different hint
> values) are hard to manage. Per-IO hinting solves that problem.
>
> Allow userspace to pass additional metadata in the SQE.
>
>         __u16 write_hint;
>
> If the hint is provided, filesystems may optionally use it. A filesytem
> may ignore this field if it does not support per-io hints, or if the
> value is invalid for its backing storage. Just like the inode hints,
> requesting values that are not supported by the hardware are not an
> error.
>
> Reviewed-by: Hannes Reinecke <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
> Signed-off-by: Keith Busch <[email protected]>
> ---
>  include/uapi/linux/io_uring.h | 4 ++++
>  io_uring/rw.c                 | 3 ++-
>  2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 60b9c98595faf..8cdcc461d464c 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -92,6 +92,10 @@ struct io_uring_sqe {
>                         __u16   addr_len;
>                         __u16   __pad3[1];
>                 };
> +               struct {
> +                       __u16   write_hint;
> +                       __u16   __pad4[1];
> +               };
>         };
>         union {
>                 struct {
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 8080ffd6d5712..5a1231bfecc3a 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -279,7 +279,8 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>                 rw->kiocb.ki_ioprio = get_current_ioprio();
>         }
>         rw->kiocb.dio_complete = NULL;
> -
> +       if (ddir == ITER_SOURCE)
> +               rw->kiocb.ki_write_hint = READ_ONCE(sqe->write_hint);
>         rw->addr = READ_ONCE(sqe->addr);
>         rw->len = READ_ONCE(sqe->len);
>         rw->flags = READ_ONCE(sqe->rw_flags);
> --
> 2.43.5
>

Since this patch adds a couple of new fields, it makes sense to add
BUILD_BUG_ON() checks in io_uring_init for these fields to assert the
layout of struct io_uring_sqe. And probably a zero check for pad4 in
io_prep_rw.
--
Anuj Gupta

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-10-29 12:47 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-25 21:36 [PATCHv9 0/7] write hints with nvme fdp, scsi streams Keith Busch
2024-10-25 21:36 ` [PATCHv9 1/7] block: use generic u16 for write hints Keith Busch
2024-10-28 18:19   ` Bart Van Assche
2024-10-28 18:38     ` Keith Busch
2024-10-25 21:36 ` [PATCHv9 2/7] block: introduce max_write_hints queue limit Keith Busch
2024-10-28 11:51   ` Christoph Hellwig
2024-10-28 11:52     ` Christoph Hellwig
2024-10-25 21:36 ` [PATCHv9 3/7] block: allow ability to limit partition write hints Keith Busch
2024-10-28 11:58   ` Christoph Hellwig
2024-10-28 14:49     ` Keith Busch
2024-10-28 14:40   ` Kanchan Joshi
2024-10-28 18:27   ` Bart Van Assche
2024-10-28 19:46     ` Keith Busch
2024-10-25 21:36 ` [PATCHv9 4/7] block, fs: add write hint to kiocb Keith Busch
2024-10-28 11:59   ` Christoph Hellwig
2024-10-28 14:38     ` Keith Busch
2024-10-28 16:08       ` Christoph Hellwig
2024-10-25 21:36 ` [PATCHv9 5/7] io_uring: enable per-io hinting capability Keith Busch
2024-10-29 12:46   ` Anuj gupta
2024-10-25 21:36 ` [PATCHv9 6/7] nvme: enable FDP support Keith Busch
2024-10-25 21:36 ` [PATCHv9 7/7] scsi: set permanent stream count in block limits Keith Busch
2024-10-28 16:13   ` Bart Van Assche
2024-10-29  7:10   ` Hannes Reinecke
2024-10-28 11:49 ` [PATCHv9 0/7] write hints with nvme fdp, scsi streams Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox