* [PATCHv10 1/9] block: use generic u16 for write hints
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, io-uring
Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch
From: Keith Busch <[email protected]>
This is still backwards compatible with lifetime hints. It just doesn't
constrain the hints to that definition. Using this type doesn't change
the size of either bio or request.
Signed-off-by: Keith Busch <[email protected]>
---
include/linux/blk-mq.h | 3 +--
include/linux/blk_types.h | 3 +--
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2035fad3131fb..08ed7b5c4dbbf 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -8,7 +8,6 @@
#include <linux/scatterlist.h>
#include <linux/prefetch.h>
#include <linux/srcu.h>
-#include <linux/rw_hint.h>
struct blk_mq_tags;
struct blk_flush_queue;
@@ -156,7 +155,7 @@ struct request {
struct blk_crypto_keyslot *crypt_keyslot;
#endif
- enum rw_hint write_hint;
+ unsigned short write_hint;
unsigned short ioprio;
enum mq_rq_state state;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index dce7615c35e7e..6737795220e18 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -10,7 +10,6 @@
#include <linux/bvec.h>
#include <linux/device.h>
#include <linux/ktime.h>
-#include <linux/rw_hint.h>
struct bio_set;
struct bio;
@@ -219,7 +218,7 @@ struct bio {
*/
unsigned short bi_flags; /* BIO_* below */
unsigned short bi_ioprio;
- enum rw_hint bi_write_hint;
+ unsigned short bi_write_hint;
blk_status_t bi_status;
atomic_t __bi_remaining;
--
2.43.5
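As a quick illustration of the backward-compatibility claim above (a sketch, not part of the patch): the existing lifetime hints are small integers (the RWH_WRITE_LIFE_* values run from 0 to 5), so assigning them to the now generic unsigned short fields behaves exactly as before while leaving room for other hint schemes.

/* Illustrative kernel-side sketch only; callers now include rw_hint.h themselves. */
static void example_set_lifetime_hint(struct bio *bio)
{
	/* WRITE_LIFE_MEDIUM is 3; any lifetime hint fits easily in a u16 */
	bio->bi_write_hint = WRITE_LIFE_MEDIUM;
}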
* Re: [PATCHv10 1/9] block: use generic u16 for write hints
From: Bart Van Assche @ 2024-10-29 17:21 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring
Cc: linux-fsdevel, hch, joshi.k, javier.gonz, Keith Busch
On 10/29/24 8:19 AM, Keith Busch wrote:
> This is still backwards compatible with lifetime hints. It just doesn't
> constrain the hints to that definition. Using this type doesn't change
> the size of either bio or request.
Reviewed-by: Bart Van Assche <[email protected]>
* [PATCHv10 2/9] block: introduce max_write_hints queue limit
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, io-uring
Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch
From: Keith Busch <[email protected]>
Drivers for hardware that supports write streams need a way to export how
many are available so that applications can query this generically.
Signed-off-by: Keith Busch <[email protected]>
---
Documentation/ABI/stable/sysfs-block | 7 +++++++
block/blk-settings.c | 3 +++
block/blk-sysfs.c | 3 +++
include/linux/blkdev.h | 12 ++++++++++++
4 files changed, 25 insertions(+)
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 8353611107154..f2db2cabb8e75 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -506,6 +506,13 @@ Description:
[RO] Maximum size in bytes of a single element in a DMA
scatter/gather list.
+What: /sys/block/<disk>/queue/max_write_hints
+Date: October 2024
+Contact: [email protected]
+Description:
+ [RO] Maximum number of write hints supported, 0 if not
+ supported. If supported, valid values are 1 through
+ max_write_hints, inclusive.
What: /sys/block/<disk>/queue/max_segments
Date: March 2010
diff --git a/block/blk-settings.c b/block/blk-settings.c
index a446654ddee5e..921fb4d334fa4 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -43,6 +43,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
/* Inherit limits from component devices */
+ lim->max_write_hints = USHRT_MAX;
lim->max_segments = USHRT_MAX;
lim->max_discard_segments = USHRT_MAX;
lim->max_hw_sectors = UINT_MAX;
@@ -544,6 +545,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->max_segment_size = min_not_zero(t->max_segment_size,
b->max_segment_size);
+ t->max_write_hints = min(t->max_write_hints, b->max_write_hints);
+
alignment = queue_limit_alignment_offset(b, start);
/* Bottom device has different alignment. Check that it is
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 741b95dfdbf6f..85f48ca461049 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -104,6 +104,7 @@ QUEUE_SYSFS_LIMIT_SHOW(max_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_discard_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_integrity_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_segment_size)
+QUEUE_SYSFS_LIMIT_SHOW(max_write_hints)
QUEUE_SYSFS_LIMIT_SHOW(logical_block_size)
QUEUE_SYSFS_LIMIT_SHOW(physical_block_size)
QUEUE_SYSFS_LIMIT_SHOW(chunk_sectors)
@@ -457,6 +458,7 @@ QUEUE_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb");
QUEUE_RO_ENTRY(queue_max_segments, "max_segments");
QUEUE_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
QUEUE_RO_ENTRY(queue_max_segment_size, "max_segment_size");
+QUEUE_RO_ENTRY(queue_max_write_hints, "max_write_hints");
QUEUE_RW_LOAD_MODULE_ENTRY(elv_iosched, "scheduler");
QUEUE_RO_ENTRY(queue_logical_block_size, "logical_block_size");
@@ -591,6 +593,7 @@ static struct attribute *queue_attrs[] = {
&queue_max_discard_segments_entry.attr,
&queue_max_integrity_segments_entry.attr,
&queue_max_segment_size_entry.attr,
+ &queue_max_write_hints_entry.attr,
&queue_hw_sector_size_entry.attr,
&queue_logical_block_size_entry.attr,
&queue_physical_block_size_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d0a52ed05e60c..90a19ea889276 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -394,6 +394,8 @@ struct queue_limits {
unsigned short max_integrity_segments;
unsigned short max_discard_segments;
+ unsigned short max_write_hints;
+
unsigned int max_open_zones;
unsigned int max_active_zones;
@@ -1189,6 +1191,11 @@ static inline unsigned short queue_max_segments(const struct request_queue *q)
return q->limits.max_segments;
}
+static inline unsigned short queue_max_write_hints(struct request_queue *q)
+{
+ return q->limits.max_write_hints;
+}
+
static inline unsigned short queue_max_discard_segments(const struct request_queue *q)
{
return q->limits.max_discard_segments;
@@ -1236,6 +1243,11 @@ static inline unsigned int bdev_max_segments(struct block_device *bdev)
return queue_max_segments(bdev_get_queue(bdev));
}
+static inline unsigned short bdev_max_write_hints(struct block_device *bdev)
+{
+ return queue_max_write_hints(bdev_get_queue(bdev));
+}
+
static inline unsigned queue_logical_block_size(const struct request_queue *q)
{
return q->limits.logical_block_size;
--
2.43.5
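For illustration, a minimal userspace sketch that reads the new limit from sysfs; the disk name is an arbitrary example and error handling is kept minimal:

#include <stdio.h>

int main(void)
{
	unsigned int max_hints = 0;
	FILE *f = fopen("/sys/block/nvme0n1/queue/max_write_hints", "r");

	if (!f)
		return 1;
	if (fscanf(f, "%u", &max_hints) != 1)
		max_hints = 0;
	fclose(f);
	printf("write hints supported: %u (valid hints: 1..%u)\n",
	       max_hints, max_hints);
	return 0;
}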
* [PATCHv10 3/9] statx: add write hint information
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, io-uring
Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch
From: Keith Busch <[email protected]>
If requested on a raw block device, report the maximum write hint the
block device supports.
Signed-off-by: Keith Busch <[email protected]>
---
block/bdev.c | 5 +++++
fs/stat.c | 1 +
include/linux/stat.h | 1 +
include/uapi/linux/stat.h | 3 ++-
4 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/block/bdev.c b/block/bdev.c
index 738e3c8457e7f..9a59f0c882170 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1296,6 +1296,11 @@ void bdev_statx(struct path *path, struct kstat *stat,
stat->result_mask |= STATX_DIOALIGN;
}
+ if (request_mask & STATX_WRITE_HINT) {
+ stat->write_hint_max = bdev_max_write_hints(bdev);
+ stat->result_mask |= STATX_WRITE_HINT;
+ }
+
if (request_mask & STATX_WRITE_ATOMIC && bdev_can_atomic_write(bdev)) {
struct request_queue *bd_queue = bdev->bd_queue;
diff --git a/fs/stat.c b/fs/stat.c
index 41e598376d7e3..60bcd5c2e2a1d 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -704,6 +704,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
tmp.stx_atomic_write_unit_min = stat->atomic_write_unit_min;
tmp.stx_atomic_write_unit_max = stat->atomic_write_unit_max;
tmp.stx_atomic_write_segments_max = stat->atomic_write_segments_max;
+ tmp.stx_write_hint_max = stat->write_hint_max;
return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
}
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 3d900c86981c5..48f0f64846a02 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -57,6 +57,7 @@ struct kstat {
u32 atomic_write_unit_min;
u32 atomic_write_unit_max;
u32 atomic_write_segments_max;
+ u32 write_hint_max;
};
/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 887a252864416..10f5622c21113 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -132,7 +132,7 @@ struct statx {
__u32 stx_atomic_write_unit_max; /* Max atomic write unit in bytes */
/* 0xb0 */
__u32 stx_atomic_write_segments_max; /* Max atomic write segment count */
- __u32 __spare1[1];
+ __u32 stx_write_hint_max;
/* 0xb8 */
__u64 __spare3[9]; /* Spare space for future expansion */
/* 0x100 */
@@ -164,6 +164,7 @@ struct statx {
#define STATX_MNT_ID_UNIQUE 0x00004000U /* Want/got extended stx_mount_id */
#define STATX_SUBVOL 0x00008000U /* Want/got stx_subvol */
#define STATX_WRITE_ATOMIC 0x00010000U /* Want/got atomic_write_* fields */
+#define STATX_WRITE_HINT 0x00020000U /* Want/got write_hint_max */
#define STATX__RESERVED 0x80000000U /* Reserved for future struct statx expansion */
--
2.43.5
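A hedged userspace sketch of querying the new field; it assumes headers rebuilt with this patch's uapi additions (STATX_WRITE_HINT and stx_write_hint_max), and the device path is only an example:

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <sys/stat.h>

#ifndef STATX_WRITE_HINT
#define STATX_WRITE_HINT 0x00020000U	/* value from this patch */
#endif

int main(void)
{
	struct statx stx;

	if (statx(AT_FDCWD, "/dev/nvme0n1", 0, STATX_WRITE_HINT, &stx))
		return 1;
	if (stx.stx_mask & STATX_WRITE_HINT)
		printf("max write hint: %u\n", stx.stx_write_hint_max);
	else
		printf("write hints not reported\n");
	return 0;
}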
* [PATCHv10 4/9] block: allow ability to limit partition write hints
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, io-uring
Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch
From: Keith Busch <[email protected]>
When multiple partitions are used, you may want to enforce different
subsets of the available write hints for each partition. Provide a
bitmap attribute of the available write hints, and allow an admin to
write a different mask to set the partition's allowed write hints.
Signed-off-by: Keith Busch <[email protected]>
---
Documentation/ABI/stable/sysfs-block | 6 ++++
block/bdev.c | 13 ++++++++
block/partitions/core.c | 44 ++++++++++++++++++++++++++--
include/linux/blk_types.h | 1 +
4 files changed, 62 insertions(+), 2 deletions(-)
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index f2db2cabb8e75..8ab9f030e61d1 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -187,6 +187,12 @@ Description:
partition is offset from the internal allocation unit's
natural alignment.
+What: /sys/block/<disk>/<partition>/write_hint_mask
+Date: October 2024
+Contact: [email protected]
+Description:
+ The mask of allowed write hints. You can limit which hints the
+ block layer will use by writing a new mask.
What: /sys/block/<disk>/<partition>/stat
Date: February 2008
diff --git a/block/bdev.c b/block/bdev.c
index 9a59f0c882170..0c5e87b111aed 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -415,6 +415,7 @@ void __init bdev_cache_init(void)
struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
{
struct block_device *bdev;
+ unsigned short write_hint;
struct inode *inode;
inode = new_inode(blockdev_superblock);
@@ -440,6 +441,18 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
return NULL;
}
bdev->bd_disk = disk;
+
+ write_hint = bdev_max_write_hints(bdev);
+ if (write_hint) {
+ bdev->write_hint_mask = bitmap_alloc(write_hint, GFP_KERNEL);
+ if (!bdev->write_hint_mask) {
+ free_percpu(bdev->bd_stats);
+ iput(inode);
+ return NULL;
+ }
+ bitmap_set(bdev->write_hint_mask, 0, write_hint);
+ }
+
return bdev;
}
diff --git a/block/partitions/core.c b/block/partitions/core.c
index 815ed33caa1b8..24e1a79972f75 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -203,6 +203,40 @@ static ssize_t part_discard_alignment_show(struct device *dev,
return sprintf(buf, "%u\n", bdev_discard_alignment(dev_to_bdev(dev)));
}
+static ssize_t part_write_hint_mask_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct block_device *bdev = dev_to_bdev(dev);
+ unsigned short max_write_hints = bdev_max_write_hints(bdev);
+
+ if (!max_write_hints)
+ return sprintf(buf, "0");
+ return sprintf(buf, "%*pb\n", max_write_hints, bdev->write_hint_mask);
+}
+
+static ssize_t part_write_hint_mask_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct block_device *bdev = dev_to_bdev(dev);
+ unsigned short max_write_hints = bdev_max_write_hints(bdev);
+ unsigned long *new_mask;
+
+ if (!max_write_hints)
+ return count;
+
+ new_mask = bitmap_alloc(max_write_hints, GFP_KERNEL);
+ if (!new_mask)
+ return -ENOMEM;
+
+ bitmap_parse(buf, count, new_mask, max_write_hints);
+ bitmap_copy(bdev->write_hint_mask, new_mask, max_write_hints);
+ bitmap_free(new_mask);
+
+ return count;
+}
+
static DEVICE_ATTR(partition, 0444, part_partition_show, NULL);
static DEVICE_ATTR(start, 0444, part_start_show, NULL);
static DEVICE_ATTR(size, 0444, part_size_show, NULL);
@@ -211,6 +245,8 @@ static DEVICE_ATTR(alignment_offset, 0444, part_alignment_offset_show, NULL);
static DEVICE_ATTR(discard_alignment, 0444, part_discard_alignment_show, NULL);
static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
+static DEVICE_ATTR(write_hint_mask, 0644, part_write_hint_mask_show,
+ part_write_hint_mask_store);
#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
__ATTR(make-it-fail, 0644, part_fail_show, part_fail_store);
@@ -225,6 +261,7 @@ static struct attribute *part_attrs[] = {
&dev_attr_discard_alignment.attr,
&dev_attr_stat.attr,
&dev_attr_inflight.attr,
+ &dev_attr_write_hint_mask.attr,
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
#endif
@@ -245,8 +282,11 @@ static const struct attribute_group *part_attr_groups[] = {
static void part_release(struct device *dev)
{
- put_disk(dev_to_bdev(dev)->bd_disk);
- bdev_drop(dev_to_bdev(dev));
+ struct block_device *part = dev_to_bdev(dev);
+
+ bitmap_free(part->write_hint_mask);
+ put_disk(part->bd_disk);
+ bdev_drop(part);
}
static int part_uevent(const struct device *dev, struct kobj_uevent_env *env)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 6737795220e18..af430e543f7f7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -73,6 +73,7 @@ struct block_device {
#ifdef CONFIG_SECURITY
void *bd_security;
#endif
+ unsigned long *write_hint_mask;
/*
* keep this out-of-line as it's both big and not needed in the fast
* path
--
2.43.5
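As a usage sketch (hypothetical paths, minimal error handling): restricting a partition to hints 1..4 means writing a bitmap with the low four bits set, since the direct-io check added in a later patch tests bit (hint - 1) of write_hint_mask:

#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *mask = "f";	/* hex bitmap: bits 0-3 -> hints 1-4 */
	FILE *f = fopen("/sys/block/nvme0n1/nvme0n1p1/write_hint_mask", "w");

	if (!f)
		return 1;
	fwrite(mask, 1, strlen(mask), f);
	fclose(f);
	return 0;
}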
* Re: [PATCHv10 4/9] block: allow ability to limit partition write hints
From: Christoph Hellwig @ 2024-10-29 15:23 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
joshi.k, javier.gonz, bvanassche, Keith Busch
On Tue, Oct 29, 2024 at 08:19:17AM -0700, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> When multiple partitions are used, you may want to enforce different
> subsets of the available write hints for each partition. Provide a
> bitmap attribute of the available write hints, and allow an admin to
> write a different mask to set the partition's allowed write hints.
This still offers all hints to all partitions by default, which still
breaks all use cases where you have actual users of the stream separation
on multiple partitions.
Please assign either all resources to the first partition and none to
the others (probably the easiest and useful for the most common use
case) or split it evenly.
Note that the bdev_max_streams value also needs to be adjusted to only
return the number of streams actually available to a partition.
* Re: [PATCHv10 4/9] block: allow ability to limit partition write hints
From: Bart Van Assche @ 2024-10-29 17:25 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring
Cc: linux-fsdevel, hch, joshi.k, javier.gonz, Keith Busch
On 10/29/24 8:19 AM, Keith Busch wrote:
> +static ssize_t part_write_hint_mask_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + struct block_device *bdev = dev_to_bdev(dev);
> + unsigned short max_write_hints = bdev_max_write_hints(bdev);
> + unsigned long *new_mask;
> +
> + if (!max_write_hints)
> + return count;
> +
> + new_mask = bitmap_alloc(max_write_hints, GFP_KERNEL);
> + if (!new_mask)
> + return -ENOMEM;
> +
> + bitmap_parse(buf, count, new_mask, max_write_hints);
> + bitmap_copy(bdev->write_hint_mask, new_mask, max_write_hints);
> + bitmap_free(new_mask);
> +
> + return count;
> +}
bitmap_copy() is not atomic. Shouldn't the bitmap_copy() call be
serialized against the code that tests bits in bdev->write_hint_mask?
Thanks,
Bart.
* Re: [PATCHv10 4/9] block: allow ability to limit partition write hints
From: Christoph Hellwig @ 2024-10-30 4:46 UTC (permalink / raw)
To: Bart Van Assche
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
linux-fsdevel, hch, joshi.k, javier.gonz, Keith Busch
On Tue, Oct 29, 2024 at 10:25:11AM -0700, Bart Van Assche wrote:
>> +}
>
> bitmap_copy() is not atomic. Shouldn't the bitmap_copy() call be
> serialized against the code that tests bits in bdev->write_hint_mask?
It needs something. I actually pointed that out last round, but forgot
about it again this time :)
* Re: [PATCHv10 4/9] block: allow ability to limit partition write hints
From: Keith Busch @ 2024-10-30 20:11 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Bart Van Assche, Keith Busch, linux-block, linux-nvme, linux-scsi,
io-uring, linux-fsdevel, joshi.k, javier.gonz
On Wed, Oct 30, 2024 at 05:46:58AM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 10:25:11AM -0700, Bart Van Assche wrote:
> >> +}
> >
> > bitmap_copy() is not atomic. Shouldn't the bitmap_copy() call be
> > serialized against the code that tests bits in bdev->write_hint_mask?
>
> It needs something. I actually pointed that out last round, but forgot
> about it again this time :)
I disagree. Whether we serialize it or not, each write in flight will
either see that it may use the hint or it won't. There's no point adding
any overhead to the IO path for this, as you can't prevent ending up with
inflight writes using the tag you're trying to turn off.
This is the same as the "ro" attribute. If you change it with the
BLKROSET ioctl while writes are in flight, some of those writes may get
past blkdev_write_iter's read-only check and others may not. No
serialization is done there, and this is pretty much the same thing.
* Re: [PATCHv10 4/9] block: allow ability to limit partition write hints
From: Bart Van Assche @ 2024-10-30 20:26 UTC (permalink / raw)
To: Keith Busch, Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
linux-fsdevel, joshi.k, javier.gonz
On 10/30/24 1:11 PM, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 05:46:58AM +0100, Christoph Hellwig wrote:
>> On Tue, Oct 29, 2024 at 10:25:11AM -0700, Bart Van Assche wrote:
>>>> +}
>>>
>>> bitmap_copy() is not atomic. Shouldn't the bitmap_copy() call be
>>> serialized against the code that tests bits in bdev->write_hint_mask?
>>
>> It needs something. I actually pointed that out last round, but forgot
>> about it again this time :)
>
> I disagree. Whether we serialize it or not, writes in flight will either
> think it can write or it won't. There's no point adding any overhead to
> the IO path for this as you can't stop ending up with inflight writes
> using the tag you're trying to turn off.
Shouldn't the request queue be frozen while this write_hint_mask bitmap
is modified, just like the request queue is frozen while queue limits
are updated? This change wouldn't add any additional overhead to the I/O
path.
Thanks,
Bart.
* Re: [PATCHv10 4/9] block: allow ability to limit partition write hints
From: Keith Busch @ 2024-10-30 20:37 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz
On Wed, Oct 30, 2024 at 01:26:38PM -0700, Bart Van Assche wrote:
> On 10/30/24 1:11 PM, Keith Busch wrote:
> > On Wed, Oct 30, 2024 at 05:46:58AM +0100, Christoph Hellwig wrote:
> > > On Tue, Oct 29, 2024 at 10:25:11AM -0700, Bart Van Assche wrote:
> > > > > +}
> > > >
> > > > bitmap_copy() is not atomic. Shouldn't the bitmap_copy() call be
> > > > serialized against the code that tests bits in bdev->write_hint_mask?
> > >
> > > It needs something. I actually pointed that out last round, but forgot
> > > about it again this time :)
> >
> > I disagree. Whether we serialize it or not, writes in flight will either
> > think it can write or it won't. There's no point adding any overhead to
> > the IO path for this as you can't stop ending up with inflight writes
> > using the tag you're trying to turn off.
>
> Shouldn't the request queue be frozen while this write_hint_mask bitmap
> is modified, just like the request queue is frozen while queue limits
> are updated? This change wouldn't add any additional overhead to the I/O
> path.
The partitions don't have a queue. If we need to freeze, then changing
one partition's available hints harms IO to other partitions.
Also, block direct IO creates the bio before it freezes. Freezing would
only make writes that use the hint you're trying to disable queue up after
all the checks have been done, so you still can't prevent inflight writes
with a freeze.
But if by "not atomic" you're just saying we need a barrier on the
bitmap_copy, like smp_mb__after_atomic(), then yeah, I see that's
probably appropriate here.
* Re: [PATCHv10 4/9] block: allow ability to limit partition write hints
From: Bart Van Assche @ 2024-10-30 21:15 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz
On 10/30/24 1:37 PM, Keith Busch wrote:
> But if by "not atomic", if you're just saying we need a barrier on the
> bitmap_copy, like smp_mb__after_atomic(), then yeah, I see that's
> probably appropriate here.
smp_mb__after_atomic() follows atomic operations. bitmap_copy() does not
use any kind of atomic operation.
I'm wondering whether introducing a variant of bitmap_copy() that uses
WRITE_ONCE() or smp_store_release() would be appropriate.
Thanks,
Bart.
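For illustration, such a variant might look roughly like the sketch below (a hypothetical helper, not an existing kernel API):

/*
 * Sketch of the idea: copy the bitmap word by word with WRITE_ONCE()
 * so lockless readers never observe a torn word. Illustrative only.
 */
static inline void bitmap_copy_once(unsigned long *dst,
				    const unsigned long *src,
				    unsigned int nbits)
{
	unsigned int i, len = BITS_TO_LONGS(nbits);

	for (i = 0; i < len; i++)
		WRITE_ONCE(dst[i], READ_ONCE(src[i]));
}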
* [PATCHv10 5/9] block, fs: add write hint to kiocb
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, io-uring
Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch
From: Keith Busch <[email protected]>
This prepares for sources other than the inode to provide a write hint.
The block layer will use it for direct IO if the requested hint is
within the block device's allowed hints.
Signed-off-by: Keith Busch <[email protected]>
---
block/fops.c | 31 ++++++++++++++++++++++++++++---
include/linux/fs.h | 1 +
2 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/block/fops.c b/block/fops.c
index 2d01c90076813..bb3855ee044f0 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -71,7 +71,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
}
bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
- bio.bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+ bio.bi_write_hint = iocb->ki_write_hint;
bio.bi_ioprio = iocb->ki_ioprio;
if (iocb->ki_flags & IOCB_ATOMIC)
bio.bi_opf |= REQ_ATOMIC;
@@ -200,7 +200,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
for (;;) {
bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
- bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+ bio->bi_write_hint = iocb->ki_write_hint;
bio->bi_private = dio;
bio->bi_end_io = blkdev_bio_end_io;
bio->bi_ioprio = iocb->ki_ioprio;
@@ -316,7 +316,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
dio->flags = 0;
dio->iocb = iocb;
bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
- bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+ bio->bi_write_hint = iocb->ki_write_hint;
bio->bi_end_io = blkdev_bio_end_io_async;
bio->bi_ioprio = iocb->ki_ioprio;
@@ -362,6 +362,23 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
return -EIOCBQUEUED;
}
+static int blkdev_write_hint(struct kiocb *iocb, struct block_device *bdev)
+{
+ u16 hint = iocb->ki_write_hint;
+
+ if (!hint)
+ return file_inode(iocb->ki_filp)->i_write_hint;
+
+ if (hint > bdev_max_write_hints(bdev))
+ return -EINVAL;
+
+ if (bdev_is_partition(bdev) &&
+ !test_bit(hint - 1, bdev->write_hint_mask))
+ return -EINVAL;
+
+ return hint;
+}
+
static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
@@ -373,6 +390,14 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
if (blkdev_dio_invalid(bdev, iocb, iter))
return -EINVAL;
+ if (iov_iter_rw(iter) == WRITE) {
+ int hint = blkdev_write_hint(iocb, bdev);
+
+ if (hint < 0)
+ return hint;
+ iocb->ki_write_hint = hint;
+ }
+
nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1);
if (likely(nr_pages <= BIO_MAX_VECS)) {
if (is_sync_kiocb(iocb))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4b5cad44a1268..1a00accf412e5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -370,6 +370,7 @@ struct kiocb {
void *private;
int ki_flags;
u16 ki_ioprio; /* See linux/ioprio.h */
+ u16 ki_write_hint;
union {
/*
* Only used for async buffered reads, where it denotes the
--
2.43.5
* [PATCHv10 6/9] io_uring: enable per-io hinting capability
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, io-uring
Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche,
Nitesh Shetty, Keith Busch
From: Kanchan Joshi <[email protected]>
With the F_SET_RW_HINT fcntl, the user can set a hint on the file inode, and
all subsequent writes on the file pass that hint value down. This
can be limiting for a block device as all the writes will be tagged with
only one lifetime hint value. Concurrent writes (with different hint
values) are hard to manage. Per-IO hinting solves that problem.
Allow userspace to pass additional metadata in the SQE.
__u16 write_hint;
If the hint is provided, filesystems may optionally use it. A filesystem
may ignore this field if it does not support per-io hints, or if the
value is invalid for its backing storage. Just like the inode hints,
requesting values that are not supported by the hardware is not an
error.
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
include/uapi/linux/io_uring.h | 4 ++++
io_uring/io_uring.c | 2 ++
io_uring/rw.c | 3 ++-
3 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 0247452837830..6e1985d3b306c 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -92,6 +92,10 @@ struct io_uring_sqe {
__u16 addr_len;
__u16 __pad3[1];
};
+ struct {
+ __u16 write_hint;
+ __u16 __pad4[1];
+ };
};
union {
struct {
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4514644fdf52e..65b8e29b67ec7 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3875,7 +3875,9 @@ static int __init io_uring_init(void)
BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in);
BUILD_BUG_SQE_ELEM(44, __u32, file_index);
BUILD_BUG_SQE_ELEM(44, __u16, addr_len);
+ BUILD_BUG_SQE_ELEM(44, __u16, write_hint);
BUILD_BUG_SQE_ELEM(46, __u16, __pad3[0]);
+ BUILD_BUG_SQE_ELEM(46, __u16, __pad4[0]);
BUILD_BUG_SQE_ELEM(48, __u64, addr3);
BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
BUILD_BUG_SQE_ELEM(56, __u64, __pad2);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 7ce1cbc048faf..b5dea58356d93 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -279,7 +279,8 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
rw->kiocb.ki_ioprio = get_current_ioprio();
}
rw->kiocb.dio_complete = NULL;
-
+ if (ddir == ITER_SOURCE)
+ rw->kiocb.ki_write_hint = READ_ONCE(sqe->write_hint);
rw->addr = READ_ONCE(sqe->addr);
rw->len = READ_ONCE(sqe->len);
rw->flags = READ_ONCE(sqe->rw_flags);
--
2.43.5
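A hedged userspace usage sketch: it assumes liburing plus an io_uring.h rebuilt with this patch (so struct io_uring_sqe has the write_hint field), and the fd/hint values are arbitrary examples:

#include <liburing.h>

/* Submit one write tagged with per-IO write hint 3 (sketch only). */
static int write_with_hint(int fd, const void *buf, unsigned len, __u64 off)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int ret;

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret)
		return ret;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, len, off);
	sqe->write_hint = 3;	/* new field added by this patch */

	io_uring_submit(&ring);
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (!ret) {
		ret = cqe->res;
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return ret;
}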
* Re: [PATCHv10 6/9] io_uring: enable per-io hinting capability
From: Jens Axboe @ 2024-11-07 2:09 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring
Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche,
Nitesh Shetty, Keith Busch
On 10/29/24 9:19 AM, Keith Busch wrote:
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 0247452837830..6e1985d3b306c 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -92,6 +92,10 @@ struct io_uring_sqe {
> __u16 addr_len;
> __u16 __pad3[1];
> };
> + struct {
> + __u16 write_hint;
> + __u16 __pad4[1];
> + };
Might make more sense to have this overlap further down, with the
passthrough command. That'd put it solidly out of anything that isn't
passthrough or needs addr3.
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 7ce1cbc048faf..b5dea58356d93 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -279,7 +279,8 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> rw->kiocb.ki_ioprio = get_current_ioprio();
> }
> rw->kiocb.dio_complete = NULL;
> -
> + if (ddir == ITER_SOURCE)
> + rw->kiocb.ki_write_hint = READ_ONCE(sqe->write_hint);
> rw->addr = READ_ONCE(sqe->addr);
> rw->len = READ_ONCE(sqe->len);
> rw->flags = READ_ONCE(sqe->rw_flags);
Can't we just read it unconditionally? I know it's a write hint, hence
the check for ITER_SOURCE, but if we can just set it regardless, then
we don't need to branch around that.
--
Jens Axboe
* [PATCHv10 7/9] block: export placement hint feature
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, io-uring
Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch
From: Keith Busch <[email protected]>
Add a feature flag for devices that support generic placement hints in
write commands. This is in contrast to data lifetime hints.
Signed-off-by: Keith Busch <[email protected]>
---
block/blk-settings.c | 2 ++
block/blk-sysfs.c | 3 +++
include/linux/blkdev.h | 3 +++
3 files changed, 8 insertions(+)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 921fb4d334fa4..7e3bb3cec7032 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -518,6 +518,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->features &= ~BLK_FEAT_NOWAIT;
if (!(b->features & BLK_FEAT_POLL))
t->features &= ~BLK_FEAT_POLL;
+ if (!(b->features & BLK_FEAT_PLACEMENT_HINTS))
+ t->features &= ~BLK_FEAT_PLACEMENT_HINTS;
t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 85f48ca461049..51e0b99c2210a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -260,6 +260,7 @@ static ssize_t queue_##_name##_show(struct gendisk *disk, char *page) \
QUEUE_SYSFS_FEATURE_SHOW(poll, BLK_FEAT_POLL);
QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA);
QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX);
+QUEUE_SYSFS_FEATURE_SHOW(placement_hints, BLK_FEAT_PLACEMENT_HINTS);
static ssize_t queue_zoned_show(struct gendisk *disk, char *page)
{
@@ -497,6 +498,7 @@ QUEUE_RW_ENTRY(queue_poll_delay, "io_poll_delay");
QUEUE_RW_ENTRY(queue_wc, "write_cache");
QUEUE_RO_ENTRY(queue_fua, "fua");
QUEUE_RO_ENTRY(queue_dax, "dax");
+QUEUE_RO_ENTRY(queue_placement_hints, "placement_hints");
QUEUE_RW_ENTRY(queue_io_timeout, "io_timeout");
QUEUE_RO_ENTRY(queue_virt_boundary_mask, "virt_boundary_mask");
QUEUE_RO_ENTRY(queue_dma_alignment, "dma_alignment");
@@ -626,6 +628,7 @@ static struct attribute *queue_attrs[] = {
&queue_wc_entry.attr,
&queue_fua_entry.attr,
&queue_dax_entry.attr,
+ &queue_placement_hints_entry.attr,
&queue_poll_delay_entry.attr,
&queue_virt_boundary_mask_entry.attr,
&queue_dma_alignment_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 90a19ea889276..c577a607591de 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -333,6 +333,9 @@ typedef unsigned int __bitwise blk_features_t;
#define BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE \
((__force blk_features_t)(1u << 15))
+/* supports generic write placement hints */
+#define BLK_FEAT_PLACEMENT_HINTS ((__force blk_features_t)(1u << 16))
+
/*
* Flags automatically inherited when stacking limits.
*/
--
2.43.5
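For illustration, a small userspace sketch that tells devices with real placement hints apart from those that only take lifetime hints, by reading the new queue attribute next to max_write_hints (disk name is an example):

#include <stdio.h>

static unsigned int read_queue_attr(const char *attr)
{
	char path[128];
	unsigned int val = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/nvme0n1/queue/%s", attr);
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%u", &val) != 1)
			val = 0;
		fclose(f);
	}
	return val;
}

int main(void)
{
	printf("%u write hints, placement semantics: %s\n",
	       read_queue_attr("max_write_hints"),
	       read_queue_attr("placement_hints") ? "yes" : "no");
	return 0;
}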
* [PATCHv10 8/9] nvme: enable FDP support
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, io-uring
Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Hui Qi,
Nitesh Shetty, Hannes Reinecke, Keith Busch
From: Kanchan Joshi <[email protected]>
Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
to control the placement of logical blocks so as to reduce the SSD WAF.
Userspace can send the write hint information using io_uring or fcntl.
Fetch the placement-identifiers if the device supports FDP. The incoming
write-hint is mapped to a placement-identifier, which in turn is set in
the DSPEC field of the write command.
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Hui Qi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
drivers/nvme/host/core.c | 84 ++++++++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 5 +++
include/linux/nvme.h | 19 +++++++++
3 files changed, 108 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3de7555a7de74..bd7b89912ddb9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -44,6 +44,20 @@ struct nvme_ns_info {
bool is_removed;
};
+struct nvme_fdp_ruh_status_desc {
+ u16 pid;
+ u16 ruhid;
+ u32 earutr;
+ u64 ruamw;
+ u8 rsvd16[16];
+};
+
+struct nvme_fdp_ruh_status {
+ u8 rsvd0[14];
+ __le16 nruhsd;
+ struct nvme_fdp_ruh_status_desc ruhsd[];
+};
+
unsigned int admin_timeout = 60;
module_param(admin_timeout, uint, 0644);
MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
@@ -657,6 +671,7 @@ static void nvme_free_ns_head(struct kref *ref)
ida_free(&head->subsys->ns_ida, head->instance);
cleanup_srcu_struct(&head->srcu);
nvme_put_subsystem(head->subsys);
+ kfree(head->plids);
kfree(head);
}
@@ -974,6 +989,13 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
if (req->cmd_flags & REQ_RAHEAD)
dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
+ if (req->write_hint && ns->head->nr_plids) {
+ u16 hint = max(req->write_hint, ns->head->nr_plids);
+
+ dsmgmt |= ns->head->plids[hint - 1] << 16;
+ control |= NVME_RW_DTYPE_DPLCMT;
+ }
+
if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req))
return BLK_STS_INVAL;
@@ -2105,6 +2127,52 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
return ret;
}
+static int nvme_fetch_fdp_plids(struct nvme_ns *ns, u32 nsid)
+{
+ struct nvme_fdp_ruh_status_desc *ruhsd;
+ struct nvme_ns_head *head = ns->head;
+ struct nvme_fdp_ruh_status *ruhs;
+ struct nvme_command c = {};
+ int size, ret, i;
+
+ if (head->plids)
+ return 0;
+
+ size = struct_size(ruhs, ruhsd, NVME_MAX_PLIDS);
+ ruhs = kzalloc(size, GFP_KERNEL);
+ if (!ruhs)
+ return -ENOMEM;
+
+ c.imr.opcode = nvme_cmd_io_mgmt_recv;
+ c.imr.nsid = cpu_to_le32(nsid);
+ c.imr.mo = 0x1;
+ c.imr.numd = cpu_to_le32((size >> 2) - 1);
+
+ ret = nvme_submit_sync_cmd(ns->queue, &c, ruhs, size);
+ if (ret)
+ goto out;
+
+ i = le16_to_cpu(ruhs->nruhsd);
+ if (!i)
+ goto out;
+
+ ns->head->nr_plids = min_t(u16, i, NVME_MAX_PLIDS);
+ head->plids = kcalloc(ns->head->nr_plids, sizeof(head->plids),
+ GFP_KERNEL);
+ if (!head->plids) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ for (i = 0; i < ns->head->nr_plids; i++) {
+ ruhsd = &ruhs->ruhsd[i];
+ head->plids[i] = le16_to_cpu(ruhsd->pid);
+ }
+out:
+ kfree(ruhs);
+ return ret;
+}
+
static int nvme_update_ns_info_block(struct nvme_ns *ns,
struct nvme_ns_info *info)
{
@@ -2141,6 +2209,19 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
goto out;
}
+ if (ns->ctrl->ctratt & NVME_CTRL_ATTR_FDPS) {
+ ret = nvme_fetch_fdp_plids(ns, info->nsid);
+ if (ret)
+ dev_warn(ns->ctrl->device,
+ "FDP failure status:0x%x\n", ret);
+ if (ret < 0)
+ goto out;
+ } else {
+ ns->head->nr_plids = 0;
+ kfree(ns->head->plids);
+ ns->head->plids = NULL;
+ }
+
blk_mq_freeze_queue(ns->disk->queue);
ns->head->lba_shift = id->lbaf[lbaf].ds;
ns->head->nuse = le64_to_cpu(id->nuse);
@@ -2171,6 +2252,9 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
if (!nvme_init_integrity(ns->head, &lim, info))
capacity = 0;
+ lim.max_write_hints = ns->head->nr_plids;
+ if (lim.max_write_hints)
+ lim.features |= BLK_FEAT_PLACEMENT_HINTS;
ret = queue_limits_commit_update(ns->disk->queue, &lim);
if (ret) {
blk_mq_unfreeze_queue(ns->disk->queue);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 093cb423f536b..cec8e5d96377b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -454,6 +454,8 @@ struct nvme_ns_ids {
u8 csi;
};
+#define NVME_MAX_PLIDS (NVME_CTRL_PAGE_SIZE / sizeof(16))
+
/*
* Anchor structure for namespaces. There is one for each namespace in a
* NVMe subsystem that any of our controllers can see, and the namespace
@@ -490,6 +492,9 @@ struct nvme_ns_head {
struct device cdev_device;
struct gendisk *disk;
+
+ u16 nr_plids;
+ u16 *plids;
#ifdef CONFIG_NVME_MULTIPATH
struct bio_list requeue_list;
spinlock_t requeue_lock;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b58d9405d65e0..a954eaee5b0f3 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -275,6 +275,7 @@ enum nvme_ctrl_attr {
NVME_CTRL_ATTR_HID_128_BIT = (1 << 0),
NVME_CTRL_ATTR_TBKAS = (1 << 6),
NVME_CTRL_ATTR_ELBAS = (1 << 15),
+ NVME_CTRL_ATTR_FDPS = (1 << 19),
};
struct nvme_id_ctrl {
@@ -843,6 +844,7 @@ enum nvme_opcode {
nvme_cmd_resv_register = 0x0d,
nvme_cmd_resv_report = 0x0e,
nvme_cmd_resv_acquire = 0x11,
+ nvme_cmd_io_mgmt_recv = 0x12,
nvme_cmd_resv_release = 0x15,
nvme_cmd_zone_mgmt_send = 0x79,
nvme_cmd_zone_mgmt_recv = 0x7a,
@@ -864,6 +866,7 @@ enum nvme_opcode {
nvme_opcode_name(nvme_cmd_resv_register), \
nvme_opcode_name(nvme_cmd_resv_report), \
nvme_opcode_name(nvme_cmd_resv_acquire), \
+ nvme_opcode_name(nvme_cmd_io_mgmt_recv), \
nvme_opcode_name(nvme_cmd_resv_release), \
nvme_opcode_name(nvme_cmd_zone_mgmt_send), \
nvme_opcode_name(nvme_cmd_zone_mgmt_recv), \
@@ -1015,6 +1018,7 @@ enum {
NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12,
NVME_RW_PRINFO_PRACT = 1 << 13,
NVME_RW_DTYPE_STREAMS = 1 << 4,
+ NVME_RW_DTYPE_DPLCMT = 2 << 4,
NVME_WZ_DEAC = 1 << 9,
};
@@ -1102,6 +1106,20 @@ struct nvme_zone_mgmt_recv_cmd {
__le32 cdw14[2];
};
+struct nvme_io_mgmt_recv_cmd {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __le64 rsvd2[2];
+ union nvme_data_ptr dptr;
+ __u8 mo;
+ __u8 rsvd11;
+ __u16 mos;
+ __le32 numd;
+ __le32 cdw12[4];
+};
+
enum {
NVME_ZRA_ZONE_REPORT = 0,
NVME_ZRASF_ZONE_REPORT_ALL = 0,
@@ -1822,6 +1840,7 @@ struct nvme_command {
struct nvmf_auth_receive_command auth_receive;
struct nvme_dbbuf dbbuf;
struct nvme_directive_cmd directive;
+ struct nvme_io_mgmt_recv_cmd imr;
};
};
--
2.43.5
* Re: [PATCHv10 8/9] nvme: enable FDP support
From: Chaitanya Kulkarni @ 2024-10-30 0:24 UTC (permalink / raw)
To: Keith Busch
Cc: [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected], Hui Qi,
Nitesh Shetty, Hannes Reinecke, Keith Busch,
[email protected]
On 10/29/24 08:19, Keith Busch wrote:
> From: Kanchan Joshi <[email protected]>
>
> Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
> to control the placement of logical blocks so as to reduce the SSD WAF.
> Userspace can send the write hint information using io_uring or fcntl.
>
> Fetch the placement-identifiers if the device supports FDP. The incoming
> write-hint is mapped to a placement-identifier, which in turn is set in
> the DSPEC field of the write command.
>
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Hui Qi <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
> Reviewed-by: Hannes Reinecke <[email protected]>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> drivers/nvme/host/core.c | 84 ++++++++++++++++++++++++++++++++++++++++
> drivers/nvme/host/nvme.h | 5 +++
> include/linux/nvme.h | 19 +++++++++
> 3 files changed, 108 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 3de7555a7de74..bd7b89912ddb9 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -44,6 +44,20 @@ struct nvme_ns_info {
> bool is_removed;
> };
>
> +struct nvme_fdp_ruh_status_desc {
> + u16 pid;
> + u16 ruhid;
> + u32 earutr;
> + u64 ruamw;
> + u8 rsvd16[16];
> +};
> +
> +struct nvme_fdp_ruh_status {
> + u8 rsvd0[14];
> + __le16 nruhsd;
> + struct nvme_fdp_ruh_status_desc ruhsd[];
> +};
> +
> unsigned int admin_timeout = 60;
> module_param(admin_timeout, uint, 0644);
> MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
> @@ -657,6 +671,7 @@ static void nvme_free_ns_head(struct kref *ref)
> ida_free(&head->subsys->ns_ida, head->instance);
> cleanup_srcu_struct(&head->srcu);
> nvme_put_subsystem(head->subsys);
> + kfree(head->plids);
> kfree(head);
> }
>
> @@ -974,6 +989,13 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
> if (req->cmd_flags & REQ_RAHEAD)
> dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
>
> + if (req->write_hint && ns->head->nr_plids) {
> + u16 hint = max(req->write_hint, ns->head->nr_plids);
> +
> + dsmgmt |= ns->head->plids[hint - 1] << 16;
> + control |= NVME_RW_DTYPE_DPLCMT;
> + }
> +
> if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req))
> return BLK_STS_INVAL;
>
> @@ -2105,6 +2127,52 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
> return ret;
> }
>
> +static int nvme_fetch_fdp_plids(struct nvme_ns *ns, u32 nsid)
> +{
> + struct nvme_fdp_ruh_status_desc *ruhsd;
> + struct nvme_ns_head *head = ns->head;
> + struct nvme_fdp_ruh_status *ruhs;
> + struct nvme_command c = {};
> + int size, ret, i;
> +
> + if (head->plids)
> + return 0;
> +
> + size = struct_size(ruhs, ruhsd, NVME_MAX_PLIDS);
> + ruhs = kzalloc(size, GFP_KERNEL);
> + if (!ruhs)
> + return -ENOMEM;
> +
> + c.imr.opcode = nvme_cmd_io_mgmt_recv;
> + c.imr.nsid = cpu_to_le32(nsid);
> + c.imr.mo = 0x1;
can we please add a comment where values are hardcoded?
> + c.imr.numd = cpu_to_le32((size >> 2) - 1);
> +
> + ret = nvme_submit_sync_cmd(ns->queue, &c, ruhs, size);
> + if (ret)
> + goto out;
> +
> + i = le16_to_cpu(ruhs->nruhsd);
instead of i, why can't we use a local variable nr_plids?
> + if (!i)
> + goto out;
> +
> + ns->head->nr_plids = min_t(u16, i, NVME_MAX_PLIDS);
> + head->plids = kcalloc(ns->head->nr_plids, sizeof(head->plids),
> + GFP_KERNEL);
> + if (!head->plids) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + for (i = 0; i < ns->head->nr_plids; i++) {
> + ruhsd = &ruhs->ruhsd[i];
> + head->plids[i] = le16_to_cpu(ruhsd->pid);
> + }
> +out:
> + kfree(ruhs);
> + return ret;
> +}
> +
> static int nvme_update_ns_info_block(struct nvme_ns *ns,
> struct nvme_ns_info *info)
> {
> @@ -2141,6 +2209,19 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
> goto out;
> }
>
> + if (ns->ctrl->ctratt & NVME_CTRL_ATTR_FDPS) {
> + ret = nvme_fetch_fdp_plids(ns, info->nsid);
> + if (ret)
> + dev_warn(ns->ctrl->device,
> + "FDP failure status:0x%x\n", ret);
> + if (ret < 0)
> + goto out;
> + } else {
> + ns->head->nr_plids = 0;
> + kfree(ns->head->plids);
> + ns->head->plids = NULL;
> + }
> +
> blk_mq_freeze_queue(ns->disk->queue);
> ns->head->lba_shift = id->lbaf[lbaf].ds;
> ns->head->nuse = le64_to_cpu(id->nuse);
> @@ -2171,6 +2252,9 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
> if (!nvme_init_integrity(ns->head, &lim, info))
> capacity = 0;
>
> + lim.max_write_hints = ns->head->nr_plids;
> + if (lim.max_write_hints)
> + lim.features |= BLK_FEAT_PLACEMENT_HINTS;
> ret = queue_limits_commit_update(ns->disk->queue, &lim);
> if (ret) {
> blk_mq_unfreeze_queue(ns->disk->queue);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 093cb423f536b..cec8e5d96377b 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -454,6 +454,8 @@ struct nvme_ns_ids {
> u8 csi;
> };
>
> +#define NVME_MAX_PLIDS (NVME_CTRL_PAGE_SIZE / sizeof(16))
this calculates how many plids can fit into the ctrl page size?
sorry but I didn't understand sizeof(16) here, since plids are u16
(nvme_ns_head -> u16 *plids). Should this be sizeof(u16)?
-ck
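Presumably the intent is something like the following (a sketch of the fix the comment suggests, not the posted code):

/* plids are u16, so size by the element type rather than the literal 16 */
#define NVME_MAX_PLIDS	(NVME_CTRL_PAGE_SIZE / sizeof(u16))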
* [PATCHv10 9/9] scsi: set permanent stream count in block limits
From: Keith Busch @ 2024-10-29 15:19 UTC (permalink / raw)
To: linux-block, linux-nvme, linux-scsi, io-uring
Cc: linux-fsdevel, hch, joshi.k, javier.gonz, bvanassche, Keith Busch,
Hannes Reinecke
From: Keith Busch <[email protected]>
The block limits exports the number of write hints, so set this limit if
the device reports support for the lifetime hints. Not only does this
inform the user of which hints are possible, it also allows scsi devices
supporting the feature to utilize the full range through raw block
device direct-io.
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
drivers/scsi/sd.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index ca4bc0ac76adc..235dd6e5b6688 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3768,6 +3768,8 @@ static int sd_revalidate_disk(struct gendisk *disk)
sd_config_protection(sdkp, &lim);
}
+ lim.max_write_hints = sdkp->permanent_stream_count;
+
/*
* We now have all cache related info, determine how we deal
* with flush requests.
--
2.43.5
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
From: Christoph Hellwig @ 2024-10-29 15:26 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
joshi.k, javier.gonz, bvanassche, Keith Busch, Hannes Reinecke
On Tue, Oct 29, 2024 at 08:19:22AM -0700, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> The block limits exports the number of write hints, so set this limit if
> the device reports support for the lifetime hints. Not only does this
> inform the user of which hints are possible, it also allows scsi devices
> supporting the feature to utilize the full range through raw block
> device direct-io.
>
> Reviewed-by: Bart Van Assche <[email protected]>
> Reviewed-by: Hannes Reinecke <[email protected]>
> Signed-off-by: Keith Busch <[email protected]>
Despite the reviews this is still incorrect. The permanent streams have
a relative data temperature associated with them as pointed out last
round and are not arbitrary write stream contexts despite (ab)using
the SBC streams facilities.
Bart, btw: I think the current sd implementation is buggy as well, as
it assumes the permanent streams are ordered by their data temperature
in the IO Advise hints mode page, but I can't find anything in the
spec that requires a particular ordering.
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
1 sibling, 1 reply; 106+ messages in thread
From: Keith Busch @ 2024-10-29 15:34 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke
On Tue, Oct 29, 2024 at 04:26:54PM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 08:19:22AM -0700, Keith Busch wrote:
> > From: Keith Busch <[email protected]>
> >
> > The block limits exports the number of write hints, so set this limit if
> > the device reports support for the lifetime hints. Not only does this
> > inform the user of which hints are possible, it also allows scsi devices
> > supporting the feature to utilize the full range through raw block
> > device direct-io.
> >
> > Reviewed-by: Bart Van Assche <[email protected]>
> > Reviewed-by: Hannes Reinecke <[email protected]>
> > Signed-off-by: Keith Busch <[email protected]>
>
> Despite the reviews this is still incorrect. The permanent streams have
> a relative data temperature associated with them as pointed out last
> round and are not arbitrary write stream contexts despite (ab)using
> the SBC streams facilities.
So then don't use it that way? I still don't know what change you're
expecting to happen with this feedback. What do you want the kernel to
do differently here?
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
From: Christoph Hellwig @ 2024-10-29 15:37 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
bvanassche, Hannes Reinecke
On Tue, Oct 29, 2024 at 09:34:07AM -0600, Keith Busch wrote:
> So then don't use it that way? I still don't know what change you're
> expecting to happen with this feedback. What do you want the kernel to
> do differently here?
Same as before: don't expose them as write streams, because they
aren't. A big mess in this series going back to the versions before
your involvement is that they somehow want to tie up the temperature
hints with the stream separation, which just ends up very messy.
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-29 15:37 ` Christoph Hellwig
@ 2024-10-29 15:38 ` Keith Busch
2024-10-29 15:53 ` Christoph Hellwig
0 siblings, 1 reply; 106+ messages in thread
From: Keith Busch @ 2024-10-29 15:38 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke
On Tue, Oct 29, 2024 at 04:37:02PM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 09:34:07AM -0600, Keith Busch wrote:
> > So then don't use it that way? I still don't know what change you're
> > expecting to happen with this feedback. What do you want the kernel to
> > do differently here?
>
> Same as before: don't expose them as write streams, because they
> aren't. A big mess in this series going back to the versions before
> your involvement is that they somehow want to tie up the temperature
> hints with the stream separation, which just ends up very messy.
They're not exposed as write streams. Patch 7/9 sets the feature if it
is a placement id or not, and only nvme sets it, so scsi's attributes
are not claiming to be a write stream.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-29 15:38 ` Keith Busch
@ 2024-10-29 15:53 ` Christoph Hellwig
2024-10-29 16:22 ` Keith Busch
0 siblings, 1 reply; 106+ messages in thread
From: Christoph Hellwig @ 2024-10-29 15:53 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
bvanassche, Hannes Reinecke
On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> They're not exposed as write streams. Patch 7/9 sets the feature if it
> is a placement id or not, and only nvme sets it, so scsi's attributes
> are not claiming to be a write stream.
So it shows up in sysfs, but:
- queue_max_write_hints (which really should be queue_max_write_streams)
still picks it up, and from there the statx interface
- per-inode fcntl hints that encode a temperature still magically
get dumped into the write streams if they are set.
In other words it's a really leaky half-baked abstraction.
Let's brainstorm how it could be done better:
- the max_write_streams value is only set by block devices that actually
do support write streams, and not for the fire-and-forget temperature
hints. The way this is queried is by having a non-zero value
there, no need for an extra flag.
- but the struct file (or maybe inode) gets a supported flag, as stream
separation needs to be supported by the file system
- a separate fcntl is used to set per-inode streams (if you care about
that, it seems like the bdev use case focuses on per-I/O). In that case
we'd probably also need a separate inode field for them, or a somewhat
complicated scheme to decide what is stored in the inode field if there
is only one.
- for block devices bdev/fops.c maps the temperature hints into write
streams if write streams are supported; any user that mixes and
matches write streams and temperature hints gets what they deserve
(a rough sketch of that mapping follows below)
- this could also be a helper for file systems that want to do the
same.
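Roughly, for that bdev/fops.c mapping, a minimal sketch could look like
the following (bdev_max_write_streams() is a made-up stand-in for
whatever limit accessor we end up with, not existing block layer API):

#include <linux/blkdev.h>
#include <linux/rw_hint.h>

static unsigned short bdev_hint_to_write_stream(struct block_device *bdev,
						enum rw_hint hint)
{
	/* hypothetical accessor; 0 means the device has no write streams */
	unsigned short max_streams = bdev_max_write_streams(bdev);

	if (!max_streams || hint == WRITE_LIFE_NOT_SET)
		return 0;	/* stream 0: let the device place the data */

	/*
	 * WRITE_LIFE_NONE..WRITE_LIFE_EXTREME are consecutive small
	 * integers, so just clamp the hint to the number of streams the
	 * device actually exposes.
	 */
	return min_t(unsigned short, hint, max_streams);
}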
Just a quick writeup while I'm on the run, there's probably a hole or
two that could be poked into it.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-29 15:53 ` Christoph Hellwig
@ 2024-10-29 16:22 ` Keith Busch
2024-10-30 4:55 ` Christoph Hellwig
0 siblings, 1 reply; 106+ messages in thread
From: Keith Busch @ 2024-10-29 16:22 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke
On Tue, Oct 29, 2024 at 04:53:30PM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> > They're not exposed as write streams. Patch 7/9 sets the feature if it
> > is a placement id or not, and only nvme sets it, so scsi's attributes
> > are not claiming to be a write stream.
>
> So it shows up in sysfs, but:
>
> - queue_max_write_hints (which really should be queue_max_write_streams)
> still picks it up, and from there the statx interface
>
> - per-inode fcntl hints that encode a temperature still magically
> get dumped into the write streams if they are set.
>
> In other words it's a really leaky half-baked abstraction.
Exactly why I asked last time: "who uses it and how do you want them to
use it" :)
> Let's brainstorm how it could be done better:
>
> - the max_write_streams value is only set by block devices that actually
> do support write streams, and not for the fire-and-forget temperature
> hints. The way this is queried is by having a non-zero value
> there, no need for an extra flag.
So we need a completely different attribute for SCSI's permanent write
streams? You'd mentioned earlier you were okay with having SCSI be able
to utilize per-io raw block write hints. Having multiple things to
check for what are all just write classifiers seems unnecessarily
complicated.
> - but the struct file (or maybe inode) gets a supported flag, as stream
> separation needs to be supported by the file system
> - a separate fcntl is used to set per-inode streams (if you care about
> that, it seems like the bdev use case focuses on per-I/O). In that case
> we'd probably also need a separate inode field for them, or a somewhat
> complicated scheme to decide what is stored in the inode field if there
> is only one.
No need to create a new fcntl. The people already testing this are
successfully using FDP with the existing fcntl hints. Their applications
leverage FDP as a way to separate files based on expected lifetime. It is
how they want to use it and it is working above expectations.
> - for block devices bdev/fops.c maps the temperature hints into write
> streams if write streams are supported, any user that mixes and
> matches write streams and temperature hints gets what they deserve
That's fine. This patch series pretty much accomplishes that part.
> - this could also be a helper for file systems that want to do the
> same.
>
> Just a quick writeup while I'm on the run, there's probably a hole or
> two that could be poked into it.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-29 16:22 ` Keith Busch
@ 2024-10-30 4:55 ` Christoph Hellwig
2024-10-30 15:41 ` Keith Busch
2024-10-30 16:59 ` Bart Van Assche
0 siblings, 2 replies; 106+ messages in thread
From: Christoph Hellwig @ 2024-10-30 4:55 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
bvanassche, Hannes Reinecke
On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> On Tue, Oct 29, 2024 at 04:53:30PM +0100, Christoph Hellwig wrote:
> > On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> > > They're not exposed as write streams. Patch 7/9 sets the feature if it
> > > is a placement id or not, and only nvme sets it, so scsi's attributes
> > > are not claiming to be a write stream.
> >
> > So it shows up in sysfs, but:
> >
> > - queue_max_write_hints (which really should be queue_max_write_streams)
> > still picks it up, and from there the statx interface
> >
> > - per-inode fcntl hints that encode a temperature still magically
> > get dumped into the write streams if they are set.
> >
> > In other words it's a really leaky half-baked abstraction.
>
> Exactly why I asked last time: "who uses it and how do you want them to
> use it" :)
For the temperature hints the only public user I know of is rocksdb, and
that only started working when Hans fixed a brown paperbag bug in the
rocksdb code a while ago. Given that f2fs interprets the hints I suspect
something in the Android world does as well, maybe Bart knows more.
For the separate write streams the usage I want for them is poor man's
zones - e.g. write N LBAs sequentially into a separate write stream
and then eventually discard them together. This will fit nicely into
f2fs and the pending xfs work as well as quite a few userspace storage
systems. For that the file system or application needs to query
the number of available write streams (and in the bitmap world their
numbers if they are discontiguous) and the size you can fit into the
"reclaim unit" in FDP terms. I've not been bothering you much with
the latter as it is an easy retrofit once the I/O path bits land.
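As a rough userspace illustration of that pattern with what exists
today (this uses the per-file F_SET_RW_HINT temperature hint rather
than an explicit per-I/O stream id, and RU_SIZE is a made-up reclaim
unit size), filling a unit sequentially and later dropping it in one
go looks something like:

#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>			/* BLKDISCARD */

#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT		1036	/* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT	2
#endif

#define RU_SIZE		(1ULL << 30)	/* assumed reclaim unit size */
#define CHUNK		(1UL << 20)

static int fill_and_free_unit(int fd, uint64_t start, const char *buf)
{
	uint64_t hint = RWH_WRITE_LIFE_SHORT;
	uint64_t range[2] = { start, RU_SIZE };
	uint64_t off;

	/* all writes through this fd share one placement hint */
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		return -1;

	/* fill the unit strictly sequentially */
	for (off = start; off < start + RU_SIZE; off += CHUNK)
		if (pwrite(fd, buf, CHUNK, off) != CHUNK)
			return -1;

	/* ... much later, once the data is obsolete: drop the whole unit */
	return ioctl(fd, BLKDISCARD, range);
}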
> > Let's brainstorm how it could be done better:
> >
> > - the max_write_streams value is only set by block devices that actually
> > do support write streams, and not for the fire-and-forget temperature
> > hints. The way this is queried is by having a non-zero value
> > there, no need for an extra flag.
>
> So we need a completely different attribute for SCSI's permanent write
> streams? You'd mentioned earlier you were okay with having SCSI be able
> to utilized per-io raw block write hints. Having multiple things to
> check for what are all just write classifiers seems unnecessarily
> complicated.
I don't think the multiple write streams interface applies to SCSI's
write streams, as they enforce a relative temperature, and they don't
have the concept of how much you can write into a "reclaim unit".
OTOH there isn't much you need to query for them anyway, as the
temperature hints have always been defined as pure hints with all
the up- and downsides of that.
> No need to create a new fcntl. The people already testing this are
> successfully using FDP with the existing fcntl hints. Their applications
> leverage FDP as way to separate files based on expected lifetime. It is
> how they want to use it and it is working above expectations.
FYI, I think it's always fine and easy to map the temperature hints to
write streams if that's all the driver offers. It loses a lot of the
capabilities, but as long as it doesn't enforce a lower level interface
that never exposes more, that's fine.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-30 4:55 ` Christoph Hellwig
@ 2024-10-30 15:41 ` Keith Busch
2024-10-30 15:45 ` Christoph Hellwig
2024-10-30 16:59 ` Bart Van Assche
1 sibling, 1 reply; 106+ messages in thread
From: Keith Busch @ 2024-10-30 15:41 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke
On Wed, Oct 30, 2024 at 05:55:26AM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
>
> > No need to create a new fcntl. The people already testing this are
> > successfully using FDP with the existing fcntl hints. Their applications
> > leverage FDP as a way to separate files based on expected lifetime. It is
> > how they want to use it and it is working above expectations.
>
> FYI, I think it's always fine and easy to map the temperature hints to
> write streams if that's all the driver offers. It loses a lot of the
> capabilities, but as long as it doesn't enforce a lower level interface
> that never exposes more, that's fine.
But that's just the v2 from this sequence:
https://lore.kernel.org/linux-nvme/[email protected]/
If you're okay with it now, then let's just go with that and I'm happy
continue iterating on the rest separately.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-30 15:41 ` Keith Busch
@ 2024-10-30 15:45 ` Christoph Hellwig
2024-10-30 15:48 ` Keith Busch
0 siblings, 1 reply; 106+ messages in thread
From: Christoph Hellwig @ 2024-10-30 15:45 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
bvanassche, Hannes Reinecke
On Wed, Oct 30, 2024 at 09:41:39AM -0600, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 05:55:26AM +0100, Christoph Hellwig wrote:
> > On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> >
> > > No need to create a new fcntl. The people already testing this are
> > > successfully using FDP with the existing fcntl hints. Their applications
> > > leverage FDP as a way to separate files based on expected lifetime. It is
> > > how they want to use it and it is working above expectations.
> >
> > FYI, I think it's always fine and easy to map the temperature hints to
> > write streams if that's all the driver offers. It loses a lot of the
> > capabilities, but as long as it doesn't enforce a lower level interface
> > that never exposes more, that's fine.
>
> But that's just the v2 from this sequence:
>
> https://lore.kernel.org/linux-nvme/[email protected]/
>
> If you're okay with it now, then let's just go with that and I'm happy
> continue iterating on the rest separately.
That's exactly what I do not want - it takes the temperature hints
and forces them into the write streams down in the driver with no
way to make actually useful use of the stream separation.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-30 15:45 ` Christoph Hellwig
@ 2024-10-30 15:48 ` Keith Busch
2024-10-30 15:50 ` Christoph Hellwig
0 siblings, 1 reply; 106+ messages in thread
From: Keith Busch @ 2024-10-30 15:48 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke
On Wed, Oct 30, 2024 at 04:45:56PM +0100, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 09:41:39AM -0600, Keith Busch wrote:
> > On Wed, Oct 30, 2024 at 05:55:26AM +0100, Christoph Hellwig wrote:
> > > On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> > >
> > > > No need to create a new fcntl. The people already testing this are
> > > > successfully using FDP with the existing fcntl hints. Their applications
> > > > leverage FDP as a way to separate files based on expected lifetime. It is
> > > > how they want to use it and it is working above expectations.
> > >
> > > FYI, I think it's always fine and easy to map the temperature hints to
> > > write streams if that's all the driver offers. It loses a lot of the
> > > capabilities, but as long as it doesn't enforce a lower level interface
> > > that never exposes more, that's fine.
> >
> > But that's just the v2 from this sequence:
> >
> > https://lore.kernel.org/linux-nvme/[email protected]/
> >
> > If you're okay with it now, then let's just go with that and I'm happy
> > continue iterating on the rest separately.
>
> That's exactly what I do not want - it takes the temperature hints
> and forces them into the write streams down in the driver
What??? You said to map the temperature hints to a write stream. The
driver offers that here. But you specifically don't want that? I'm so
confused.
> with no way to make actually useful use of the stream separation.
Have you tried it? The people who actually do easily demonstrate that it
is in fact very useful.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-30 15:48 ` Keith Busch
@ 2024-10-30 15:50 ` Christoph Hellwig
2024-10-30 16:42 ` Keith Busch
0 siblings, 1 reply; 106+ messages in thread
From: Christoph Hellwig @ 2024-10-30 15:50 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
bvanassche, Hannes Reinecke
On Wed, Oct 30, 2024 at 09:48:39AM -0600, Keith Busch wrote:
> What??? You said to map the temperature hints to a write stream. The
> driver offers that here. But you specifically don't want that? I'm so
> confused.
In bdev/fops.c (or file systems if they want to do that), not down in the
driver, forced down everyone's throat. Which was the whole point of the
discussion that we're running in circles here.
> > with no way to make actually useful use of the stream separation.
>
> Have you tried it? The people who actually do easily demonstrate that it
> is in fact very useful.
While I've read the claim multiple times, I've not actually seen any
numbers.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-30 15:50 ` Christoph Hellwig
@ 2024-10-30 16:42 ` Keith Busch
2024-10-30 16:57 ` Christoph Hellwig
0 siblings, 1 reply; 106+ messages in thread
From: Keith Busch @ 2024-10-30 16:42 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke
On Wed, Oct 30, 2024 at 04:50:52PM +0100, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 09:48:39AM -0600, Keith Busch wrote:
> > What??? You said to map the temperature hints to a write stream. The
> > driver offers that here. But you specifically don't want that? I'm so
> > confused.
>
> In bdev/fops.c (or file systems if they want to do that), not down in the
> driver, forced down everyone's throat. Which was the whole point of the
> discussion that we're running in circles here.
That makes no sense. A change completely isolated to a driver isn't
forcing anything on anyone. It's the upper layers that are forcing this
down, whether the driver uses it or not: the hints are already getting
to the driver, but the driver currently doesn't use them. Finding a way to
use them is not some force to be demonized...
> > > with no way to make actually useful use of the stream separation.
> >
> > Have you tried it? The people who actually do easily demonstrate that it
> > is in fact very useful.
>
> While I've read the claim multiple times, I've not actually seen any
> numbers.
Here's something recent from rocksdb developers running the YCSB
workload A benchmark. The filesystem used is XFS.
It sets temperature hints for different SST levels, which already
happens today. The last data point made some minor changes to the
level-to-hint mapping.
Without FDP:
WAF: 2.72
IOPS: 1465
READ LAT: 2681us
UPDATE LAT: 3115us
With FDP (rocksdb unmodified):
WAF: 2.26
IOPS: 1473
READ LAT: 2415us
UPDATE LAT: 2807us
With FDP (with some minor rocksdb changes):
WAF: 1.67
IOPS: 1547
READ LAT: 1978us
UPDATE LAT: 2267us
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-30 16:42 ` Keith Busch
@ 2024-10-30 16:57 ` Christoph Hellwig
2024-10-30 17:05 ` Keith Busch
` (2 more replies)
0 siblings, 3 replies; 106+ messages in thread
From: Christoph Hellwig @ 2024-10-30 16:57 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
bvanassche, Hannes Reinecke
On Wed, Oct 30, 2024 at 10:42:59AM -0600, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 04:50:52PM +0100, Christoph Hellwig wrote:
> > On Wed, Oct 30, 2024 at 09:48:39AM -0600, Keith Busch wrote:
> > > What??? You said to map the temperature hints to a write stream. The
> > > driver offers that here. But you specifically don't want that? I'm so
> > > confused.
> >
> > In bdev/fops.c (or file systems if they want to do that) not down in the
> > driver forced down everyones throat. Which was the whole point of the
> > discussion that we're running in circles here.
>
> That makes no sense. A change completely isolated to a driver isn't
> forcing anything on anyone. It's the upper layers that are forcing this
> down, whether the driver uses it or not: the hints are already getting
> to the driver, but the driver currently doesn't use them.
And once it is used by default, taking it away will have someone scream
regression, because we'd be taking it away from that super special
use case.
> Here's something recent from rocksdb developers running the YCSB
> workload A benchmark. The filesystem used is XFS.
Thanks for finally putting something up.
> It sets temperature hints for different SST levels, which already
> happens today. The last data point made some minor changes to the
> level-to-hint mapping.
Do you have a pointer to the changes?
> Without FDP:
>
> WAF: 2.72
> IOPS: 1465
> READ LAT: 2681us
> UPDATE LAT: 3115us
>
> With FDP (rocksdb unmodified):
>
> WAF: 2.26
> IOPS: 1473
> READ LAT: 2415us
> UPDATE LAT: 2807us
>
> With FDP (with some minor rocksdb changes):
>
> WAF: 1.67
> IOPS: 1547
> READ LAT: 1978us
> UPDATE LAT: 2267us
Compared to the numbers Hans presented at Plumbers for the zoned XFS code,
which should work just fine with FDP IFF we exposed real write streams,
and which roughly double read and write IOPS and reduce the WAF to almost
1, this doesn't look too spectacular to be honest, but it sure is something.
I just wish we could get the real infrastructure instead of some band
aid, which makes it really hard to expose the real thing because now
it's been taken up and directly wired to a UAPI.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-30 16:57 ` Christoph Hellwig
@ 2024-10-30 17:05 ` Keith Busch
2024-10-30 17:15 ` Christoph Hellwig
2024-10-30 17:23 ` Keith Busch
2024-10-30 22:32 ` Keith Busch
2 siblings, 1 reply; 106+ messages in thread
From: Keith Busch @ 2024-10-30 17:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke
On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> And once it is used by default, taking it away will have someone scream
> regression, because we'd be taking it away from that super special
> use case.
Refusing to allow something because someone might find it useful has got
to be the worst reasoning I've heard. :)
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-30 17:05 ` Keith Busch
@ 2024-10-30 17:15 ` Christoph Hellwig
0 siblings, 0 replies; 106+ messages in thread
From: Christoph Hellwig @ 2024-10-30 17:15 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
bvanassche, Hannes Reinecke
On Wed, Oct 30, 2024 at 11:05:03AM -0600, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> > And once it is used by default, taking it away will have someone scream
> > regression, because we'd be taking it away from that super special
> > use case.
>
> Refusing to allow something because someone might find it useful has got
> to be the worst reasoning I've heard. :)
That's not the point. The point is that by locking us in we can't actually
do the proper thing. And that's what I'm really worried about.
Especially with the not-all-that-great numbers in the success story.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-30 16:57 ` Christoph Hellwig
2024-10-30 17:05 ` Keith Busch
@ 2024-10-30 17:23 ` Keith Busch
2024-10-30 22:32 ` Keith Busch
2 siblings, 0 replies; 106+ messages in thread
From: Keith Busch @ 2024-10-30 17:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke
On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> > On Wed, Oct 30, 2024 at 04:50:52PM +0100, Christoph Hellwig wrote:
>
> > It sets temperature hints for different SST levels, which already
> > happens today. The last data point made some minor changes to the
> > level-to-hint mapping.
>
> Do you have a pointer to the changes?
The change moves levels 2 and 3 to "MEDIUM" (along with 0 and 1 already
there), 4 to "LONG", and >= 5 remain "EXTREME".
WAL continues to be "SHORT", as before.
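Expressed as a trivial mapping function (illustrative only - rocksdb
itself is C++ with its own hint enum; the kernel's WRITE_LIFE_* names
are just used here for brevity):

#include <linux/rw_hint.h>

/* SST compaction level -> lifetime hint, per the change described above */
static enum rw_hint sst_level_to_hint(int level)
{
	if (level <= 3)
		return WRITE_LIFE_MEDIUM;	/* levels 0-3 */
	if (level == 4)
		return WRITE_LIFE_LONG;
	return WRITE_LIFE_EXTREME;		/* levels >= 5 */
}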
> > Without FDP:
> >
> > WAF: 2.72
> > IOPS: 1465
> > READ LAT: 2681us
> > UPDATE LAT: 3115us
> >
> > With FDP (rocksdb unmodified):
> >
> > WAF: 2.26
> > IOPS: 1473
> > READ LAT: 2415us
> > UPDATE LAT: 2807us
> >
> > With FDP (with some minor rocksdb changes):
> >
> > WAF: 1.67
> > IOPS: 1547
> > READ LAT: 1978us
> > UPDATE LAT: 2267us
>
> Compared to the numbers Hans presented at Plumbers for the zoned XFS code,
> which should work just fine with FDP IFF we exposed real write streams,
> and which roughly double read and write IOPS and reduce the WAF to almost
> 1, this doesn't look too spectacular to be honest, but it sure is something.
>
> I just wish we could get the real infrastructure instead of some band
> aid, which makes it really hard to expose the real thing because now
> it's been taken up and directly wired to a UAPI.
This doesn't have to be the end placement streams development. I
fundamentally disagree that this locks anyone in to anything.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-30 16:57 ` Christoph Hellwig
2024-10-30 17:05 ` Keith Busch
2024-10-30 17:23 ` Keith Busch
@ 2024-10-30 22:32 ` Keith Busch
2024-10-31 8:19 ` Hans Holmberg
2 siblings, 1 reply; 106+ messages in thread
From: Keith Busch @ 2024-10-30 22:32 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
linux-fsdevel, joshi.k, javier.gonz, bvanassche, Hannes Reinecke
On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 10:42:59AM -0600, Keith Busch wrote:
> > With FDP (with some minor rocksdb changes):
> >
> > WAF: 1.67
> > IOPS: 1547
> > READ LAT: 1978us
> > UPDATE LAT: 2267us
>
> > Compared to the numbers Hans presented at Plumbers for the zoned XFS code,
> > which should work just fine with FDP IFF we exposed real write streams,
> > and which roughly double read and write IOPS and reduce the WAF to almost
> > 1, this doesn't look too spectacular to be honest, but it sure is something.
Hold up... I absolutely appreciate the work Hans is and has done. But
are you talking about this talk?
https://lpc.events/event/18/contributions/1822/attachments/1464/3105/Zoned%20XFS%20LPC%20Zoned%20MC%202024%20V1.pdf
That is very much apples-to-oranges. The B+ tree metadata isn't on the
same device being evaluated for WAF, while this setup has all of that
mixed in. I think the results are pretty good, all things considered.
> I just wish we could get the real infrastructure instead of some band
> aid, which makes it really hard to expose the real thing because now
> it's been taken up and directly wired to a UAPI.
I don't know what make of this. I think we're talking past each other.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-30 22:32 ` Keith Busch
@ 2024-10-31 8:19 ` Hans Holmberg
2024-10-31 13:02 ` Christoph Hellwig
2024-10-31 14:06 ` Keith Busch
0 siblings, 2 replies; 106+ messages in thread
From: Hans Holmberg @ 2024-10-31 8:19 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
bvanassche, Hannes Reinecke
On Wed, Oct 30, 2024 at 11:33 PM Keith Busch <[email protected]> wrote:
>
> On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> > On Wed, Oct 30, 2024 at 10:42:59AM -0600, Keith Busch wrote:
> > > With FDP (with some minor rocksdb changes):
> > >
> > > WAF: 1.67
> > > IOPS: 1547
> > > READ LAT: 1978us
> > > UPDATE LAT: 2267us
> >
> > > Compared to the numbers Hans presented at Plumbers for the zoned XFS code,
> > > which should work just fine with FDP IFF we exposed real write streams,
> > > and which roughly double read and write IOPS and reduce the WAF to almost
> > > 1, this doesn't look too spectacular to be honest, but it sure is something.
>
> Hold up... I absolutely appreciate the work Hans is and has done. But
> are you talking about this talk?
>
> https://lpc.events/event/18/contributions/1822/attachments/1464/3105/Zoned%20XFS%20LPC%20Zoned%20MC%202024%20V1.pdf
>
> That is very much apples-to-oranges. The B+ isn't on the same device
> being evaluated for WAF, where this has all that mixed in. I think the
> results are pretty good, all things considered.
No. The metadata IO is just 0.1% of all writes, so the fact that we use a
separate device for it in the benchmark really does not matter.
Since we can achieve a WAF of ~1 for RocksDB on flash, why should we
be content with another 67% of unwanted device-side writes on top of
that?
It's of course impossible to compare your benchmark figures and mine
directly since we are using different devices, but hey, we definitely
have an opportunity here to make significant gains for FDP if we just
provide the right kernel interfaces.
Why shouldn't we expose the hardware in a way that enables the users
to make the most out of it?
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-31 8:19 ` Hans Holmberg
@ 2024-10-31 13:02 ` Christoph Hellwig
2024-10-31 14:06 ` Keith Busch
1 sibling, 0 replies; 106+ messages in thread
From: Christoph Hellwig @ 2024-10-31 13:02 UTC (permalink / raw)
To: Hans Holmberg
Cc: Keith Busch, Christoph Hellwig, Keith Busch, linux-block,
linux-nvme, linux-scsi, io-uring, linux-fsdevel, joshi.k,
javier.gonz, bvanassche, Hannes Reinecke
On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote:
> No. The meta data IO is just 0.1% of all writes, so that we use a
> separate device for that in the benchmark really does not matter.
>
> Since we can achieve a WAF of ~1 for RocksDB on flash, why should we
> be content with another 67% of unwanted device side writes on top of
> that?
>
> It's of course impossible to compare your benchmark figures and mine
> directly since we are using different devices, but hey, we definitely
> have an opportunity here to make significant gains for FDP if we just
> provide the right kernel interfaces.
I'll write code to do a 1:1 single device comparism over the weekend
and Hans will test it once he is back.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-31 8:19 ` Hans Holmberg
2024-10-31 13:02 ` Christoph Hellwig
@ 2024-10-31 14:06 ` Keith Busch
2024-11-01 7:16 ` Hans Holmberg
1 sibling, 1 reply; 106+ messages in thread
From: Keith Busch @ 2024-10-31 14:06 UTC (permalink / raw)
To: Hans Holmberg
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
bvanassche, Hannes Reinecke
On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote:
> On Wed, Oct 30, 2024 at 11:33 PM Keith Busch <[email protected]> wrote:
> > That is very much apples-to-oranges. The B+ isn't on the same device
> > being evaluated for WAF, where this has all that mixed in. I think the
> > results are pretty good, all things considered.
>
> No. The meta data IO is just 0.1% of all writes, so that we use a
> separate device for that in the benchmark really does not matter.
It's very little spatially, but they overwrite differently than other
data, creating many small holes in large erase blocks.
> Since we can achieve a WAF of ~1 for RocksDB on flash, why should we
> be content with another 67% of unwanted device side writes on top of
> that?
>
> It's of course impossible to compare your benchmark figures and mine
> directly since we are using different devices, but hey, we definitely
> have an opportunity here to make significant gains for FDP if we just
> provide the right kernel interfaces.
>
> Why shouldn't we expose the hardware in a way that enables the users
> to make the most out of it?
Because the people using this want this interface. Stalling for the last
6 months hasn't produced anything better; appealing to non-existent
vaporware to block something ready-to-go that satisfies a need right
now is just wasting everyone's time.
Again, I absolutely disagree that this locks anyone in to anything.
That's an overly dramatic excuse.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-31 14:06 ` Keith Busch
@ 2024-11-01 7:16 ` Hans Holmberg
2024-11-01 8:19 ` Javier González
2024-11-01 14:49 ` Keith Busch
0 siblings, 2 replies; 106+ messages in thread
From: Hans Holmberg @ 2024-11-01 7:16 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
bvanassche, Hannes Reinecke
On Thu, Oct 31, 2024 at 3:06 PM Keith Busch <[email protected]> wrote:
>
> On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote:
> > On Wed, Oct 30, 2024 at 11:33 PM Keith Busch <[email protected]> wrote:
> > > That is very much apples-to-oranges. The B+ isn't on the same device
> > > being evaluated for WAF, where this has all that mixed in. I think the
> > > results are pretty good, all things considered.
> >
> > No. The meta data IO is just 0.1% of all writes, so that we use a
> > separate device for that in the benchmark really does not matter.
>
> It's very little spatially, but they overwrite differently than other
> data, creating many small holes in large erase blocks.
I don't really get how this could influence anything significantly.(If at all).
>
> > Since we can achieve a WAF of ~1 for RocksDB on flash, why should we
> > be content with another 67% of unwanted device side writes on top of
> > that?
> >
> > It's of course impossible to compare your benchmark figures and mine
> > directly since we are using different devices, but hey, we definitely
> > have an opportunity here to make significant gains for FDP if we just
> > provide the right kernel interfaces.
> >
> > Why shouldn't we expose the hardware in a way that enables the users
> > to make the most out of it?
>
> Because the people using this want this interface. Stalling for the last
> 6 months hasn't produced anything better, appealing to non-existent
> vaporware to block something ready-to-go that satisfies a need right
> now is just wasting everyone's time.
>
> Again, I absolutely disagree that this locks anyone in to anything.
> That's an overly dramatic excuse.
Locking in or not, to constructively move things forward (if we are
now stuck on how to wire up fs support) I believe it would be
worthwhile to prototype active fdp data placement in xfs and evaluate
it. Happy to help out with that.
Fdp and zns are different beasts, so I don't expect the results in the
presentation to be directly translatable but we can see what we can
do.
Is RocksDB the only file system user at the moment?
Is the benchmark setup/config something that could be shared?
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-11-01 7:16 ` Hans Holmberg
@ 2024-11-01 8:19 ` Javier González
2024-11-01 14:49 ` Keith Busch
1 sibling, 0 replies; 106+ messages in thread
From: Javier González @ 2024-11-01 8:19 UTC (permalink / raw)
To: Hans Holmberg
Cc: Keith Busch, Christoph Hellwig, Keith Busch, linux-block,
linux-nvme, linux-scsi, io-uring, linux-fsdevel, joshi.k,
bvanassche, Hannes Reinecke
On 01.11.2024 08:16, Hans Holmberg wrote:
>Locking in or not, to constructively move things forward (if we are
>now stuck on how to wire up fs support) I believe it would be
>worthwhile to prototype active fdp data placement in xfs and evaluate
>it. Happy to help out with that.
I appreciate your willingness to move things forward. I really mean it.
I have talked several times in this thread about collaborating on the
API that you have in mind. I would _very_ much like to have a common
abstraction for ZNS, ZUFS, FDP, and whatever people build on other
protocols. But without tangible patches showing this, we simply cannot
block this anymore.
>
>Fdp and zns are different beasts, so I don't expect the results in the
>presentation to be directly translatable but we can see what we can
>do.
>
>Is RocksDB the only file system user at the moment?
>Is the benchmark setup/config something that could be shared?
It is a YCSB workload. You have the scripts here:
https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloada
If you have another standard workload you want us to run, let me know and
we will post the results in the list too.
We will post the changes to the L3 placement in RocksDB. I think we can
make them available somewhere for you to test before that. Let me come
back to you on this.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-11-01 7:16 ` Hans Holmberg
2024-11-01 8:19 ` Javier González
@ 2024-11-01 14:49 ` Keith Busch
2024-11-06 14:26 ` Hans Holmberg
1 sibling, 1 reply; 106+ messages in thread
From: Keith Busch @ 2024-11-01 14:49 UTC (permalink / raw)
To: Hans Holmberg
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
bvanassche, Hannes Reinecke
On Fri, Nov 01, 2024 at 08:16:30AM +0100, Hans Holmberg wrote:
> On Thu, Oct 31, 2024 at 3:06 PM Keith Busch <[email protected]> wrote:
> > On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote:
> > > No. The meta data IO is just 0.1% of all writes, so that we use a
> > > separate device for that in the benchmark really does not matter.
> >
> > It's very little spatially, but they overwrite differently than other
> > data, creating many small holes in large erase blocks.
>
> I don't really get how this could influence anything significantly.(If at all).
Fill your filesystem to near capacity, then continue using it for a few
months. While the filesystem will report some available space, there
may not be many good blocks available to erase. Maybe.
> > Again, I absolutely disagree that this locks anyone in to anything.
> > That's an overly dramatic excuse.
>
> Locking in or not, to constructively move things forward (if we are
> now stuck on how to wire up fs support)
But we're not stuck on how to wire it up to the fs. That part was settled
and in the kernel 10 years ago. We're stuck on wiring it down to the driver,
which should have been the easiest part.
> I believe it would be worthwhile to prototype active fdp data
> placement in xfs and evaluate it. Happy to help out with that.
When are we allowed to conclude evaluation? We have benefits my
customers want on well-tested kernels, and wish to proceed now.
I'm not discouraging anyone from continuing further prototypes,
innovations, and improvements. I'd like to spend more time doing that
too, and merging something incrementally better doesn't prevent anyone
from doing that.
> Fdp and zns are different beasts, so I don't expect the results in the
> presentation to be directly translatable but we can see what we can
> do.
>
> Is RocksDB the only file system user at the moment?
RocksDB is the only open source one I know about. There are proprietary
users, too.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-11-01 14:49 ` Keith Busch
@ 2024-11-06 14:26 ` Hans Holmberg
0 siblings, 0 replies; 106+ messages in thread
From: Hans Holmberg @ 2024-11-06 14:26 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
bvanassche, Hannes Reinecke
On Fri, Nov 1, 2024 at 3:49 PM Keith Busch <[email protected]> wrote:
>
> On Fri, Nov 01, 2024 at 08:16:30AM +0100, Hans Holmberg wrote:
> > On Thu, Oct 31, 2024 at 3:06 PM Keith Busch <[email protected]> wrote:
> > > On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote:
> > > > No. The meta data IO is just 0.1% of all writes, so that we use a
> > > > separate device for that in the benchmark really does not matter.
> > >
> > > It's very little spatially, but they overwrite differently than other
> > > data, creating many small holes in large erase blocks.
> >
> > I don't really get how this could influence anything significantly.(If at all).
>
> Fill your filesystem to near capacity, then continue using it for a few
> months. While the filesystem will report some available space, there
> may not be many good blocks available to erase. Maybe.
For *this* benchmark workload, the metadata IO is such a tiny fraction
that I doubt the impact on write amplification could be measured.
I completely agree it's a good idea to separate metadata from data
blocks in general. It is actually a good reason for letting the file
system control write stream allocation for all blocks :)
> > I believe it would be worthwhile to prototype active fdp data
> > placement in xfs and evaluate it. Happy to help out with that.
>
> When are we allowed to conclude evaluation? We have benefits my
> customers want on well tested kernels, and wish to proceed now.
Christoph has now wired up prototype support for FDP on top of the
xfs-rt-zoned work + this patch set, and I have had time to look over
it and started doing some testing on HW.
In addition to the FDP support, metadata can also be stored on the
same block device as the data.
Now that all placement handles are available, we can use the full data
separation capabilities of the underlying storage, so that's good.
We can map the placement handles to different write streams much
like we assign open zones for zoned storage, and this opens up support
for data placement heuristics for a wider range of use cases (not
just the RocksDB use case discussed here).
The big pieces that are missing from the FDP plumbing, as I see it, are
the ability to read the reclaim unit size and syncing up the remaining
capacity of the placement units with the file system allocation
groups, but I guess that can be added later.
I've started benchmarking on the hardware I have at hand, iterating on
a good workload configuration. It will take some time to get to some
robust write amp measurements since the drives are very big and
require a painfully long warmup time.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-30 4:55 ` Christoph Hellwig
2024-10-30 15:41 ` Keith Busch
@ 2024-10-30 16:59 ` Bart Van Assche
2024-10-30 17:14 ` Christoph Hellwig
1 sibling, 1 reply; 106+ messages in thread
From: Bart Van Assche @ 2024-10-30 16:59 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
linux-fsdevel, joshi.k, javier.gonz, Hannes Reinecke,
Martin K . Petersen
On 10/29/24 9:55 PM, Christoph Hellwig wrote:
> For the temperature hints the only public user I know of is rocksdb, and
> that only started working when Hans fixed a brown paperbag bug in the
> rocksdb code a while ago. Given that f2fs interprets the hints I suspect
> something in the Android world does as well, maybe Bart knows more.
UFS devices typically have less internal memory (SRAM) than the size of
a single zone. Hence, it helps UFS devices if it can be decided at the
time a write command is received where to send the data (SRAM, SLC NAND
or TLC NAND). This is why UFS vendors asked to provide data lifetime
information to zoned logical units. More information about UFS device
internals is available in this paper: Hwang, Joo-Young, Seokhwan Kim,
Daejun Park, Yong-Gil Song, Junyoung Han, Seunghyun Choi, Sangyeun Cho,
and Youjip Won. "ZMS: Zone Abstraction for Mobile Flash Storage." In
2024 USENIX Annual Technical Conference (USENIX ATC 24), pp. 173-189.
2024 (https://www.usenix.org/system/files/atc24-hwang.pdf).
Bart.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-30 16:59 ` Bart Van Assche
@ 2024-10-30 17:14 ` Christoph Hellwig
2024-10-30 17:44 ` Bart Van Assche
0 siblings, 1 reply; 106+ messages in thread
From: Christoph Hellwig @ 2024-10-30 17:14 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Keith Busch, Keith Busch, linux-block,
linux-nvme, linux-scsi, io-uring, linux-fsdevel, joshi.k,
javier.gonz, Hannes Reinecke, Martin K . Petersen
On Wed, Oct 30, 2024 at 09:59:24AM -0700, Bart Van Assche wrote:
>
> On 10/29/24 9:55 PM, Christoph Hellwig wrote:
>> For the temperature hints the only public user I know of is rocksdb, and
>> that only started working when Hans fixed a brown paperbag bug in the
>> rocksdb code a while ago. Given that f2fs interprets the hints I suspect
>> something in the Android world does as well, maybe Bart knows more.
>
> UFS devices typically have less internal memory (SRAM) than the size of a
> single zone.
That wasn't quite the question. Do you know what application in Android
sets the fcntl temperature hints?
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-30 17:14 ` Christoph Hellwig
@ 2024-10-30 17:44 ` Bart Van Assche
2024-11-01 1:03 ` Jaegeuk Kim
0 siblings, 1 reply; 106+ messages in thread
From: Bart Van Assche @ 2024-10-30 17:44 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Keith Busch, linux-block, linux-nvme, linux-scsi,
io-uring, linux-fsdevel, joshi.k, javier.gonz, Hannes Reinecke,
Martin K . Petersen, Jaegeuk Kim
On 10/30/24 10:14 AM, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 09:59:24AM -0700, Bart Van Assche wrote:
>>
>> On 10/29/24 9:55 PM, Christoph Hellwig wrote:
>>> For the temperature hints the only public user I know of is rocksdb, and
>>> that only started working when Hans fixed a brown paperbag bug in the
>>> rocksdb code a while ago. Given that f2fs interprets the hints I suspect
>>> something in the Android world does as well, maybe Bart knows more.
>>
>> UFS devices typically have less internal memory (SRAM) than the size of a
>> single zone.
>
> That wasn't quite the question. Do you know what application in Android
> sets the fcntl temperature hints?
I do not know whether there are any Android apps that use the
F_SET_(FILE_|)RW_HINT fcntls.
The only use case in Android platform code I know of is this one: Daejun
Park, "f2fs-tools: add write hint support", f2fs-dev mailing list,
September 2024
(https://lore.kernel.org/all/20240904011217epcms2p5a1b15db8e0ae50884429da7be4af4de4@epcms2p5/T/).
As you probably know f2fs-tools is a software package that includes
fsck.f2fs.
Jaegeuk, please correct me if necessary.
Bart.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-30 17:44 ` Bart Van Assche
@ 2024-11-01 1:03 ` Jaegeuk Kim
0 siblings, 0 replies; 106+ messages in thread
From: Jaegeuk Kim @ 2024-11-01 1:03 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Keith Busch, Keith Busch, linux-block,
linux-nvme, linux-scsi, io-uring, linux-fsdevel, joshi.k,
javier.gonz, Hannes Reinecke, Martin K . Petersen
On 10/30, Bart Van Assche wrote:
> On 10/30/24 10:14 AM, Christoph Hellwig wrote:
> > On Wed, Oct 30, 2024 at 09:59:24AM -0700, Bart Van Assche wrote:
> > >
> > > On 10/29/24 9:55 PM, Christoph Hellwig wrote:
> > > > For the temperature hints the only public user I know of is rocksdb, and
> > > > that only started working when Hans fixed a brown paperbag bug in the
> > > > rocksdb code a while ago. Given that f2fs interprets the hints I suspect
> > > > something in the Android world does as well, maybe Bart knows more.
> > >
> > > UFS devices typically have less internal memory (SRAM) than the size of a
> > > single zone.
> >
> > That wasn't quite the question. Do you know what application in Android
> > sets the fcntl temperature hints?
>
> I do not know whether there are any Android apps that use the
> F_SET_(FILE_|)RW_HINT fcntls.
>
> The only use case in Android platform code I know of is this one: Daejun
> Park, "f2fs-tools: add write hint support", f2fs-dev mailing list,
> September 2024 (https://lore.kernel.org/all/20240904011217epcms2p5a1b15db8e0ae50884429da7be4af4de4@epcms2p5/T/).
> As you probably know f2fs-tools is a software package that includes
> fsck.f2fs.
>
> Jaegeuk, please correct me if necessary.
Yes, f2fs-tools in Android calls fcntl(fd, F_SET_RW_HINT, &hint);
>
> Bart.
>
>
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-29 15:26 ` Christoph Hellwig
2024-10-29 15:34 ` Keith Busch
@ 2024-10-29 17:18 ` Bart Van Assche
2024-10-30 5:42 ` Christoph Hellwig
1 sibling, 1 reply; 106+ messages in thread
From: Bart Van Assche @ 2024-10-29 17:18 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch
Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel,
joshi.k, javier.gonz, Keith Busch, Hannes Reinecke
On 10/29/24 8:26 AM, Christoph Hellwig wrote:
> Bart, btw: I think the current sd implementation is buggy as well, as
> it assumes the permanent streams are ordered by their data temperature
> in the IO Advise hints mode page, but I can't find anything in the
> spec that requires a particular ordering.
How about modifying sd_read_io_hints() such that permanent stream
information is ignored if the order of the RELATIVE LIFETIME information
reported by the GET STREAM STATUS command does not match the permanent
stream order?
Thanks,
Bart.
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 41e2dfa2d67d..277035febd82 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3192,7 +3192,12 @@ sd_read_cache_type(struct scsi_disk *sdkp, unsigned char *buffer)
 		sdkp->DPOFUA = 0;
 }
 
-static bool sd_is_perm_stream(struct scsi_disk *sdkp, unsigned int stream_id)
+/*
+ * Returns the relative lifetime of a permanent stream. Returns -1 if the
+ * GET STREAM STATUS command fails or if the stream is not a permanent stream.
+ */
+static int sd_perm_stream_rel_lifetime(struct scsi_disk *sdkp,
+				       unsigned int stream_id)
 {
 	u8 cdb[16] = { SERVICE_ACTION_IN_16, SAI_GET_STREAM_STATUS };
 	struct {
@@ -3212,14 +3217,16 @@ static bool sd_is_perm_stream(struct scsi_disk *sdkp, unsigned int stream_id)
 	res = scsi_execute_cmd(sdev, cdb, REQ_OP_DRV_IN, &buf, sizeof(buf),
 			       SD_TIMEOUT, sdkp->max_retries, &exec_args);
 	if (res < 0)
-		return false;
+		return -1;
 	if (scsi_status_is_check_condition(res) && scsi_sense_valid(&sshdr))
 		sd_print_sense_hdr(sdkp, &sshdr);
 	if (res)
-		return false;
+		return -1;
 	if (get_unaligned_be32(&buf.h.len) < sizeof(struct scsi_stream_status))
-		return false;
-	return buf.h.stream_status[0].perm;
+		return -1;
+	if (!buf.h.stream_status[0].perm)
+		return -1;
+	return buf.h.stream_status[0].rel_lifetime;
 }
 
 static void sd_read_io_hints(struct scsi_disk *sdkp, unsigned char *buffer)
@@ -3247,9 +3254,17 @@ static void sd_read_io_hints(struct scsi_disk *sdkp, unsigned char *buffer)
 	 * should assign the lowest numbered stream identifiers to permanent
 	 * streams.
 	 */
-	for (desc = start; desc < end; desc++)
-		if (!desc->st_enble || !sd_is_perm_stream(sdkp, desc - start))
+	int prev_rel_lifetime = -1;
+	for (desc = start; desc < end; desc++) {
+		int rel_lifetime;
+
+		if (!desc->st_enble)
 			break;
+		rel_lifetime = sd_perm_stream_rel_lifetime(sdkp, desc - start);
+		if (rel_lifetime < 0 || rel_lifetime < prev_rel_lifetime)
+			break;
+		prev_rel_lifetime = rel_lifetime;
+	}
 	permanent_stream_count_old = sdkp->permanent_stream_count;
 	sdkp->permanent_stream_count = desc - start;
 	if (sdkp->rscs && sdkp->permanent_stream_count < 2)
^ permalink raw reply related [flat|nested] 106+ messages in thread
* Re: [PATCHv10 9/9] scsi: set permanent stream count in block limits
2024-10-29 17:18 ` Bart Van Assche
@ 2024-10-30 5:42 ` Christoph Hellwig
0 siblings, 0 replies; 106+ messages in thread
From: Christoph Hellwig @ 2024-10-30 5:42 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
Keith Busch, Hannes Reinecke
On Tue, Oct 29, 2024 at 10:18:31AM -0700, Bart Van Assche wrote:
> On 10/29/24 8:26 AM, Christoph Hellwig wrote:
>> Bart, btw: I think the current sd implementation is buggy as well, as
>> it assumes the permanent streams are ordered by their data temperature
>> in the IO Advise hints mode page, but I can't find anything in the
>> spec that requires a particular ordering.
>
> How about modifying sd_read_io_hints() such that permanent stream
> information is ignored if the order of the RELATIVE LIFETIME information
> reported by the GET STREAM STATUS command does not match the permanent
> stream order?
That seems odd as there is nothing implying that they should be ordered.
The logical thing to do would be to keep a little array mapping the Linux
temperature levels to the stream ids.
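A minimal sketch of that idea (the table and its size are made up here,
this is not actual sd.c code), inside sd_read_io_hints():

	/* hypothetical per-disk table: relative lifetime -> stream id */
	u8 stream_id_for_lifetime[MAX_REL_LIFETIME + 1] = { };

	for (desc = start; desc < end; desc++) {
		int rel_lifetime;

		if (!desc->st_enble)
			break;
		rel_lifetime = sd_perm_stream_rel_lifetime(sdkp, desc - start);
		if (rel_lifetime < 0 || rel_lifetime > MAX_REL_LIFETIME)
			break;
		/* remember which permanent stream serves this temperature */
		stream_id_for_lifetime[rel_lifetime] = desc - start;
	}

The write path would then look up the stream id by hint instead of
assuming the stream ids are ordered by temperature.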
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-10-29 15:19 [PATCHv10 0/9] write hints with nvme fdp, scsi streams Keith Busch
` (8 preceding siblings ...)
2024-10-29 15:19 ` [PATCHv10 9/9] scsi: set permanent stream count in block limits Keith Busch
@ 2024-10-29 15:24 ` Christoph Hellwig
2024-11-05 15:50 ` Christoph Hellwig
10 siblings, 0 replies; 106+ messages in thread
From: Christoph Hellwig @ 2024-10-29 15:24 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
joshi.k, javier.gonz, bvanassche, Keith Busch
On Tue, Oct 29, 2024 at 08:19:13AM -0700, Keith Busch wrote:
> Return invalid value if user requests an invalid write hint
>
> Added and exported a block device feature flag for indicating generic
> placement hint support
But it still talks of write hints everywhere and conflates the write
streams with the temperature hints which are completely different
beasts.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-10-29 15:19 [PATCHv10 0/9] write hints with nvme fdp, scsi streams Keith Busch
` (9 preceding siblings ...)
2024-10-29 15:24 ` [PATCHv10 0/9] write hints with nvme fdp, scsi streams Christoph Hellwig
@ 2024-11-05 15:50 ` Christoph Hellwig
2024-11-06 18:36 ` Keith Busch
2024-11-07 20:36 ` Keith Busch
10 siblings, 2 replies; 106+ messages in thread
From: Christoph Hellwig @ 2024-11-05 15:50 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, linux-nvme, linux-scsi, io-uring, linux-fsdevel, hch,
joshi.k, javier.gonz, bvanassche, Keith Busch
I've pushed my branch that tries to make this work with the XFS
data separation here:
http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-zoned-streams
This is basically my current WIP xfs zoned (aka always write out of place)
work optimistically destined for 6.14 + the patch set in this thread +
a little fix to make it work for nvme-multipath plus the tiny patch to
wire it up.
The good news is that the API from Keith mostly works. I don't really
know how to cope with the streams per partition bitmap, and I suspect
this will need to be dealt with a bit better. One option might be
to always have a bitmap, which would also support discontiguous
write stream numbers as actually supported by the underlying NVMe
implementation; another option would be to always map to consecutive
numbers.
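For the bitmap option, a rough sketch of the translation (helper name
and calling convention are assumptions, not existing block layer API):
the file system or application uses a dense, zero-based stream index
and the block layer maps it onto the possibly discontiguous stream ids
the device reports:

#include <linux/bitmap.h>
#include <linux/errno.h>

static int dense_index_to_stream_id(const unsigned long *stream_bitmap,
				    unsigned int nr_stream_ids,
				    unsigned int dense_idx)
{
	/* find_nth_bit() returns nr_stream_ids if no such bit is set */
	unsigned long bit = find_nth_bit(stream_bitmap, nr_stream_ids,
					 dense_idx);

	return bit < nr_stream_ids ? (int)bit : -EINVAL;
}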
The bad news is that for file systems or applications to make full use
of the API we also really need an API to expose how much space is left
in a write stream, as otherwise they can easily get out of sync on
a power fail. I've left that code in as a TODO, it should not affect
basic testing.
We get the same kind of performance numbers as the ZNS support on
comparable hardware platforms, which is expected. Testing on
actual state-of-the-art non-prototype hardware will take longer,
as the capacities are big enough that getting serious numbers will
take a lot of time.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-05 15:50 ` Christoph Hellwig
@ 2024-11-06 18:36 ` Keith Busch
2024-11-07 20:36 ` Keith Busch
1 sibling, 0 replies; 106+ messages in thread
From: Keith Busch @ 2024-11-06 18:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
linux-fsdevel, joshi.k, javier.gonz, bvanassche
On Tue, Nov 05, 2024 at 04:50:14PM +0100, Christoph Hellwig wrote:
> I've pushed my branch that tries to make this work with the XFS
> data separation here:
>
> http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-zoned-streams
>
> This is basically my current WIP xfs zoned (aka always write out place)
> work optimistically destined for 6.14 + the patch set in this thread +
> a little fix to make it work for nvme-multipath plus the tiny patch to
> wire it up.
>
> The good news is that the API from Keith mostly works. I don't really
> know how to cope with the streams per partition bitmap, and I suspect
> this will need to be dealt with a bit better. One option might be
> to always have a bitmap, which would also support discontiguous
> write stream numbers as actually supported by the underlying NVMe
> implementation, another option would be to always map to consecutive
> numbers.
Thanks for sharing that. Seeing the code makes it much easier to
understand where you're trying to steer this. I'll take a look and
probably have some feedback after a couple of days going through it.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-05 15:50 ` Christoph Hellwig
2024-11-06 18:36 ` Keith Busch
@ 2024-11-07 20:36 ` Keith Busch
2024-11-08 14:18 ` Christoph Hellwig
1 sibling, 1 reply; 106+ messages in thread
From: Keith Busch @ 2024-11-07 20:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
linux-fsdevel, joshi.k, javier.gonz, bvanassche
On Tue, Nov 05, 2024 at 04:50:14PM +0100, Christoph Hellwig wrote:
> I've pushed my branch that tries to make this work with the XFS
> data separation here:
>
> http://git.infradead.org/?p=users/hch/xfs.git;a=shortlog;h=refs/heads/xfs-zoned-streams
The zoned block support all looks pretty neat, but I think you're making
this harder than necessary to support streams. You don't need to treat
these like a sequential write device. The controller side does its own
garbage collection, so there is no need to duplicate the effort on the
host. And it looks like the host-side gc potentially merges multiple
streams into a single gc stream, which is probably not desirable.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-07 20:36 ` Keith Busch
@ 2024-11-08 14:18 ` Christoph Hellwig
2024-11-08 15:51 ` Keith Busch
0 siblings, 1 reply; 106+ messages in thread
From: Christoph Hellwig @ 2024-11-08 14:18 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
bvanassche
On Thu, Nov 07, 2024 at 01:36:35PM -0700, Keith Busch wrote:
> The zone block support all looks pretty neat, but I think you're making
> this harder than necessary to support streams. You don't need to treat
> these like a sequential write device. The controller side does its own
> garbage collection, so no need to duplicate the effort on the host. And
> it looks like the host side gc potentially merges multiple streams into
> a single gc stream, so that's probably not desirable.
We're not really duplicating much. Writing sequentially is pretty easy,
and tracking reclaim units separately means you need another tracking
data structure, and either that or the LBA one is always going to be
badly fragmented if they aren't the same.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-08 14:18 ` Christoph Hellwig
@ 2024-11-08 15:51 ` Keith Busch
2024-11-08 16:54 ` Matthew Wilcox
2024-11-11 6:48 ` Christoph Hellwig
0 siblings, 2 replies; 106+ messages in thread
From: Keith Busch @ 2024-11-08 15:51 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, linux-scsi, io-uring,
linux-fsdevel, joshi.k, javier.gonz, bvanassche
On Fri, Nov 08, 2024 at 03:18:52PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 07, 2024 at 01:36:35PM -0700, Keith Busch wrote:
> > The zone block support all looks pretty neat, but I think you're making
> > this harder than necessary to support streams. You don't need to treat
> > these like a sequential write device. The controller side does its own
> > garbage collection, so no need to duplicate the effort on the host. And
> > it looks like the host side gc potentially merges multiple streams into
> > a single gc stream, so that's probably not desirable.
>
> We're not really duplicating much. Writing sequential is pretty easy,
> and tracking reclaim units separately means you need another tracking
> data structure, and either that or the LBA one is always going to be
> badly fragmented if they aren't the same.
You're getting fragmentation anyway, which is why you had to implement
gc. You're just shifting who gets to deal with it from the controller to
the host. The host is further from the media, so you're starting from a
disadvantage. The host gc implementation would have to be quite a bit
better to justify the link and memory usage necessary for the copies
(...queue a copy-offload discussion? oom?).
This xfs implementation also has logic to recover from a power failure.
The device already does that if you use the LBA abstraction instead of
tracking sequential write pointers and free blocks.
I think you are underestimating the duplication of effort going on
here.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-08 15:51 ` Keith Busch
@ 2024-11-08 16:54 ` Matthew Wilcox
2024-11-08 17:43 ` Javier Gonzalez
2024-11-11 6:49 ` Christoph Hellwig
2024-11-11 6:48 ` Christoph Hellwig
1 sibling, 2 replies; 106+ messages in thread
From: Matthew Wilcox @ 2024-11-08 16:54 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
bvanassche
On Fri, Nov 08, 2024 at 08:51:31AM -0700, Keith Busch wrote:
> On Fri, Nov 08, 2024 at 03:18:52PM +0100, Christoph Hellwig wrote:
> > We're not really duplicating much. Writing sequential is pretty easy,
> > and tracking reclaim units separately means you need another tracking
> > data structure, and either that or the LBA one is always going to be
> > badly fragmented if they aren't the same.
>
> You're getting fragmentation anyway, which is why you had to implement
> gc. You're just shifting who gets to deal with it from the controller to
> the host. The host is further from the media, so you're starting from a
> disadvantage. The host gc implementation would have to be quite a bit
> better to justify the link and memory usage necessary for the copies
> (...queue a copy-offload discussion? oom?).
But the filesystem knows which blocks are actually in use. Sending
TRIM/DISCARD information to the drive at block-level granularity hasn't
worked out so well in the past. So the drive is the one at a disadvantage
because it has to copy blocks which aren't actually in use.
I like the idea of using copy-offload though.
^ permalink raw reply [flat|nested] 106+ messages in thread
* RE: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-08 16:54 ` Matthew Wilcox
@ 2024-11-08 17:43 ` Javier Gonzalez
2024-11-08 18:51 ` Bart Van Assche
2024-11-11 6:51 ` Christoph Hellwig
2024-11-11 6:49 ` Christoph Hellwig
1 sibling, 2 replies; 106+ messages in thread
From: Javier Gonzalez @ 2024-11-08 17:43 UTC (permalink / raw)
To: Matthew Wilcox, Keith Busch
Cc: Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
> On Fri, Nov 08, 2024 at 08:51:31AM -0700, Keith Busch wrote:
> > On Fri, Nov 08, 2024 at 03:18:52PM +0100, Christoph Hellwig wrote:
> > > We're not really duplicating much. Writing sequential is pretty easy,
> > > and tracking reclaim units separately means you need another tracking
> > > data structure, and either that or the LBA one is always going to be
> > > badly fragmented if they aren't the same.
> >
> > You're getting fragmentation anyway, which is why you had to implement
> > gc. You're just shifting who gets to deal with it from the controller to
> > the host. The host is further from the media, so you're starting from a
> > disadvantage. The host gc implementation would have to be quite a bit
> > better to justify the link and memory usage necessary for the copies
> > (...queue a copy-offload discussion? oom?).
>
> But the filesystem knows which blocks are actually in use. Sending
> TRIM/DISCARD information to the drive at block-level granularity hasn't
> worked out so well in the past. So the drive is the one at a disadvantage
> because it has to copy blocks which aren't actually in use.
It is true that TRIM has not been great. I would say that at least enterprise
SSDs have fixed this in general. For FDP, DSM Deallocate is respected, which
provides a good "erase" interface to the host.
It is true though that this is not properly described in the spec, and we
should fix it.
>
> I like the idea of using copy-offload though.
We have been iterating on the patches for years, but it is unfortunately
one of those series that goes in circles forever. I don't think it is due
to any specific problem, but mostly due to conflicting requests from
different folks reviewing. Last time I talked to Damien he asked me to
send the patches again; we have not followed through due to bandwidth.
If there is interest, we can re-spin this again...
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-08 17:43 ` Javier Gonzalez
@ 2024-11-08 18:51 ` Bart Van Assche
2024-11-11 9:31 ` Javier Gonzalez
2024-11-11 6:51 ` Christoph Hellwig
1 sibling, 1 reply; 106+ messages in thread
From: Bart Van Assche @ 2024-11-08 18:51 UTC (permalink / raw)
To: Javier Gonzalez, Matthew Wilcox, Keith Busch
Cc: Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 11/8/24 9:43 AM, Javier Gonzalez wrote:
> If there is an interest, we can re-spin this again...
I'm interested. Work is ongoing in JEDEC on support for copy offloading
for UFS devices. This work involves standardizing which SCSI copy
offloading features should be supported and which features are not
required. Implementations are expected to be available soon.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-08 18:51 ` Bart Van Assche
@ 2024-11-11 9:31 ` Javier Gonzalez
2024-11-11 17:45 ` Bart Van Assche
0 siblings, 1 reply; 106+ messages in thread
From: Javier Gonzalez @ 2024-11-11 9:31 UTC (permalink / raw)
To: Bart Van Assche
Cc: Matthew Wilcox, Keith Busch, Christoph Hellwig, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On 08.11.2024 10:51, Bart Van Assche wrote:
>On 11/8/24 9:43 AM, Javier Gonzalez wrote:
>>If there is an interest, we can re-spin this again...
>
>I'm interested. Work is ongoing in JEDEC on support for copy offloading
>for UFS devices. This work involves standardizing which SCSI copy
>offloading features should be supported and which features are not
>required. Implementations are expected to be available soon.
>
Do you have any specific blockers on the last series? I know you have
left comments in many of the patches already, but I think we are all a
bit confused on where we are ATM.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-11 9:31 ` Javier Gonzalez
@ 2024-11-11 17:45 ` Bart Van Assche
2024-11-12 13:52 ` Nitesh Shetty
0 siblings, 1 reply; 106+ messages in thread
From: Bart Van Assche @ 2024-11-11 17:45 UTC (permalink / raw)
To: Javier Gonzalez
Cc: Matthew Wilcox, Keith Busch, Christoph Hellwig, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On 11/11/24 1:31 AM, Javier Gonzalez wrote:
> On 08.11.2024 10:51, Bart Van Assche wrote:
>> On 11/8/24 9:43 AM, Javier Gonzalez wrote:
>>> If there is an interest, we can re-spin this again...
>>
>> I'm interested. Work is ongoing in JEDEC on support for copy offloading
>> for UFS devices. This work involves standardizing which SCSI copy
>> offloading features should be supported and which features are not
>> required. Implementations are expected to be available soon.
>
> Do you have any specific blockers on the last series? I know you have
> left comments in many of the patches already, but I think we are all a
> bit confused on where we are ATM.
Nobody replied to this question that was raised 4 months ago:
https://lore.kernel.org/linux-block/[email protected]/
I think we need to agree about the answer to that question before we can
continue with implementing copy offloading.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-11 17:45 ` Bart Van Assche
@ 2024-11-12 13:52 ` Nitesh Shetty
2024-11-19 2:03 ` Martin K. Petersen
0 siblings, 1 reply; 106+ messages in thread
From: Nitesh Shetty @ 2024-11-12 13:52 UTC (permalink / raw)
To: Bart Van Assche
Cc: Javier Gonzalez, Matthew Wilcox, Keith Busch, Christoph Hellwig,
Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 11/11/24 09:45AM, Bart Van Assche wrote:
>On 11/11/24 1:31 AM, Javier Gonzalez wrote:
>>On 08.11.2024 10:51, Bart Van Assche wrote:
>>>On 11/8/24 9:43 AM, Javier Gonzalez wrote:
>>>>If there is an interest, we can re-spin this again...
>>>
>>>I'm interested. Work is ongoing in JEDEC on support for copy offloading
>>>for UFS devices. This work involves standardizing which SCSI copy
>>>offloading features should be supported and which features are not
>>>required. Implementations are expected to be available soon.
>>
>>Do you have any specific blockers on the last series? I know you have
>>left comments in many of the patches already, but I think we are all a
>>bit confused on where we are ATM.
>
>Nobody replied to this question that was raised 4 months ago:
>https://lore.kernel.org/linux-block/[email protected]/
>
>I think we need to agree about the answer to that question before we can
>continue with implementing copy offloading.
>
Yes, I feel the same.
The blocker with copy has been how we should plumb things in the block
layer. Here are a couple of approaches we tried in the past [1],
restated for reference:
1. Payload-based approach:
a. Based on Mikulas' patch, a common payload is used for both the
source and the destination bio.
b. We first send the source bio; upon reaching the driver we update the
payload and complete the bio.
c. We then send the destination bio; in the driver layer we recover the
source info from the payload and send the copy command to the device
(a rough sketch of this flow follows below).
Drawback:
The request payload contains IO information rather than data. Based on
past experience, Christoph and Bart suggested this is not a good way
forward.
Christoph's alternate suggestion was to use separate BIOs for source
and destination and match them using a token/id. As Bart pointed out,
it is hard to match them when an IO split happens.
2. Plug-based approach:
a. Take a plug, send the destination bio, form the request, and wait
for the source bio.
b. Send the source bio and merge it with the destination bio.
c. Upon release of the plug, send the request down to the driver.
Drawback:
This doesn't work for stacked devices, which have async submission.
Bart suggested this is not a good solution overall. The alternate
suggestion was to use a list-based approach, but we observed lifetime
management problems, especially in failure handling.
3. Single-bio approach:
a. Use a single bio to represent both the source and destination info.
b. Use abnormal IO handling, similar to discard.
Drawback:
Christoph pointed out that this has the issue of the payload containing
information for both the IO stack and the wire.
I am really torn on how to proceed further.
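To make the payload-based flow in approach 1 concrete, here is a toy
userspace model (the structures and helpers are hypothetical, not the code
from the earlier series): one payload is shared by both legs, the source
leg only records its range and completes, and the destination leg recovers
that information to build the actual copy command.

/*
 * Toy model of the payload-based matching: names are made up for
 * illustration and nothing here is block layer code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct copy_payload {
	uint64_t src_lba;
	uint32_t nr_blocks;
	bool src_done;		/* the source leg reached the driver */
};

/* Source leg: no data transfer, just capture the range and complete. */
static void driver_handle_copy_src(struct copy_payload *p, uint64_t lba, uint32_t nr)
{
	p->src_lba = lba;
	p->nr_blocks = nr;
	p->src_done = true;
}

/* Destination leg: recover the source info and issue the copy command. */
static int driver_handle_copy_dst(struct copy_payload *p, uint64_t dst_lba)
{
	if (!p->src_done)
		return -1;	/* the legs got separated; fall back to read/write */
	printf("COPY %u blocks: LBA %llu -> LBA %llu\n", (unsigned)p->nr_blocks,
	       (unsigned long long)p->src_lba, (unsigned long long)dst_lba);
	return 0;
}

int main(void)
{
	struct copy_payload p = { 0 };

	driver_handle_copy_src(&p, 4096, 256);
	return driver_handle_copy_dst(&p, 8192);
}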
-- Nitesh Shetty
[1] https://lore.kernel.org/linux-block/[email protected]@samsung.com/
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-12 13:52 ` Nitesh Shetty
@ 2024-11-19 2:03 ` Martin K. Petersen
2024-11-25 23:21 ` Bart Van Assche
0 siblings, 1 reply; 106+ messages in thread
From: Martin K. Petersen @ 2024-11-19 2:03 UTC (permalink / raw)
To: Nitesh Shetty
Cc: Bart Van Assche, Javier Gonzalez, Matthew Wilcox, Keith Busch,
Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
Nitesh,
> 1.payload based approach:
> a. Based on Mikulas patch, here a common payload is used for both source
> and destination bio.
> b. Initially we send source bio, upon reaching driver we update payload
> and complete the bio.
> c. Send destination bio, in driver layer we recover the source info
> from the payload and send the copy command to device.
>
> Drawback:
> Request payload contains IO information rather than data.
> Based on past experience Christoph and Bart suggested not a good way
> forward.
> Alternate suggestion from Christoph was to used separate BIOs for src
> and destination and match them using token/id.
> As Bart pointed, I find it hard how to match when the IO split happens.
In my experience the payload-based approach was what made things work. I
tried many things before settling on that. Also note that to support
token-based SCSI devices, you inevitably need to separate the
read/copy_in operation from the write/copy_out ditto and carry the token
in the payload.
For "single copy command" devices, you can just synthesize the token in
the driver. Although I don't really know what the point of the token is
in that case because as far as I'm concerned, the only interesting
information is that the read/copy_in operation made it down the stack
without being split.
Handling splits made things way too complicated for my taste. Especially
with a potential many-to-many mapping. Better to just fall back to
regular read/writes if either the copy_in or the copy_out operation
needs to be split. If your stacked storage is configured with a
prohibitively small stripe chunk size, then your copy performance is
just going to be approaching that of a regular read/write data movement.
Not a big deal as far as I'm concerned...
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-19 2:03 ` Martin K. Petersen
@ 2024-11-25 23:21 ` Bart Van Assche
2024-11-27 2:54 ` Martin K. Petersen
0 siblings, 1 reply; 106+ messages in thread
From: Bart Van Assche @ 2024-11-25 23:21 UTC (permalink / raw)
To: Martin K. Petersen, Nitesh Shetty
Cc: Javier Gonzalez, Matthew Wilcox, Keith Busch, Christoph Hellwig,
Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 11/18/24 6:03 PM, Martin K. Petersen wrote:
> In my experience the payload-based approach was what made things work. I
> tried many things before settling on that. Also note that to support
> token-based SCSI devices, you inevitably need to separate the
> read/copy_in operation from the write/copy_out ditto and carry the token
> in the payload.
>
> For "single copy command" devices, you can just synthesize the token in
> the driver. Although I don't really know what the point of the token is
> in that case because as far as I'm concerned, the only interesting
> information is that the read/copy_in operation made it down the stack
> without being split.
Hi Martin,
There are some strong arguments in this thread from May 2024 in favor of
representing the entire copy operation as a single REQ_OP_ operation:
https://lore.kernel.org/linux-block/[email protected]/
Token-based copy offloading (called ODX by Microsoft) could be
implemented by maintaining a state machine in the SCSI sd driver and
using a single block layer request to submit the following SCSI
commands:
* POPULATE TOKEN
* RECEIVE ROD TOKEN INFORMATION
* WRITE USING TOKEN
I'm assuming that the IMMED bit will be set to zero in the WRITE USING
TOKEN command. Otherwise one or more additional RECEIVE ROD TOKEN
INFORMATION commands would be required to poll for the WRITE USING TOKEN
completion status.
I guess that the block layer maintainer wouldn't be happy if all block
drivers had to deal with three or four phases for copy offloading just
because ODX is this complicated.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-25 23:21 ` Bart Van Assche
@ 2024-11-27 2:54 ` Martin K. Petersen
2024-11-27 18:42 ` Bart Van Assche
0 siblings, 1 reply; 106+ messages in thread
From: Martin K. Petersen @ 2024-11-27 2:54 UTC (permalink / raw)
To: Bart Van Assche
Cc: Martin K. Petersen, Nitesh Shetty, Javier Gonzalez,
Matthew Wilcox, Keith Busch, Christoph Hellwig, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
Bart,
> There are some strong arguments in this thread from May 2024 in favor of
> representing the entire copy operation as a single REQ_OP_ operation:
> https://lore.kernel.org/linux-block/[email protected]/
As has been discussed many times, a copy operation is semantically a
read operation followed by a write operation. And, based on my
experience implementing support for both types of copy offload in Linux,
what made things elegant was treating the operation as a read followed
by a write throughout the stack. Exactly like the token-based offload
specification describes.
> Token-based copy offloading (called ODX by Microsoft) could be
> implemented by maintaining a state machine in the SCSI sd driver
I suspect the SCSI maintainer would object strongly to the idea of
maintaining cross-device copy offload state and associated object
lifetime issues in the sd driver.
> I'm assuming that the IMMED bit will be set to zero in the WRITE USING
> TOKEN command. Otherwise one or more additional RECEIVE ROD TOKEN
> INFORMATION commands would be required to poll for the WRITE USING TOKEN
> completion status.
What would be the benefit of making WRITE USING TOKEN a background
operation? That seems like a completely unnecessary complication.
> I guess that the block layer maintainer wouldn't be happy if all block
> drivers would have to deal with three or four phases for copy
> offloading just because ODX is this complicated.
Last I looked, EXTENDED COPY consumed something like 70 pages in the
spec. Token-based copy is trivially simple and elegant by comparison.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-27 2:54 ` Martin K. Petersen
@ 2024-11-27 18:42 ` Bart Van Assche
2024-11-27 20:14 ` Martin K. Petersen
0 siblings, 1 reply; 106+ messages in thread
From: Bart Van Assche @ 2024-11-27 18:42 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Nitesh Shetty, Javier Gonzalez, Matthew Wilcox, Keith Busch,
Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 11/26/24 6:54 PM, Martin K. Petersen wrote:
> Bart wrote:
>> There are some strong arguments in this thread from May 2024 in favor of
>> representing the entire copy operation as a single REQ_OP_ operation:
>> https://lore.kernel.org/linux-block/[email protected]/
>
> As has been discussed many times, a copy operation is semantically a
> read operation followed by a write operation. And, based on my
> experience implementing support for both types of copy offload in Linux,
> what made things elegant was treating the operation as a read followed
> by a write throughout the stack. Exactly like the token-based offload
> specification describes.
Submitting a copy operation as two bios or two requests means that there
is a risk that one of the two operations never reaches the block driver
at the bottom of the storage stack and hence that a deadlock occurs. I
prefer not to introduce any mechanisms that can cause a deadlock.
As one can see here, Damien Le Moal and Keith Busch both prefer to
submit copy operations as a single operation: Keith Busch, Re: [PATCH
v20 02/12] Add infrastructure for copy offload in block and request
layer, linux-block mailing list, 2024-06-24
(https://lore.kernel.org/all/[email protected]/).
>> Token-based copy offloading (called ODX by Microsoft) could be
>> implemented by maintaining a state machine in the SCSI sd driver
>
> I suspect the SCSI maintainer would object strongly to the idea of
> maintaining cross-device copy offload state and associated object
> lifetime issues in the sd driver.
Such information wouldn't have to be maintained inside the sd driver. A
new kernel module could be introduced that tracks the state of copy
operations and that interacts with the sd driver.
>> I'm assuming that the IMMED bit will be set to zero in the WRITE USING
>> TOKEN command. Otherwise one or more additional RECEIVE ROD TOKEN
>> INFORMATION commands would be required to poll for the WRITE USING TOKEN
>> completion status.
>
> What would the benefit of making WRITE USING TOKEN be a background
> operation? That seems like a completely unnecessary complication.
If a single copy operation takes significantly more time than the time
required to switch between power states, power can be saved by using
IMMED=1. Mechanisms like run-time power management (RPM) or the UFS host
controller auto-hibernation mechanism can only be activated if no
commands are in progress. With IMMED=0, the link between the host and
the storage device will remain powered as long as the copy operation is
in progress. With IMMED=1, the link between the host and the storage
device can be powered down after the copy operation has been submitted
until the host decides to check whether or not the copy operation has
completed.
>> I guess that the block layer maintainer wouldn't be happy if all block
>> drivers would have to deal with three or four phases for copy
>> offloading just because ODX is this complicated.
>
> Last I looked, EXTENDED COPY consumed something like 70 pages in the
> spec. Token-based copy is trivially simple and elegant by comparison.
I don't know of any storage device vendor who has implemented all
EXTENDED COPY features that have been standardized. Assuming that 50
lines of code fit on a single page, here is an example of an EXTENDED
COPY implementation that can be printed on 21 pages of paper:
$ wc -l drivers/target/target_core_xcopy.c
1041
$ echo $(((1041 + 49) / 50))
21
The advantages of EXTENDED COPY over ODX are as follows:
- EXTENDED COPY is a single SCSI command and hence better suited for
devices with a limited queue depth. While the UFS 3.0 standard
restricts the queue depth to 32, most UFS 4.0 devices support a
queue depth of 64.
- The latency of setting up a copy command with EXTENDED COPY is
lower since only a single command has to be sent to the device.
ODX requires three round-trips to the device (assuming IMMED=0).
- EXTENDED COPY requires less memory in storage devices. Each ODX
token occupies some memory and the rules around token lifetimes
are nontrivial.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-27 18:42 ` Bart Van Assche
@ 2024-11-27 20:14 ` Martin K. Petersen
2024-11-27 21:06 ` Bart Van Assche
` (3 more replies)
0 siblings, 4 replies; 106+ messages in thread
From: Martin K. Petersen @ 2024-11-27 20:14 UTC (permalink / raw)
To: Bart Van Assche
Cc: Martin K. Petersen, Nitesh Shetty, Javier Gonzalez,
Matthew Wilcox, Keith Busch, Christoph Hellwig, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
Bart,
> Submitting a copy operation as two bios or two requests means that
> there is a risk that one of the two operations never reaches the block
> driver at the bottom of the storage stack and hence that a deadlock
> occurs. I prefer not to introduce any mechanisms that can cause a
> deadlock.
How do you copy a block range without offload? You perform a READ to
read the data into memory. And once the READ completes, you do a WRITE
of the data to the new location.
Token-based copy offload works exactly the same way. You do a POPULATE
TOKEN which is identical to a READ except you get a cookie instead of
the actual data. And then once you have the cookie, you perform a WRITE
USING TOKEN to perform the write operation. Semantically, it's exactly
the same as a normal copy except for the lack of data movement. That's
the whole point!
Once I had support for token-based copy offload working, it became clear
to me that this approach is much simpler than pointer matching, bio
pairs, etc. The REQ_OP_COPY_IN operation and the REQ_OP_COPY_OUT
operation are never in flight at the same time. There are no
synchronization hassles, no lifetimes, no lookup tables in the sd
driver, no nonsense. Semantically, it's a read followed by a write.
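A minimal sketch of that sequence (a userspace model with placeholder
names, not the proposed block layer interface): the COPY_IN leg either
returns an opaque token (token-based devices) or simply validates the
range, and only after it completes is the COPY_OUT leg submitted, carrying
the token.

/*
 * Sketch only: the two phases are strictly sequential and the token is
 * the only state, owned by the submitter.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ROD_TOKEN_LEN 512

struct copy_token {
	uint8_t data[ROD_TOKEN_LEN];	/* opaque cookie, only the device interprets it */
	int valid;
};

/* COPY_IN: POPULATE TOKEN on token devices, a no-op validation otherwise. */
static int submit_copy_in(uint64_t src_lba, uint32_t nr, struct copy_token *tok)
{
	memset(tok->data, 0xab, sizeof(tok->data));	/* stand-in for the device's ROD token */
	tok->valid = 1;
	printf("COPY_IN  lba=%llu nr=%u completed\n", (unsigned long long)src_lba, (unsigned)nr);
	return 0;
}

/* COPY_OUT: WRITE USING TOKEN / EXTENDED COPY / NVMe Copy, issued only afterwards. */
static int submit_copy_out(uint64_t dst_lba, uint32_t nr, const struct copy_token *tok)
{
	if (!tok->valid)
		return -1;
	printf("COPY_OUT lba=%llu nr=%u completed\n", (unsigned long long)dst_lba, (unsigned)nr);
	return 0;
}

int main(void)
{
	struct copy_token tok = { .valid = 0 };

	if (submit_copy_in(0, 1024, &tok))
		return 1;	/* fall back to a plain read/write copy */
	return submit_copy_out(1 << 20, 1024, &tok);
}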
For devices that implement single-command copy offload, the
REQ_OP_COPY_IN operation only serves as a validation that no splitting
took place. Once the bio reaches the ULD, the I/O is completed without
ever sending a command to the device. blk-lib then issues a
REQ_OP_COPY_OUT which gets turned into EXTENDED COPY or NVMe Copy and
sent to the destination device.
Aside from making things trivially simple, the COPY_IN/COPY_OUT semantic
is a *requirement* for token-based offload devices. Why would we even
consider having two incompatible sets of copy offload semantics coexist
in the block layer?
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-27 20:14 ` Martin K. Petersen
@ 2024-11-27 21:06 ` Bart Van Assche
2024-11-28 2:09 ` Martin K. Petersen
2024-11-28 3:24 ` Christoph Hellwig
` (2 subsequent siblings)
3 siblings, 1 reply; 106+ messages in thread
From: Bart Van Assche @ 2024-11-27 21:06 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Nitesh Shetty, Javier Gonzalez, Matthew Wilcox, Keith Busch,
Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 11/27/24 12:14 PM, Martin K. Petersen wrote:
> Once I had support for token-based copy offload working, it became clear
> to me that this approach is much simpler than pointer matching, bio
> pairs, etc. The REQ_OP_COPY_IN operation and the REQ_OP_COPY_OUT
> operation are never in flight at the same time. There are no
> synchronization hassles, no lifetimes, no lookup tables in the sd
> driver, no nonsense. Semantically, it's a read followed by a write.
What if the source LBA range does not require splitting but the
destination LBA range requires splitting, e.g. because it crosses a
chunk_sectors boundary? Will the REQ_OP_COPY_IN operation succeed in
this case and the REQ_OP_COPY_OUT operation fail? Does this mean that a
third operation is needed to cancel REQ_OP_COPY_IN operations if the
REQ_OP_COPY_OUT operation fails?
Additionally, how should bugs in REQ_OP_COPY_* submitters be handled where
a large number of REQ_OP_COPY_IN operations are submitted without
corresponding REQ_OP_COPY_OUT operations? Is perhaps a mechanism required
to discard unmatched REQ_OP_COPY_IN operations after a certain time?
> Aside from making things trivially simple, the COPY_IN/COPY_OUT semantic
> is a *requirement* for token-based offload devices.
Hmm ... we may each have a different opinion about whether or not the
COPY_IN/COPY_OUT semantics are a requirement for token-based copy
offloading.
Additionally, I'm not convinced that implementing COPY_IN/COPY_OUT for
ODX devices is that simple. The COPY_IN and COPY_OUT operations have
to be translated into three SCSI commands, don't they? I'm referring to
the POPULATE TOKEN, RECEIVE ROD TOKEN INFORMATION and WRITE USING TOKEN
commands. What is your opinion about how to translate the two block
layer operations into these three SCSI commands?
> Why would we even consider having two incompatible sets of copy
> offload semantics coexist in the block layer?
I am not aware of any proposal to implement two sets of copy operations
in the block layer. All proposals I have seen so far involve adding a
single set of copy operations to the block layer. Opinions differ
however about whether to add a single copy operation primitive or
separate IN and OUT primitives.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-27 21:06 ` Bart Van Assche
@ 2024-11-28 2:09 ` Martin K. Petersen
2024-11-28 8:51 ` Damien Le Moal
0 siblings, 1 reply; 106+ messages in thread
From: Martin K. Petersen @ 2024-11-28 2:09 UTC (permalink / raw)
To: Bart Van Assche
Cc: Martin K. Petersen, Nitesh Shetty, Javier Gonzalez,
Matthew Wilcox, Keith Busch, Christoph Hellwig, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
Bart,
> What if the source LBA range does not require splitting but the
> destination LBA range requires splitting, e.g. because it crosses a
> chunk_sectors boundary? Will the REQ_OP_COPY_IN operation succeed in
> this case and the REQ_OP_COPY_OUT operation fail?
Yes.
I experimented with approaching splitting in an iterative fashion. And
thus, if there was a split halfway through the COPY_IN I/O, we'd issue a
corresponding COPY_OUT up to the split point and hope that the write
subsequently didn't need a split. And then deal with the next segment.
However, given that copy offload offers diminishing returns for small
I/Os, it was not worth the hassle for the devices I used for
development. It was cleaner and faster to just fall back to regular
read/write when a split was required.
> Does this mean that a third operation is needed to cancel
> REQ_OP_COPY_IN operations if the REQ_OP_COPY_OUT operation fails?
No. The device times out the token.
> Additionally, how to handle bugs in REQ_OP_COPY_* submitters where a
> large number of REQ_OP_COPY_IN operations is submitted without
> corresponding REQ_OP_COPY_OUT operation? Is perhaps a mechanism
> required to discard unmatched REQ_OP_COPY_IN operations after a
> certain time?
See above.
For your EXTENDED COPY use case there is no token and thus the COPY_IN
completes immediately.
And for the token case, if you populate a million tokens and don't use
them before they time out, it sounds like your submitting code is badly
broken. But it doesn't matter because there are no I/Os in flight and
thus nothing to discard.
> Hmm ... we may each have a different opinion about whether or not the
> COPY_IN/COPY_OUT semantics are a requirement for token-based copy
> offloading.
Maybe. But you'll have a hard time convincing me to add any kind of
state machine or bio matching magic to the SCSI stack when the simplest
solution is to treat copying like a read followed by a write. There is
no concurrency, no kernel state, no dependency between two commands, nor
two scsi_disk/scsi_device object lifetimes to manage.
> Additionally, I'm not convinced that implementing COPY_IN/COPY_OUT for
> ODX devices is that simple. The COPY_IN and COPY_OUT operations have
> to be translated into three SCSI commands, isn't it? I'm referring to
> the POPULATE TOKEN, RECEIVE ROD TOKEN INFORMATION and WRITE USING
> TOKEN commands. What is your opinion about how to translate the two
> block layer operations into these three SCSI commands?
COPY_IN is translated to a NOP for devices implementing EXTENDED COPY
and a POPULATE TOKEN for devices using tokens.
COPY_OUT is translated to an EXTENDED COPY (or NVMe Copy) for devices
using a single command approach and WRITE USING TOKEN for devices using
tokens.
There is no need for RECEIVE ROD TOKEN INFORMATION.
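Expressed as a tiny lookup (an illustrative sketch with made-up enums, not
the sd driver's actual code), that translation looks like this:

#include <stdio.h>

enum copy_op { COPY_IN, COPY_OUT };
enum copy_mode { COPY_SINGLE_CMD, COPY_TOKEN };	/* EXTENDED COPY/NVMe Copy vs. ROD tokens */

/* Which command (if any) each block layer operation maps to. */
static const char *copy_cmd_for(enum copy_op op, enum copy_mode mode)
{
	if (op == COPY_IN)
		return mode == COPY_TOKEN ? "POPULATE TOKEN" :
					    "(no command, completed in the ULD)";
	return mode == COPY_TOKEN ? "WRITE USING TOKEN" :
				    "EXTENDED COPY / NVMe Copy";
}

int main(void)
{
	printf("COPY_IN,  single-cmd device: %s\n", copy_cmd_for(COPY_IN, COPY_SINGLE_CMD));
	printf("COPY_IN,  token device:      %s\n", copy_cmd_for(COPY_IN, COPY_TOKEN));
	printf("COPY_OUT, single-cmd device: %s\n", copy_cmd_for(COPY_OUT, COPY_SINGLE_CMD));
	printf("COPY_OUT, token device:      %s\n", copy_cmd_for(COPY_OUT, COPY_TOKEN));
	return 0;
}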
I am not aware of UFS devices using the token-based approach. And for
EXTENDED COPY there is only a single command sent to the device. If you
want to do power management while that command is being processed,
please deal with that in UFS. The block layer doesn't deal with the
async variants of any of the other SCSI commands either...
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-28 2:09 ` Martin K. Petersen
@ 2024-11-28 8:51 ` Damien Le Moal
2024-11-29 6:19 ` Christoph Hellwig
0 siblings, 1 reply; 106+ messages in thread
From: Damien Le Moal @ 2024-11-28 8:51 UTC (permalink / raw)
To: Martin K. Petersen, Bart Van Assche
Cc: Nitesh Shetty, Javier Gonzalez, Matthew Wilcox, Keith Busch,
Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 11/28/24 11:09, Martin K. Petersen wrote:
>
> Bart,
>
>> What if the source LBA range does not require splitting but the
>> destination LBA range requires splitting, e.g. because it crosses a
>> chunk_sectors boundary? Will the REQ_OP_COPY_IN operation succeed in
>> this case and the REQ_OP_COPY_OUT operation fail?
>
> Yes.
>
> I experimented with approaching splitting in an iterative fashion. And
> thus, if there was a split halfway through the COPY_IN I/O, we'd issue a
> corresponding COPY_OUT up to the split point and hope that the write
> subsequently didn't need a split. And then deal with the next segment.
>
> However, given that copy offload offers diminishing returns for small
> I/Os, it was not worth the hassle for the devices I used for
> development. It was cleaner and faster to just fall back to regular
> read/write when a split was required.
>
>> Does this mean that a third operation is needed to cancel
>> REQ_OP_COPY_IN operations if the REQ_OP_COPY_OUT operation fails?
>
> No. The device times out the token.
>
>> Additionally, how to handle bugs in REQ_OP_COPY_* submitters where a
>> large number of REQ_OP_COPY_IN operations is submitted without
>> corresponding REQ_OP_COPY_OUT operation? Is perhaps a mechanism
>> required to discard unmatched REQ_OP_COPY_IN operations after a
>> certain time?
>
> See above.
>
> For your EXTENDED COPY use case there is no token and thus the COPY_IN
> completes immediately.
>
> And for the token case, if you populate a million tokens and don't use
> them before they time out, it sounds like your submitting code is badly
> broken. But it doesn't matter because there are no I/Os in flight and
> thus nothing to discard.
>
>> Hmm ... we may each have a different opinion about whether or not the
>> COPY_IN/COPY_OUT semantics are a requirement for token-based copy
>> offloading.
>
> Maybe. But you'll have a hard time convincing me to add any kind of
> state machine or bio matching magic to the SCSI stack when the simplest
> solution is to treat copying like a read followed by a write. There is
> no concurrency, no kernel state, no dependency between two commands, nor
> two scsi_disk/scsi_device object lifetimes to manage.
And that also would allow supporting a fake copy offload with regular read/write
BIOs very easily, I think. So all block devices can be presented as supporting
"copy offload". That is nice for FSes.
>
>> Additionally, I'm not convinced that implementing COPY_IN/COPY_OUT for
>> ODX devices is that simple. The COPY_IN and COPY_OUT operations have
>> to be translated into three SCSI commands, isn't it? I'm referring to
>> the POPULATE TOKEN, RECEIVE ROD TOKEN INFORMATION and WRITE USING
>> TOKEN commands. What is your opinion about how to translate the two
>> block layer operations into these three SCSI commands?
>
> COPY_IN is translated to a NOP for devices implementing EXTENDED COPY
> and a POPULATE TOKEN for devices using tokens.
>
> COPY_OUT is translated to an EXTENDED COPY (or NVMe Copy) for devices
> using a single command approach and WRITE USING TOKEN for devices using
> tokens.
The ATA WRITE GATHERED command is also a single copy command. That matches,
and while I have not checked SAT, translation would likely work.
While I was initially worried that the 2-BIO-based approach would be overly
complicated, it seems that I was wrong :)
>
> There is no need for RECEIVE ROD TOKEN INFORMATION.
>
> I am not aware of UFS devices using the token-based approach. And for
> EXTENDED COPY there is only a single command sent to the device. If you
> want to do power management while that command is being processed,
> please deal with that in UFS. The block layer doesn't deal with the
> async variants of any of the other SCSI commands either...
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-28 8:51 ` Damien Le Moal
@ 2024-11-29 6:19 ` Christoph Hellwig
2024-11-29 6:23 ` Damien Le Moal
0 siblings, 1 reply; 106+ messages in thread
From: Christoph Hellwig @ 2024-11-29 6:19 UTC (permalink / raw)
To: Damien Le Moal
Cc: Martin K. Petersen, Bart Van Assche, Nitesh Shetty,
Javier Gonzalez, Matthew Wilcox, Keith Busch, Christoph Hellwig,
Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On Thu, Nov 28, 2024 at 05:51:52PM +0900, Damien Le Moal wrote:
> > Maybe. But you'll have a hard time convincing me to add any kind of
> > state machine or bio matching magic to the SCSI stack when the simplest
> > solution is to treat copying like a read followed by a write. There is
> > no concurrency, no kernel state, no dependency between two commands, nor
> > two scsi_disk/scsi_device object lifetimes to manage.
>
> And that also would allow supporting a fake copy offload with regular
> read/write BIOs very easily, I think. So all block devices can be
> presented as supporting "copy offload". That is nice for FSes.
Just as when that showed up in one of the last copy offload series,
I'm still very critical of a stateless copy offload emulation. The
reason for that is that a host-based copy helper needs scratch space
to read into, and doing these large allocations on every copy puts a
lot of pressure on the allocator. Allocating the buffer once at
mount time and then just cycling through it is generally a lot more
efficient.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-29 6:19 ` Christoph Hellwig
@ 2024-11-29 6:23 ` Damien Le Moal
0 siblings, 0 replies; 106+ messages in thread
From: Damien Le Moal @ 2024-11-29 6:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Martin K. Petersen, Bart Van Assche, Nitesh Shetty,
Javier Gonzalez, Matthew Wilcox, Keith Busch, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On 11/29/24 15:19, Christoph Hellwig wrote:
> On Thu, Nov 28, 2024 at 05:51:52PM +0900, Damien Le Moal wrote:
>>> Maybe. But you'll have a hard time convincing me to add any kind of
>>> state machine or bio matching magic to the SCSI stack when the simplest
>>> solution is to treat copying like a read followed by a write. There is
>>> no concurrency, no kernel state, no dependency between two commands, nor
>>> two scsi_disk/scsi_device object lifetimes to manage.
>>
>> And that also would allow supporting a fake copy offload with regular
>> read/write BIOs very easily, I think. So all block devices can be
>> presented as supporting "copy offload". That is nice for FSes.
>
> Just as when that showed up in one of the last copy offload series
> I'm still very critical of a stateless copy offload emulation. The
> reason for that is that a host based copy helper needs scratch space
> to read into, and doing these large allocation on every copy puts a
> lot of pressure onto the allocator. Allocating the buffer once at
> mount time and the just cycling through it is generally a lot more
> efficient.
Sure, that sounds good. My point was that it seems that a token-based copy
offload design makes it relatively easy to emulate copy in software for devices
that do not support copy offload in hardware. That emulation can certainly be
implemented using a single buffer, as you suggest.
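For illustration, a minimal userspace sketch of such an emulation (plain
file descriptors stand in for the block device, and all names are made up):
the scratch buffer is allocated once up front and cycled through in chunks
with ordinary reads and writes.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define COPY_BUF_SIZE (1U << 20)	/* 1 MiB scratch buffer, allocated once */

/* Copy len bytes from src to dst on fd using one preallocated bounce buffer. */
static int emulate_copy(int fd, void *buf, off_t src, off_t dst, size_t len)
{
	while (len) {
		size_t chunk = len < COPY_BUF_SIZE ? len : COPY_BUF_SIZE;

		if (pread(fd, buf, chunk, src) != (ssize_t)chunk)
			return -1;
		if (pwrite(fd, buf, chunk, dst) != (ssize_t)chunk)
			return -1;
		src += chunk;
		dst += chunk;
		len -= chunk;
	}
	return 0;
}

int main(void)
{
	void *buf = malloc(COPY_BUF_SIZE);
	FILE *f = tmpfile();
	int ret = -1;

	if (buf && f) {
		char data[4096] = { 0x5a };

		/* seed 4 KiB at offset 0, then "copy" it to offset 1 MiB */
		if (pwrite(fileno(f), data, sizeof(data), 0) == sizeof(data))
			ret = emulate_copy(fileno(f), buf, 0, 1 << 20, sizeof(data));
	}
	free(buf);
	return ret;
}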
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-27 20:14 ` Martin K. Petersen
2024-11-27 21:06 ` Bart Van Assche
@ 2024-11-28 3:24 ` Christoph Hellwig
2024-11-28 15:21 ` Keith Busch
[not found] ` <CGME20241205081138epcas5p2a47090e70c3cf19e562f63cd9fc495d1@epcas5p2.samsung.com>
3 siblings, 0 replies; 106+ messages in thread
From: Christoph Hellwig @ 2024-11-28 3:24 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Bart Van Assche, Nitesh Shetty, Javier Gonzalez, Matthew Wilcox,
Keith Busch, Christoph Hellwig, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On Wed, Nov 27, 2024 at 03:14:09PM -0500, Martin K. Petersen wrote:
> How do you copy a block range without offload? You perform a READ to
> read the data into memory. And once the READ completes, you do a WRITE
> of the data to the new location.
Yes. I.e. this is code that makes this pattern very clear, and for
which I'd love to be able to use copy offload when available:
http://git.infradead.org/?p=users/hch/xfs.git;a=blob;f=fs/xfs/xfs_zone_gc.c;h=ed8aa08b3c18d50afe79326e697d83e09542a9b6;hb=refs/heads/xfs-zoned#l820
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-27 20:14 ` Martin K. Petersen
2024-11-27 21:06 ` Bart Van Assche
2024-11-28 3:24 ` Christoph Hellwig
@ 2024-11-28 15:21 ` Keith Busch
2024-11-28 16:40 ` Christoph Hellwig
[not found] ` <CGME20241205081138epcas5p2a47090e70c3cf19e562f63cd9fc495d1@epcas5p2.samsung.com>
3 siblings, 1 reply; 106+ messages in thread
From: Keith Busch @ 2024-11-28 15:21 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Bart Van Assche, Nitesh Shetty, Javier Gonzalez, Matthew Wilcox,
Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On Wed, Nov 27, 2024 at 03:14:09PM -0500, Martin K. Petersen wrote:
>
> Bart,
>
> > Submitting a copy operation as two bios or two requests means that
> > there is a risk that one of the two operations never reaches the block
> > driver at the bottom of the storage stack and hence that a deadlock
> > occurs. I prefer not to introduce any mechanisms that can cause a
> > deadlock.
>
> How do you copy a block range without offload? You perform a READ to
> read the data into memory. And once the READ completes, you do a WRITE
> of the data to the new location.
>
> Token-based copy offload works exactly the same way. You do a POPULATE
> TOKEN which is identical to a READ except you get a cookie instead of
> the actual data. And then once you have the cookie, you perform a WRITE
> USING TOKEN to perform the write operation. Semantically, it's exactly
> the same as a normal copy except for the lack of data movement. That's
> the whole point!
I think of copy a little differently. When you do a normal write
command, the host provides the controller with a vector of sources and
lengths. A copy command is like a write command, but the sources are
just logical block addresses instead of memory addresses.
Whatever solution happens, it would be a real shame if it didn't allow
vectored LBAs. The token-based source bio doesn't seem to extend to
that.
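As an illustration of the vectored-source idea, here is a sketch of what
such a copy descriptor could look like (the layout is invented for this
example; it is only similar in spirit to the NVMe Copy command's source
range entries, not an on-the-wire format):

#include <stdint.h>
#include <stdio.h>

struct copy_source_range {
	uint64_t slba;		/* starting source LBA */
	uint32_t nlb;		/* number of logical blocks */
};

struct copy_cmd {
	uint64_t dst_slba;	/* where the gathered data lands */
	unsigned int nr_ranges;
	struct copy_source_range ranges[8];
};

int main(void)
{
	struct copy_cmd cmd = {
		.dst_slba = 1 << 20,
		.nr_ranges = 3,
		.ranges = {
			{ .slba = 0,     .nlb = 8  },
			{ .slba = 4096,  .nlb = 16 },
			{ .slba = 65536, .nlb = 8  },
		},
	};
	uint64_t dst = cmd.dst_slba;

	for (unsigned int i = 0; i < cmd.nr_ranges; i++) {
		printf("copy %u blocks from LBA %llu to LBA %llu\n",
		       (unsigned)cmd.ranges[i].nlb,
		       (unsigned long long)cmd.ranges[i].slba,
		       (unsigned long long)dst);
		dst += cmd.ranges[i].nlb;
	}
	return 0;
}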
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-28 15:21 ` Keith Busch
@ 2024-11-28 16:40 ` Christoph Hellwig
0 siblings, 0 replies; 106+ messages in thread
From: Christoph Hellwig @ 2024-11-28 16:40 UTC (permalink / raw)
To: Keith Busch
Cc: Martin K. Petersen, Bart Van Assche, Nitesh Shetty,
Javier Gonzalez, Matthew Wilcox, Christoph Hellwig, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On Thu, Nov 28, 2024 at 08:21:16AM -0700, Keith Busch wrote:
> I think of copy a little differently. When you do a normal write
> command, the host provides the controller a vector of sources and
> lengths. A copy command is like a write command, but the sources are
> just logical block addresses instead of memory addresses.
>
> Whatever solution happens, it would be a real shame if it doesn't allow
> vectored LBAs. The token based source bio doesn't seem to extend to
> that.
POPULATE TOKEN as defined by SCSI/SBC takes a list of LBA ranges as
well.
^ permalink raw reply [flat|nested] 106+ messages in thread
[parent not found: <CGME20241205081138epcas5p2a47090e70c3cf19e562f63cd9fc495d1@epcas5p2.samsung.com>]
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
[not found] ` <CGME20241205081138epcas5p2a47090e70c3cf19e562f63cd9fc495d1@epcas5p2.samsung.com>
@ 2024-12-05 8:03 ` Nitesh Shetty
2024-12-05 20:37 ` Martin K. Petersen
2024-12-09 22:13 ` Bart Van Assche
0 siblings, 2 replies; 106+ messages in thread
From: Nitesh Shetty @ 2024-12-05 8:03 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Bart Van Assche, Javier Gonzalez, Matthew Wilcox, Keith Busch,
Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 27/11/24 03:14PM, Martin K. Petersen wrote:
>
>Bart,
>
>> Submitting a copy operation as two bios or two requests means that
>> there is a risk that one of the two operations never reaches the block
>> driver at the bottom of the storage stack and hence that a deadlock
>> occurs. I prefer not to introduce any mechanisms that can cause a
>> deadlock.
>
>How do you copy a block range without offload? You perform a READ to
>read the data into memory. And once the READ completes, you do a WRITE
>of the data to the new location.
>
>Token-based copy offload works exactly the same way. You do a POPULATE
>TOKEN which is identical to a READ except you get a cookie instead of
>the actual data. And then once you have the cookie, you perform a WRITE
>USING TOKEN to perform the write operation. Semantically, it's exactly
>the same as a normal copy except for the lack of data movement. That's
>the whole point!
>
Martin,
This approach looks simpler to me as well.
But where do we store the read sector info before sending the write?
I see 2 approaches here:
1. Should it be part of a payload along with the write?
We did something similar in a previous series, which was not liked
by Christoph and Bart.
2. Or should the driver store it as part of an internal list inside a
namespace/ctrl data structure?
As Bart pointed out, here we might need to send one more fail
request later if the copy_write fails to land in the same driver.
Thanks,
Nitesh Shetty
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-05 8:03 ` Nitesh Shetty
@ 2024-12-05 20:37 ` Martin K. Petersen
2024-12-10 0:58 ` Bart Van Assche
2024-12-10 7:12 ` Christoph Hellwig
2024-12-09 22:13 ` Bart Van Assche
1 sibling, 2 replies; 106+ messages in thread
From: Martin K. Petersen @ 2024-12-05 20:37 UTC (permalink / raw)
To: Nitesh Shetty
Cc: Martin K. Petersen, Bart Van Assche, Javier Gonzalez,
Matthew Wilcox, Keith Busch, Christoph Hellwig, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
Nitesh,
> This approach looks simpler to me as well.
> But where do we store the read sector info before sending write.
> I see 2 approaches here,
> 1. Should it be part of a payload along with write ? We did something
> similar in previous series which was not liked by Christoph and Bart.
> 2. Or driver should store it as part of an internal list inside
> namespace/ctrl data structure ? As Bart pointed out, here we might
> need to send one more fail request later if copy_write fails to land
> in same driver.
The problem with option 2 is that when you're doing copy between two
different LUNs, then you suddenly have to maintain state in one kernel
object about stuff relating to another kernel object. I think that is
messy. Seems unnecessarily complex.
With option 1, for single command offload, there is no payload to worry
about. Only command completion status matters for the COPY_IN phase. And
once you have completion, you can issue a COPY_OUT. Done.
For token-based offload, I really don't understand the objection to
storing the cookie in the bio. I fail to see the benefit of storing the
cookie in the driver and then having the bio refer to something else which
maps to the actual cookie returned by the storage. Again, that introduces
object lifetime complexity. It's much simpler to just have the cookie be
part of the very command that is being executed. Once the COPY_IN
completes, you can either use the cookie or throw it away. It doesn't
matter; the device will time it out if you sit on it too long. And there
is zero state in the kernel outside of the memory for the cookie, which
you, as the submitter, are responsible for deallocating.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-05 20:37 ` Martin K. Petersen
@ 2024-12-10 0:58 ` Bart Van Assche
2024-12-10 2:20 ` Martin K. Petersen
2024-12-10 7:12 ` Christoph Hellwig
1 sibling, 1 reply; 106+ messages in thread
From: Bart Van Assche @ 2024-12-10 0:58 UTC (permalink / raw)
To: Martin K. Petersen, Nitesh Shetty
Cc: Javier Gonzalez, Matthew Wilcox, Keith Busch, Christoph Hellwig,
Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 12/5/24 12:37 PM, Martin K. Petersen wrote:
> For token based offload, I really don't understand the objection to
> storing the cookie in the bio. I fail to see the benefit of storing the
> cookie in the driver and then have the bio refer to something else which
> maps to the actual cookie returned by the storage.
Does "cookie" refer to the SCSI ROD token? Storing the ROD token in the
REQ_OP_COPY_DST bio implies that the REQ_OP_COPY_DST bio is only
submitted after the REQ_OP_COPY_SRC bio has completed. NVMe users may
prefer that REQ_OP_COPY_SRC and REQ_OP_COPY_DST bios are submitted
simultaneously.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-10 0:58 ` Bart Van Assche
@ 2024-12-10 2:20 ` Martin K. Petersen
2024-12-10 9:53 ` Nitesh Shetty
2024-12-10 19:41 ` Bart Van Assche
0 siblings, 2 replies; 106+ messages in thread
From: Martin K. Petersen @ 2024-12-10 2:20 UTC (permalink / raw)
To: Bart Van Assche
Cc: Martin K. Petersen, Nitesh Shetty, Javier Gonzalez,
Matthew Wilcox, Keith Busch, Christoph Hellwig, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
Bart,
> Does "cookie" refer to the SCSI ROD token? Storing the ROD token in
> the REQ_OP_COPY_DST bio implies that the REQ_OP_COPY_DST bio is only
> submitted after the REQ_OP_COPY_SRC bio has completed.
Obviously. You can't issue a WRITE USING TOKEN until you have the token.
> NVMe users may prefer that REQ_OP_COPY_SRC and REQ_OP_COPY_DST bios
> are submitted simultaneously.
What would be the benefit of submitting these operations concurrently?
As I have explained, it adds substantial complexity and object lifetime
issues throughout the stack. To what end?
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-10 2:20 ` Martin K. Petersen
@ 2024-12-10 9:53 ` Nitesh Shetty
2024-12-10 21:58 ` Bart Van Assche
2024-12-10 19:41 ` Bart Van Assche
1 sibling, 1 reply; 106+ messages in thread
From: Nitesh Shetty @ 2024-12-10 9:53 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Bart Van Assche, Javier Gonzalez, Matthew Wilcox, Keith Busch,
Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 09/12/24 09:20PM, Martin K. Petersen wrote:
>
>Bart,
>
>> Does "cookie" refer to the SCSI ROD token? Storing the ROD token in
>> the REQ_OP_COPY_DST bio implies that the REQ_OP_COPY_DST bio is only
>> submitted after the REQ_OP_COPY_SRC bio has completed.
>
>Obviously. You can't issue a WRITE USING TOKEN until you have the token.
>
>> NVMe users may prefer that REQ_OP_COPY_SRC and REQ_OP_COPY_DST bios
>> are submitted simultaneously.
>
>What would be the benefit of submitting these operations concurrently?
>As I have explained, it adds substantial complexity and object lifetime
>issues throughout the stack. To what end?
>
>--
Bart,
We did implement a payload-based approach in the past [1], which aligns
with this. Since we wait until REQ_OP_COPY_SRC completes, there won't
be issues with asynchronous dm I/Os.
Since this is internal kernel plumbing, we can optimize or change
the approach moving forward.
If you are okay with the approach, I can respin that version.
Thanks,
Nitesh Shetty
[1]
https://lore.kernel.org/linux-block/[email protected]/T/#mecd04c060cd4285a4b036ca79cc58713308771fe
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-10 9:53 ` Nitesh Shetty
@ 2024-12-10 21:58 ` Bart Van Assche
0 siblings, 0 replies; 106+ messages in thread
From: Bart Van Assche @ 2024-12-10 21:58 UTC (permalink / raw)
To: Nitesh Shetty, Martin K. Petersen
Cc: Javier Gonzalez, Matthew Wilcox, Keith Busch, Christoph Hellwig,
Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 12/10/24 1:53 AM, Nitesh Shetty wrote:
> We did implement a payload-based approach in the past [1], which aligns
> with this. Since we wait until REQ_OP_COPY_SRC completes, there won't
> be issues with asynchronous dm I/Os.
> Since this is internal kernel plumbing, we can optimize or change
> the approach moving forward.
> If you are okay with the approach, I can respin that version.
Yes, I remember this. Let's hold off on respinning/reposting until there
is agreement on the approach for copy offloading.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-10 2:20 ` Martin K. Petersen
2024-12-10 9:53 ` Nitesh Shetty
@ 2024-12-10 19:41 ` Bart Van Assche
2024-12-11 9:36 ` Nitesh Shetty
2024-12-11 19:38 ` Martin K. Petersen
1 sibling, 2 replies; 106+ messages in thread
From: Bart Van Assche @ 2024-12-10 19:41 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Nitesh Shetty, Javier Gonzalez, Matthew Wilcox, Keith Busch,
Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 12/9/24 6:20 PM, Martin K. Petersen wrote:
> What would be the benefit of submitting these operations concurrently?
I expect that submitting the two copy operations concurrently would
result in lower latency for NVMe devices because the REQ_OP_COPY_DST
operation can be submitted without waiting for the REQ_OP_COPY_SRC
result.
> As I have explained, it adds substantial complexity and object lifetime
> issues throughout the stack. To what end?
I think the approach of embedding the ROD token in the bio payload would
add complexity in the block layer. The token-based copy offload approach
involves submitting at least the following commands to the SCSI device:
* POPULATE TOKEN with a list identifier and source data ranges as
parameters to send the source data ranges to the device.
* RECEIVE ROD TOKEN INFORMATION with a list identifier as parameter to
receive the ROD token.
* WRITE USING TOKEN with the ROD token and the destination ranges as
parameters to tell the device to start the copy operation.
If the block layer would have to manage the ROD token, how would the ROD
token be provided to the block layer? Bidirectional commands have been
removed from the Linux kernel a while ago so the REQ_OP_COPY_IN
parameter data would have to be used to pass parameters to the SCSI
driver and also to pass the ROD token back to the block layer. A
possible approach is to let the SCSI core allocate memory for the ROD
token with kmalloc and to pass that pointer back to the block layer
by writing that pointer into the REQ_OP_COPY_IN parameter data. While
this can be implemented, I'm not sure that we should integrate support
in the block layer for managing ROD tokens since ROD tokens are a
concept that is specific to the SCSI protocol.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-10 19:41 ` Bart Van Assche
@ 2024-12-11 9:36 ` Nitesh Shetty
2024-12-11 17:27 ` Bart Van Assche
2024-12-11 19:38 ` Martin K. Petersen
1 sibling, 1 reply; 106+ messages in thread
From: Nitesh Shetty @ 2024-12-11 9:36 UTC (permalink / raw)
To: Bart Van Assche
Cc: Martin K. Petersen, Javier Gonzalez, Matthew Wilcox, Keith Busch,
Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 10/12/24 11:41AM, Bart Van Assche wrote:
>On 12/9/24 6:20 PM, Martin K. Petersen wrote:
>>What would be the benefit of submitting these operations concurrently?
>
>I expect that submitting the two copy operations concurrently would
>result in lower latency for NVMe devices because the REQ_OP_COPY_DST
>operation can be submitted without waiting for the REQ_OP_COPY_SRC
>result.
>
>>As I have explained, it adds substantial complexity and object lifetime
>>issues throughout the stack. To what end?
>
>I think the approach of embedding the ROD token in the bio payload would
>add complexity in the block layer. The token-based copy offload approach
>involves submitting at least the following commands to the SCSI device:
>* POPULATE TOKEN with a list identifier and source data ranges as
> parameters to send the source data ranges to the device.
>* RECEIVE ROD TOKEN INFORMATION with a list identifier as parameter to
> receive the ROD token.
>* WRITE USING TOKEN with the ROD token and the destination ranges as
> parameters to tell the device to start the copy operation.
>
>If the block layer would have to manage the ROD token, how would the ROD
>token be provided to the block layer? Bidirectional commands have been
>removed from the Linux kernel a while ago so the REQ_OP_COPY_IN
>parameter data would have to be used to pass parameters to the SCSI
>driver and also to pass the ROD token back to the block layer. A
>possible approach is to let the SCSI core allocate memory for the ROD
>token with kmalloc and to pass that pointer back to the block layer
>by writing that pointer into the REQ_OP_COPY_IN parameter data. While
>this can be implemented, I'm not sure that we should integrate support
>in the block layer for managing ROD tokens since ROD tokens are a
>concept that is specific to the SCSI protocol.
>
The block layer can allocate a buffer and send it as part of the copy
operation.
The driver can store the token/custom info inside the buffer sent along with
REQ_OP_COPY_SRC and expect the block layer to send this info/buffer back
with REQ_OP_COPY_DST.
This reduces the effort needed in the block layer to manage the lifetime issues.
Is there any reason why we can't store the info inside this buffer in the
driver?
This scheme requires sequential submission of the SRC and DST
bios. That might increase latency, but it allows a simpler design.
The main use case for copy is GC, which is mostly a background operation.
--
Nitesh Shetty
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-11 9:36 ` Nitesh Shetty
@ 2024-12-11 17:27 ` Bart Van Assche
0 siblings, 0 replies; 106+ messages in thread
From: Bart Van Assche @ 2024-12-11 17:27 UTC (permalink / raw)
To: Nitesh Shetty
Cc: Martin K. Petersen, Javier Gonzalez, Matthew Wilcox, Keith Busch,
Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 12/11/24 1:36 AM, Nitesh Shetty wrote:
> The block layer can allocate a buffer and send it as part of the copy
> operation.
The block layer can only do that if it knows how large the buffer should
be. Or in other words, if knowledge of a SCSI buffer size is embedded in
the block layer. That doesn't sound ideal to me.
> This scheme requires sequential submission of the SRC and DST
> bios. That might increase latency, but it allows a simpler design.
> The main use case for copy is GC, which is mostly a background operation.
I still prefer a single REQ_OP_COPY operation instead of separate
REQ_OP_COPY_SRC and REQ_OP_COPY_DST operations. While this will require
additional work in the SCSI disk (sd) driver (implementation of a state
machine), it keeps the block layer from having to know any details of the
SCSI copy offloading approach. Even if copy offloading
would be implemented as two operations (REQ_OP_COPY_SRC and
REQ_OP_COPY_DST), a state machine is required anyway in the SCSI disk
driver because REQ_OP_COPY_SRC would have to be translated into two SCSI
commands (POPULATE TOKEN + RECEIVE ROD TOKEN INFORMATION).
Thanks,
Bart.
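To make the sd-side state machine mentioned above concrete, here is a hedged
sketch of the state progression for the token-based scheme; nothing like this
exists in sd.c today, and the actual command submission and response parsing
are elided:

#include <linux/types.h>

/*
 * Sketch only, not sd.c code: one copy request expands into the three SCSI
 * commands of the token-based scheme.
 */
enum sd_copy_state {
	SD_COPY_POPULATE_TOKEN,		/* send the source ranges */
	SD_COPY_RECEIVE_TOKEN,		/* RECEIVE ROD TOKEN INFORMATION */
	SD_COPY_WRITE_USING_TOKEN,	/* token + destination ranges */
	SD_COPY_DONE,
};

static enum sd_copy_state sd_copy_next_state(enum sd_copy_state state,
					     bool cmd_ok)
{
	if (!cmd_ok)
		return SD_COPY_DONE;	/* fail the copy; the caller may emulate */

	switch (state) {
	case SD_COPY_POPULATE_TOKEN:
		return SD_COPY_RECEIVE_TOKEN;
	case SD_COPY_RECEIVE_TOKEN:
		return SD_COPY_WRITE_USING_TOKEN;
	default:
		return SD_COPY_DONE;
	}
}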
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-10 19:41 ` Bart Van Assche
2024-12-11 9:36 ` Nitesh Shetty
@ 2024-12-11 19:38 ` Martin K. Petersen
1 sibling, 0 replies; 106+ messages in thread
From: Martin K. Petersen @ 2024-12-11 19:38 UTC (permalink / raw)
To: Bart Van Assche
Cc: Martin K. Petersen, Nitesh Shetty, Javier Gonzalez,
Matthew Wilcox, Keith Busch, Christoph Hellwig, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
Bart,
>> What would be the benefit of submitting these operations concurrently?
>
> I expect that submitting the two copy operations concurrently would
> result in lower latency for NVMe devices because the REQ_OP_COPY_DST
> operation can be submitted without waiting for the REQ_OP_COPY_SRC
> result.
Perhaps you are engaging in premature optimization?
> If the block layer would have to manage the ROD token, how would the
> ROD token be provided to the block layer?
In the data buffer described by the bio, of course. Just like the data
buffer when we do a READ. The only difference here is that the data is
compressed to a fixed size and thus only 512 bytes long regardless of
the number of logical blocks described by the operation.
> Bidirectional commands have been removed from the Linux kernel a while
> ago so the REQ_OP_COPY_IN parameter data would have to be used to pass
> parameters to the SCSI driver and also to pass the ROD token back to
> the block layer.
A normal READ operation also passes parameters to the SCSI driver. These
are the start LBA and the transfer length. That does not make it a
bidirectional command.
> While this can be implemented, I'm not sure that we should integrate
> support in the block layer for managing ROD tokens since ROD tokens
> are a concept that is specific to the SCSI protocol.
A well-known commercial operating system supports copy offload via the
token-based approach. I don't see any reason why our implementation
should exclude a wide variety of devices in the industry supported by
that platform. And obviously, given that this other operating system
uses a token-based implementation in their stack, one could perhaps
envision this capability appearing in other protocols in the future?
In any case. I only have two horses in this race:
1. Make sure that our user API and block layer implementation are
flexible enough to accommodate current and future offload
specifications.
2. Make sure our implementation is as simple as possible.
Splitting the block layer implementation into a semantic read followed
by a semantic write permits token-based offload to be supported. It also
makes the implementation simple because there is no concurrency element.
The only state is owned by the entity which issues the bio. No lookups,
no timeouts, no allocating things in sd.c and hoping that somebody
remembers to free them later despite the disk suddenly going away.
Even if we were to not support the token-based approach and only do
single-command offload, I still think the two-phase operation makes
things simpler and more elegant.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-05 20:37 ` Martin K. Petersen
2024-12-10 0:58 ` Bart Van Assche
@ 2024-12-10 7:12 ` Christoph Hellwig
2024-12-10 8:05 ` Johannes Thumshirn
2024-12-11 19:41 ` Martin K. Petersen
1 sibling, 2 replies; 106+ messages in thread
From: Christoph Hellwig @ 2024-12-10 7:12 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Nitesh Shetty, Bart Van Assche, Javier Gonzalez, Matthew Wilcox,
Keith Busch, Christoph Hellwig, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On Thu, Dec 05, 2024 at 03:37:25PM -0500, Martin K. Petersen wrote:
> The problem with option 2 is that when you're doing copy between two
> different LUNs, then you suddenly have to maintain state in one kernel
> object about stuff relating to another kernel object. I think that is
> messy. Seems unnecessarily complex.
Generally agreeing with all you said, but do we actually have any
serious use case for cross-LU copies? They just seem incredibly
complex and not all that useful.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-10 7:12 ` Christoph Hellwig
@ 2024-12-10 8:05 ` Johannes Thumshirn
2024-12-10 10:58 ` hch
2024-12-11 19:41 ` Martin K. Petersen
1 sibling, 1 reply; 106+ messages in thread
From: Johannes Thumshirn @ 2024-12-10 8:05 UTC (permalink / raw)
To: hch, Martin K. Petersen
Cc: Nitesh Shetty, Bart Van Assche, Javier Gonzalez, Matthew Wilcox,
Keith Busch, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 10.12.24 08:13, Christoph Hellwig wrote:
> On Thu, Dec 05, 2024 at 03:37:25PM -0500, Martin K. Petersen wrote:
>> The problem with option 2 is that when you're doing copy between two
>> different LUNs, then you suddenly have to maintain state in one kernel
>> object about stuff relating to another kernel object. I think that is
>> messy. Seems unnecessarily complex.
>
> Generally agreeing with all you said, but do we actually have any
> serious use case for cross-LU copies? They just seem incredibly
> complex and not all that useful.
One use case I can think of is (again) btrfs balance (GC, convert, etc)
on a multi drive filesystem. BUT this use case is something that can
just use the fallback read-write path as it is doing now.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-10 8:05 ` Johannes Thumshirn
@ 2024-12-10 10:58 ` hch
2024-12-10 19:21 ` Bart Van Assche
0 siblings, 1 reply; 106+ messages in thread
From: hch @ 2024-12-10 10:58 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: hch, Martin K. Petersen, Nitesh Shetty, Bart Van Assche,
Javier Gonzalez, Matthew Wilcox, Keith Busch, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On Tue, Dec 10, 2024 at 08:05:31AM +0000, Johannes Thumshirn wrote:
> > Generally agreeing with all you said, but do we actually have any
> > serious use case for cross-LU copies? They just seem incredibly
> > complex and not all that useful.
>
> One use case I can think of is (again) btrfs balance (GC, convert, etc)
> on a multi drive filesystem. BUT this use case is something that can
> just use the fallback read-write path as it is doing now.
Who uses multi-device file systems on multiple LUs of the same SCSI
target or multiple namespaces on the same nvme subsystem?
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-10 10:58 ` hch
@ 2024-12-10 19:21 ` Bart Van Assche
2024-12-11 4:07 ` Damien Le Moal
0 siblings, 1 reply; 106+ messages in thread
From: Bart Van Assche @ 2024-12-10 19:21 UTC (permalink / raw)
To: hch, Johannes Thumshirn
Cc: Martin K. Petersen, Nitesh Shetty, Javier Gonzalez,
Matthew Wilcox, Keith Busch, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On 12/10/24 2:58 AM, hch wrote:
> On Tue, Dec 10, 2024 at 08:05:31AM +0000, Johannes Thumshirn wrote:
>>> Generally agreeing with all you said, but do we actually have any
>>> serious use case for cross-LU copies? They just seem incredibly
>>> complex and not all that useful.
>>
>> One use case I can think of is (again) btrfs balance (GC, convert, etc)
>> on a multi drive filesystem. BUT this use case is something that can
>> just use the fallback read-write path as it is doing now.
>
> Who uses multi-device file systems on multiple LUs of the same SCSI
> target or multiple namespaces on the same nvme subsystem?
On Android systems F2FS combines a small conventional logical unit and a
large zoned logical unit into a single filesystem. This use case will
benefit from copy offloading between different logical units on the same
SCSI device. While there may be disagreement about how desirable this
setup is from a technical point of view, there is a real use case today
for offloading data copying between different logical units.
Bart.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-10 19:21 ` Bart Van Assche
@ 2024-12-11 4:07 ` Damien Le Moal
2024-12-11 21:06 ` Bart Van Assche
0 siblings, 1 reply; 106+ messages in thread
From: Damien Le Moal @ 2024-12-11 4:07 UTC (permalink / raw)
To: Bart Van Assche, hch, Johannes Thumshirn
Cc: Martin K. Petersen, Nitesh Shetty, Javier Gonzalez,
Matthew Wilcox, Keith Busch, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On 12/11/24 4:21 AM, Bart Van Assche wrote:
> On 12/10/24 2:58 AM, hch wrote:
>> On Tue, Dec 10, 2024 at 08:05:31AM +0000, Johannes Thumshirn wrote:
>>>> Generally agreeing with all you said, but do we actually have any
>>>> serious use case for cross-LU copies? They just seem incredibly
>>>> complex and not all that useful.
>>>
>>> One use case I can think of is (again) btrfs balance (GC, convert, etc)
>>> on a multi drive filesystem. BUT this use case is something that can
>>> just use the fallback read-write path as it is doing now.
>>
>> Who uses multi-device file systems on multiple LUs of the same SCSI
>> target or multiple namespaces on the same nvme subsystem?
>
> On Android systems F2FS combines a small conventional logical unit and a
> large zoned logical unit into a single filesystem. This use case will
> benefit from copy offloading between different logical units on the same
> SCSI device. While there may be disagreement about how desirable this
> setup is from a technical point of view, there is a real use case today
> for offloading data copying between different logical units.
But for F2FS, the conventional unit is used for metadata and the other zoned LU
for data. How come copying from one to the other can be useful ?
>
> Bart.
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-11 4:07 ` Damien Le Moal
@ 2024-12-11 21:06 ` Bart Van Assche
2024-12-11 21:21 ` Jaegeuk Kim
0 siblings, 1 reply; 106+ messages in thread
From: Bart Van Assche @ 2024-12-11 21:06 UTC (permalink / raw)
To: Damien Le Moal, hch, Johannes Thumshirn
Cc: Martin K. Petersen, Nitesh Shetty, Javier Gonzalez,
Matthew Wilcox, Keith Busch, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected], Jaegeuk Kim
On 12/10/24 8:07 PM, Damien Le Moal wrote:
> But for F2FS, the conventional unit is used for metadata and the other zoned LU
> for data. How come copying from one to the other can be useful ?
Hi Damien,
What you wrote is correct in general. If a conventional and zoned LU are
combined, data is only written to the conventional LU once the zoned LU
is full. The data on the conventional LU may be migrated to the zoned LU
during garbage collection. This is why copying from the conventional LU
to the zoned LU is useful.
Jaegeuk, please correct me if I got this wrong.
Bart.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-11 21:06 ` Bart Van Assche
@ 2024-12-11 21:21 ` Jaegeuk Kim
0 siblings, 0 replies; 106+ messages in thread
From: Jaegeuk Kim @ 2024-12-11 21:21 UTC (permalink / raw)
To: Bart Van Assche
Cc: Damien Le Moal, hch, Johannes Thumshirn, Martin K. Petersen,
Nitesh Shetty, Javier Gonzalez, Matthew Wilcox, Keith Busch,
Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 12/11, Bart Van Assche wrote:
> On 12/10/24 8:07 PM, Damien Le Moal wrote:
> > But for F2FS, the conventional unit is used for metadata and the other zoned LU
> > for data. How come copying from one to the other can be useful ?
>
> Hi Damien,
>
> What you wrote is correct in general. If a conventional and zoned LU are
> combined, data is only written to the conventional LU once the zoned LU
> is full. The data on the conventional LU may be migrated to the zoned LU
> during garbage collection. This is why copying from the conventional LU
> to the zoned LU is useful.
>
> Jaegeuk, please correct me if I got this wrong.
Bart is correct. It doesn't make sense to force the conventional LU to be used
for metadata only, so we expose the remaining conventional LU space to the user
as well. In order to do that, we exploited the multi-partition support in F2FS
by simply aligning the section to the zoned partition offset.
>
> Bart.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-10 7:12 ` Christoph Hellwig
2024-12-10 8:05 ` Johannes Thumshirn
@ 2024-12-11 19:41 ` Martin K. Petersen
1 sibling, 0 replies; 106+ messages in thread
From: Martin K. Petersen @ 2024-12-11 19:41 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Martin K. Petersen, Nitesh Shetty, Bart Van Assche,
Javier Gonzalez, Matthew Wilcox, Keith Busch, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
Christoph,
> Generally agreeing with all you said, but do we actually have any
> serious use case for cross-LU copies? They just seem incredibly
> complex and not all that useful.
It's still widely used to populate a new LUN from a golden image.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-05 8:03 ` Nitesh Shetty
2024-12-05 20:37 ` Martin K. Petersen
@ 2024-12-09 22:13 ` Bart Van Assche
2024-12-09 23:13 ` Damien Le Moal
2024-12-09 23:31 ` Matthew Wilcox
1 sibling, 2 replies; 106+ messages in thread
From: Bart Van Assche @ 2024-12-09 22:13 UTC (permalink / raw)
To: Nitesh Shetty, Martin K. Petersen
Cc: Javier Gonzalez, Matthew Wilcox, Keith Busch, Christoph Hellwig,
Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 12/5/24 12:03 AM, Nitesh Shetty wrote:
> But where do we store the read sector info before sending write.
> I see 2 approaches here,
> 1. Should it be part of a payload along with write ?
> We did something similar in previous series which was not liked
> by Christoph and Bart.
> 2. Or driver should store it as part of an internal list inside
> namespace/ctrl data structure ?
> As Bart pointed out, here we might need to send one more fail
> request later if copy_write fails to land in same driver.
Hi Nitesh,
Consider the following example: dm-linear is used to concatenate two
block devices. An NVMe device (LBA 0..999) and a SCSI device (LBA
1000..1999). Suppose that a copy operation is submitted to the dm-linear
device to copy LBAs 1..998 to LBAs 2..1998. If the copy operation is
submitted as two separate operations (REQ_OP_COPY_SRC and
REQ_OP_COPY_DST) then the NVMe device will receive the REQ_OP_COPY_SRC
operation and the SCSI device will receive the REQ_OP_COPY_DST
operation. The NVMe and SCSI device drivers should fail the copy
operations after a timeout because they only received half of the copy
operation. After the timeout the block layer core can switch from
offloading to emulating a copy operation. Waiting for a timeout is
necessary because requests may be reordered.
I think this is a strong argument in favor of representing copy
operations as a single operation. This will allow stacking drivers
such as dm-linear to deal in an elegant way with copy offload requests
where source and destination LBA ranges map onto different block
devices and potentially different block drivers.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-09 22:13 ` Bart Van Assche
@ 2024-12-09 23:13 ` Damien Le Moal
2024-12-09 23:31 ` Matthew Wilcox
1 sibling, 0 replies; 106+ messages in thread
From: Damien Le Moal @ 2024-12-09 23:13 UTC (permalink / raw)
To: Bart Van Assche, Nitesh Shetty, Martin K. Petersen
Cc: Javier Gonzalez, Matthew Wilcox, Keith Busch, Christoph Hellwig,
Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 12/10/24 07:13, Bart Van Assche wrote:
> On 12/5/24 12:03 AM, Nitesh Shetty wrote:
>> But where do we store the read sector info before sending write.
>> I see 2 approaches here,
>> 1. Should it be part of a payload along with write ?
>> We did something similar in previous series which was not liked
>> by Christoph and Bart.
>> 2. Or driver should store it as part of an internal list inside
>> namespace/ctrl data structure ?
>> As Bart pointed out, here we might need to send one more fail
>> request later if copy_write fails to land in same driver.
>
> Hi Nitesh,
>
> Consider the following example: dm-linear is used to concatenate two
> block devices. An NVMe device (LBA 0..999) and a SCSI device (LBA
> 1000..1999). Suppose that a copy operation is submitted to the dm-linear
> device to copy LBAs 1..998 to LBAs 2..1998. If the copy operation is
> submitted as two separate operations (REQ_OP_COPY_SRC and
> REQ_OP_COPY_DST) then the NVMe device will receive the REQ_OP_COPY_SRC
> operation and the SCSI device will receive the REQ_OP_COPY_DST
> operation. The NVMe and SCSI device drivers should fail the copy
> operations after a timeout because they only received half of the copy
> operation. After the timeout the block layer core can switch from
> offloading to emulating a copy operation. Waiting for a timeout is
> necessary because requests may be reordered.
>
> I think this is a strong argument in favor of representing copy
> operations as a single operation. This will allow stacking drivers
> such as dm-linear to deal in an elegant way with copy offload requests
> where source and destination LBA ranges map onto different block
> devices and potentially different block drivers.
Why? As long as REQ_OP_COPY_SRC carries both source and destination
information, DM can trivially detect that the copy is not within a single device
and either return ENOTSUPP or switch to regular read+write operations
using block layer helpers. Or the block layer can fall back to that emulation
itself if it gets an ENOTSUPP from the device.
I am not sure what a REQ_OP_COPY_SRC BIO definition would look like. Ideally, we
want to be able to describe several source LBA ranges with it and, for the above
issue, also have the destination LBA range. If we can do that in a nice
way, I do not see the need for switching back to a single BIO, though we could
do that too, I guess. From what Martin said about scsi token-based copy, it seems
that two operations are easier. Knowing how the scsi stack works, I can see that too.
--
Damien Le Moal
Western Digital Research
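As a rough illustration of the point above, assume (purely for the sketch) a
dm-linear style table and a copy description that carries both ranges; the
structures and names below are invented and not part of any posted series:

#include <linux/blkdev.h>

/*
 * Hypothetical table entries for illustration only. A stacking driver can
 * offload the copy only when both ranges are contained in extents backed by
 * the same underlying device; otherwise it reports not-supported and the
 * caller falls back to read+write emulation.
 */
struct linear_extent {
	struct block_device *bdev;
	sector_t start;			/* first sector covered in the dm device */
	sector_t len;
};

static struct linear_extent *linear_find(struct linear_extent *tbl,
					 unsigned int nr, sector_t sector)
{
	unsigned int i;

	for (i = 0; i < nr; i++)
		if (sector >= tbl[i].start &&
		    sector < tbl[i].start + tbl[i].len)
			return &tbl[i];
	return NULL;
}

static bool copy_is_offloadable(struct linear_extent *tbl, unsigned int nr,
				sector_t src, sector_t dst, sector_t len)
{
	struct linear_extent *s = linear_find(tbl, nr, src);
	struct linear_extent *d = linear_find(tbl, nr, dst);

	return s && d && s->bdev == d->bdev &&
	       src + len <= s->start + s->len &&	/* source fits its extent */
	       dst + len <= d->start + d->len;		/* destination fits too */
}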
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-09 22:13 ` Bart Van Assche
2024-12-09 23:13 ` Damien Le Moal
@ 2024-12-09 23:31 ` Matthew Wilcox
2024-12-10 0:22 ` Bart Van Assche
1 sibling, 1 reply; 106+ messages in thread
From: Matthew Wilcox @ 2024-12-09 23:31 UTC (permalink / raw)
To: Bart Van Assche
Cc: Nitesh Shetty, Martin K. Petersen, Javier Gonzalez, Keith Busch,
Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On Mon, Dec 09, 2024 at 02:13:40PM -0800, Bart Van Assche wrote:
> On 12/5/24 12:03 AM, Nitesh Shetty wrote:
> > But where do we store the read sector info before sending write.
> > I see 2 approaches here,
> > 1. Should it be part of a payload along with write ?
> > We did something similar in previous series which was not liked
> > by Christoph and Bart.
> > 2. Or driver should store it as part of an internal list inside
> > namespace/ctrl data structure ?
> > As Bart pointed out, here we might need to send one more fail
> > request later if copy_write fails to land in same driver.
>
> Hi Nitesh,
>
> Consider the following example: dm-linear is used to concatenate two
> block devices. An NVMe device (LBA 0..999) and a SCSI device (LBA
> 1000..1999). Suppose that a copy operation is submitted to the dm-linear
> device to copy LBAs 1..998 to LBAs 2..1998. If the copy operation is
Sorry, I don't think that's a valid operation -- 1998 - 2 = 1996 and 998
- 1 is 997, so these ranges are of different lengths.
I presume you're trying to construct an operation which is entirely
reading within the first device, and then is going to write across both
devices. So let's say you want to read 1-900 and write to 501-1400.
> submitted as two separate operations (REQ_OP_COPY_SRC and
> REQ_OP_COPY_DST) then the NVMe device will receive the REQ_OP_COPY_SRC
> operation and the SCSI device will receive the REQ_OP_COPY_DST
> operation. The NVMe and SCSI device drivers should fail the copy operations
> after a timeout because they only received half of the copy
> operation.
... no? The SRC operation succeeds, but then the DM driver gets the DST
operation and sees that it crosses the boundary and fails the DST op.
Then the pair of ops can be retried using an in-memory buffer.
I'm not quite clear on the atomicity; whether there could be an initial
copy of 500-900 to 1000-1400 and then a remap of 1-499 to 501-999.
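The retry-with-an-in-memory-buffer fallback mentioned above can be sketched
with existing bio helpers only; this is a deliberately naive version (one
page per round trip, fully synchronous) and not code from any posted series:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/gfp.h>
#include <linux/minmax.h>

/* Emulate a copy by bouncing each chunk through a host memory page. */
static int copy_emulate_sketch(struct block_device *src_bdev, sector_t src,
			       struct block_device *dst_bdev, sector_t dst,
			       unsigned int nr_sects)
{
	struct page *buf = alloc_page(GFP_KERNEL);
	unsigned int per_page = PAGE_SIZE >> SECTOR_SHIFT;
	int ret = 0;

	if (!buf)
		return -ENOMEM;

	while (nr_sects && !ret) {
		unsigned int n = min(nr_sects, per_page);
		struct bio *bio;

		/* Read a chunk of the source into the bounce page... */
		bio = bio_alloc(src_bdev, 1, REQ_OP_READ, GFP_KERNEL);
		bio->bi_iter.bi_sector = src;
		__bio_add_page(bio, buf, n << SECTOR_SHIFT, 0);
		ret = submit_bio_wait(bio);
		bio_put(bio);
		if (ret)
			break;

		/* ...and write it out to the destination. */
		bio = bio_alloc(dst_bdev, 1, REQ_OP_WRITE, GFP_KERNEL);
		bio->bi_iter.bi_sector = dst;
		__bio_add_page(bio, buf, n << SECTOR_SHIFT, 0);
		ret = submit_bio_wait(bio);
		bio_put(bio);

		src += n;
		dst += n;
		nr_sects -= n;
	}

	__free_page(buf);
	return ret;
}

Real emulation would use much larger, pipelined I/Os, but even this form
shows the cost being avoided: every byte crosses the link twice and sits in
host memory in between.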
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-12-09 23:31 ` Matthew Wilcox
@ 2024-12-10 0:22 ` Bart Van Assche
0 siblings, 0 replies; 106+ messages in thread
From: Bart Van Assche @ 2024-12-10 0:22 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Nitesh Shetty, Martin K. Petersen, Javier Gonzalez, Keith Busch,
Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 12/9/24 3:31 PM, Matthew Wilcox wrote:
> On Mon, Dec 09, 2024 at 02:13:40PM -0800, Bart Van Assche wrote:
>> Consider the following example: dm-linear is used to concatenate two
>> block devices. An NVMe device (LBA 0..999) and a SCSI device (LBA
>> 1000..1999). Suppose that a copy operation is submitted to the dm-linear
>> device to copy LBAs 1..998 to LBAs 2..1998. If the copy operation is
>
> Sorry, I don't think that's a valid operation -- 1998 - 2 = 1996 and 998
> - 1 is 997, so these ranges are of different lengths.
Agreed that the ranges should have the same length. I have been
traveling and I'm under jet lag, hence the range length mismatch. I
wanted to construct a copy operation from the first to the second block
device: 1..998 to 1001..1998.
>> submitted as two separate operations (REQ_OP_COPY_SRC and
>> REQ_OP_COPY_DST) then the NVMe device will receive the REQ_OP_COPY_SRC
>> operation and the SCSI device will receive the REQ_OP_COPY_DST
>> operation. The NVMe and SCSI device drivers should fail the copy operations
>> after a timeout because they only received half of the copy
>> operation.
>
> ... no? The SRC operation succeeds, but then the DM driver gets the DST
> operation and sees that it crosses the boundary and fails the DST op.
> Then the pair of ops can be retried using an in-memory buffer.
Since the second range can be mapped onto the second block device, the
dm-linear driver can only fail the REQ_OP_COPY_DST operation if it keeps
track of the source LBA regions of pending copy operations, which would
be unnecessary complexity.
A possible alternative is to specify the source and destination range
information in every REQ_OP_COPY_SRC and in every REQ_OP_COPY_DST
operation (see also Damien's email).
Thanks,
Bart.
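For what it's worth, the alternative sketched in the last paragraph could be
as small as a shared descriptor carried by both halves; the names below are
invented for illustration and are not part of the series:

#include <linux/types.h>

/*
 * Hypothetical descriptor: if both REQ_OP_COPY_SRC and REQ_OP_COPY_DST
 * carried the full description, a stacking driver could judge each half on
 * its own without tracking pending peers.
 */
struct blk_copy_desc {
	unsigned int	nr_ranges;	/* the multi-range case Damien mentions */
	struct {
		sector_t	src_sector;
		sector_t	dst_sector;
		sector_t	nr_sectors;
	} range[];
};

If both bios reference the same descriptor, whichever half reaches a stacking
driver first carries enough information to decide between passing it down and
falling back.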
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-08 17:43 ` Javier Gonzalez
2024-11-08 18:51 ` Bart Van Assche
@ 2024-11-11 6:51 ` Christoph Hellwig
2024-11-11 9:30 ` Javier Gonzalez
1 sibling, 1 reply; 106+ messages in thread
From: Christoph Hellwig @ 2024-11-11 6:51 UTC (permalink / raw)
To: Javier Gonzalez
Cc: Matthew Wilcox, Keith Busch, Christoph Hellwig, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote:
> We have been iterating in the patches for years, but it is unfortunately
> one of these series that go in circles forever. I don't think it is due
to any specific problem, but mostly due to unaligned requests from
> different folks reviewing. Last time I talked to Damien he asked me to
> send the patches again; we have not followed through due to bandwidth.
A big problem is that it actually lacks a killer use case. If you'd
actually manage to plug it into an in-kernel user and show a real
speedup people might actually be interested in it and help optimizing
for it.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-11 6:51 ` Christoph Hellwig
@ 2024-11-11 9:30 ` Javier Gonzalez
2024-11-11 9:37 ` Johannes Thumshirn
0 siblings, 1 reply; 106+ messages in thread
From: Javier Gonzalez @ 2024-11-11 9:30 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Matthew Wilcox, Keith Busch, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 11.11.2024 07:51, Christoph Hellwig wrote:
>On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote:
>> We have been iterating in the patches for years, but it is unfortunately
>> one of these series that go in circles forever. I don't think it is due
>> to any specific problem, but mostly due to unaligned requests from
>> different folks reviewing. Last time I talked to Damien he asked me to
>> send the patches again; we have not followed through due to bandwidth.
>
>A big problem is that it actually lacks a killer use case. If you'd
>actually manage to plug it into an in-kernel user and show a real
>speedup people might actually be interested in it and help optimizing
>for it.
>
Agree. Initially it was all about ZNS. Seems ZUFS can use it.
Then we saw good results in offload to target on NVMe-OF, similar to
copy_file_range, but that does not seem to be enough. You seem to
indicate too that XFS can use it for GC.
We can try putting a new series out to see where we are...
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-11 9:30 ` Javier Gonzalez
@ 2024-11-11 9:37 ` Johannes Thumshirn
2024-11-11 9:41 ` Javier Gonzalez
0 siblings, 1 reply; 106+ messages in thread
From: Johannes Thumshirn @ 2024-11-11 9:37 UTC (permalink / raw)
To: Javier Gonzalez, hch
Cc: Matthew Wilcox, Keith Busch, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 11.11.24 10:31, Javier Gonzalez wrote:
> On 11.11.2024 07:51, Christoph Hellwig wrote:
>> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote:
>>> We have been iterating in the patches for years, but it is unfortunately
>>> one of these series that go in circles forever. I don't think it is due
>>> to any specific problem, but mostly due to unaligned requests from
>>> different folks reviewing. Last time I talked to Damien he asked me to
>>> send the patches again; we have not followed through due to bandwidth.
>>
>> A big problem is that it actually lacks a killer use case. If you'd
>> actually manage to plug it into an in-kernel user and show a real
>> speedup people might actually be interested in it and help optimizing
>> for it.
>>
>
> Agree. Initially it was all about ZNS. Seems ZUFS can use it.
>
> Then we saw good results in offload to target on NVMe-OF, similar to
> copy_file_range, but that does not seem to be enough. You seem to
> indicate too that XFS can use it for GC.
>
> We can try putting a new series out to see where we are...
I don't want to sound like a broken record, but I've said more than
once, that btrfs (regardless of zoned or non-zoned) would be very
interested in that as well and I'd be willing to help with the code or
even do it myself once the block bits are in.
But apparently my voice doesn't count here
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-11 9:37 ` Johannes Thumshirn
@ 2024-11-11 9:41 ` Javier Gonzalez
2024-11-11 9:42 ` hch
2024-11-11 9:43 ` Johannes Thumshirn
0 siblings, 2 replies; 106+ messages in thread
From: Javier Gonzalez @ 2024-11-11 9:41 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: hch, Matthew Wilcox, Keith Busch, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 11.11.2024 09:37, Johannes Thumshirn wrote:
>On 11.11.24 10:31, Javier Gonzalez wrote:
>> On 11.11.2024 07:51, Christoph Hellwig wrote:
>>> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote:
>>>> We have been iterating in the patches for years, but it is unfortunately
>>>> one of these series that go in circles forever. I don't think it is due
>>>> to any specific problem, but mostly due to unaligned requests from
>>>> different folks reviewing. Last time I talked to Damien he asked me to
>>>> send the patches again; we have not followed through due to bandwidth.
>>>
>>> A big problem is that it actually lacks a killer use case. If you'd
>>> actually manage to plug it into an in-kernel user and show a real
>>> speedup people might actually be interested in it and help optimizing
>>> for it.
>>>
>>
>> Agree. Initially it was all about ZNS. Seems ZUFS can use it.
>>
>> Then we saw good results in offload to target on NVMe-OF, similar to
>> copy_file_range, but that does not seem to be enough. You seem to
>> indicate too that XFS can use it for GC.
>>
>> We can try putting a new series out to see where we are...
>
>I don't want to sound like a broken record, but I've said more than
>once, that btrfs (regardless of zoned or non-zoned) would be very
>interested in that as well and I'd be willing to help with the code or
>even do it myself once the block bits are in.
>
>But apparently my voice doesn't count here
You are right. Sorry I forgot.
Would this be through copy_file_range or something different?
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-11 9:41 ` Javier Gonzalez
@ 2024-11-11 9:42 ` hch
2024-11-11 9:43 ` Johannes Thumshirn
1 sibling, 0 replies; 106+ messages in thread
From: hch @ 2024-11-11 9:42 UTC (permalink / raw)
To: Javier Gonzalez
Cc: Johannes Thumshirn, hch, Matthew Wilcox, Keith Busch, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On Mon, Nov 11, 2024 at 10:41:33AM +0100, Javier Gonzalez wrote:
> You are right. Sorry I forgot.
>
> Would this be through copy_file_range or something different?
Just like for f2fs, nilfs2, or the upcoming zoned xfs the prime user
would be the file system GC code.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-11 9:41 ` Javier Gonzalez
2024-11-11 9:42 ` hch
@ 2024-11-11 9:43 ` Johannes Thumshirn
2024-11-11 10:37 ` Javier Gonzalez
1 sibling, 1 reply; 106+ messages in thread
From: Johannes Thumshirn @ 2024-11-11 9:43 UTC (permalink / raw)
To: Javier Gonzalez
Cc: hch, Matthew Wilcox, Keith Busch, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 11.11.24 10:41, Javier Gonzalez wrote:
> On 11.11.2024 09:37, Johannes Thumshirn wrote:
>> On 11.11.24 10:31, Javier Gonzalez wrote:
>>> On 11.11.2024 07:51, Christoph Hellwig wrote:
>>>> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote:
>>>>> We have been iterating in the patches for years, but it is unfortunately
>>>>> one of these series that go in circles forever. I don't think it is due
>>>>> to any specific problem, but mostly due to unaligned requests from
>>>>> different folks reviewing. Last time I talked to Damien he asked me to
>>>>> send the patches again; we have not followed through due to bandwidth.
>>>>
>>>> A big problem is that it actually lacks a killer use case. If you'd
>>>> actually manage to plug it into an in-kernel user and show a real
>>>> speedup people might actually be interested in it and help optimizing
>>>> for it.
>>>>
>>>
>>> Agree. Initially it was all about ZNS. Seems ZUFS can use it.
>>>
>>> Then we saw good results in offload to target on NVMe-OF, similar to
>>> copy_file_range, but that does not seem to be enough. You seem to
>>> indicate too that XFS can use it for GC.
>>>
>>> We can try putting a new series out to see where we are...
>>
>> I don't want to sound like a broken record, but I've said more than
>> once, that btrfs (regardless of zoned or non-zoned) would be very
>> interested in that as well and I'd be willing to help with the code or
>> even do it myself once the block bits are in.
>>
>> But apparently my voice doesn't count here
>
> You are right. Sorry I forgot.
>
> Would this be through copy_file_range or something different?
>
Unfortunately not, btrfs' reclaim/balance path is a wrapper on top of
buffered read and write (plus some extra things). _BUT_ this makes it
possible to switch the read/write part and do copy offload (where possible).
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-11 9:43 ` Johannes Thumshirn
@ 2024-11-11 10:37 ` Javier Gonzalez
0 siblings, 0 replies; 106+ messages in thread
From: Javier Gonzalez @ 2024-11-11 10:37 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: hch, Matthew Wilcox, Keith Busch, Keith Busch,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 11.11.2024 09:43, Johannes Thumshirn wrote:
>On 11.11.24 10:41, Javier Gonzalez wrote:
>> On 11.11.2024 09:37, Johannes Thumshirn wrote:
>>> On 11.11.24 10:31, Javier Gonzalez wrote:
>>>> On 11.11.2024 07:51, Christoph Hellwig wrote:
>>>>> On Fri, Nov 08, 2024 at 05:43:44PM +0000, Javier Gonzalez wrote:
>>>>>> We have been iterating in the patches for years, but it is unfortunately
>>>>>> one of these series that go in circles forever. I don't think it is due
>>>>>> to any specific problem, but mostly due to unaligned requests from
>>>>>> different folks reviewing. Last time I talked to Damien he asked me to
>>>>>> send the patches again; we have not followed through due to bandwidth.
>>>>>
>>>>> A big problem is that it actually lacks a killer use case. If you'd
>>>>> actually manage to plug it into an in-kernel user and show a real
>>>>> speedup people might actually be interested in it and help optimizing
>>>>> for it.
>>>>>
>>>>
>>>> Agree. Initially it was all about ZNS. Seems ZUFS can use it.
>>>>
>>>> Then we saw good results in offload to target on NVMe-OF, similar to
>>>> copy_file_range, but that does not seem to be enough. You seem to
>>>> indicate too that XFS can use it for GC.
>>>>
>>>> We can try putting a new series out to see where we are...
>>>
>>> I don't want to sound like a broken record, but I've said more than
>>> once, that btrfs (regardless of zoned or non-zoned) would be very
>>> interested in that as well and I'd be willing to help with the code or
>>> even do it myself once the block bits are in.
>>>
>>> But apparently my voice doesn't count here
>>
>> You are right. Sorry I forgot.
>>
>> Would this be through copy_file_range or something different?
>>
>
>Unfortunately not, btrfs' reclaim/balance path is a wrapper on top of
>buffered read and write (plus some extra things). _BUT_ this makes it
>possible to switch the read/write part and do copy offload (where possible).
On 11.11.2024 10:42, hch wrote:
>On Mon, Nov 11, 2024 at 10:41:33AM +0100, Javier Gonzalez wrote:
>> You are right. Sorry I forgot.
>>
>> Would this be through copy_file_range or something different?
>
>Just like for f2fs, nilfs2, or the upcoming zoned xfs the prime user
>would be the file system GC code.
Replying to both.
Thanks. Makes sense. Now that we can take a look at your branch, we can
think about how this would look.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-08 16:54 ` Matthew Wilcox
2024-11-08 17:43 ` Javier Gonzalez
@ 2024-11-11 6:49 ` Christoph Hellwig
1 sibling, 0 replies; 106+ messages in thread
From: Christoph Hellwig @ 2024-11-11 6:49 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Keith Busch, Christoph Hellwig, Keith Busch, linux-block,
linux-nvme, linux-scsi, io-uring, linux-fsdevel, joshi.k,
javier.gonz, bvanassche
On Fri, Nov 08, 2024 at 04:54:34PM +0000, Matthew Wilcox wrote:
> I like the idea of using copy-offload though.
FYI, the XFS GC code is written so that copy offload can be easily
plugged into it. We'll have to see how beneficial it actually is,
but at least it should give us a good test platform.
^ permalink raw reply [flat|nested] 106+ messages in thread
* Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams
2024-11-08 15:51 ` Keith Busch
2024-11-08 16:54 ` Matthew Wilcox
@ 2024-11-11 6:48 ` Christoph Hellwig
1 sibling, 0 replies; 106+ messages in thread
From: Christoph Hellwig @ 2024-11-11 6:48 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme,
linux-scsi, io-uring, linux-fsdevel, joshi.k, javier.gonz,
bvanassche
On Fri, Nov 08, 2024 at 08:51:31AM -0700, Keith Busch wrote:
> You're getting fragmentation anyway, which is why you had to implement
> gc.
A general purpose file system always has fragmentation of some kind,
even if it manages to avoid it for certain workloads with cooperative
applications.
If there were magic pixie dust to ensure free space never fragments, file
system development would be a solved problem :)
> You're just shifting who gets to deal with it from the controller to
> the host. The host is further from the media, so you're starting from a
> disadvantage.
And the controller is further from the application and misses a lot of
information, like, say, the file structure, so it is inherently at a
disadvantage.
> The host gc implementation would have to be quite a bit
> better to justify the link and memory usage necessary for the copies
That assumes you still have to do device GC. If you align to the
zone/erase (super)block/reclaim unit boundaries, you don't.
> This xfs implementation also has logic to recover from a power fail. The
> device already does that if you use the LBA abstraction instead of
> tracking sequential write pointers and free blocks.
Every file system has logic to recover from a power fail. I'm not sure
what kind of discussion you're trying to kick off here.
> I think you are underestimating the duplication of efforts going on
> here.
I'm still not sure what discussion you're trying to start here.
There is very little work in here, and it is work required to support
SMR drives. It turns out that for a fair number of workloads it actually
works really well on SSDs as well, beating everything else we've tried.
^ permalink raw reply [flat|nested] 106+ messages in thread