* [PATCHv8 0/6] write hints for nvme fdp
@ 2024-10-17 16:09 Keith Busch
2024-10-17 16:09 ` [PATCHv8 1/6] block, fs: restore kiocb based write hint processing Keith Busch
` (5 more replies)
0 siblings, 6 replies; 31+ messages in thread
From: Keith Busch @ 2024-10-17 16:09 UTC (permalink / raw)
To: linux-block, linux-nvme, axboe, hch, io-uring
Cc: linux-fsdevel, joshi.k, javier.gonz, Keith Busch
From: Keith Busch <[email protected]>
Changes from v7:
Limits io_uring per-io hints to raw block, and only if the block
device registers a new queue limit indicating support for it.
The per-io hints are opaque to the kernel.
Minor changelog and code organization changes.
I don't really understand the io_uring suggestions, so I just made the
write_hint a first class field without the "meta" indirection. It's
kind of like ioprio, which has it's own field too. Actually, might be
neat if we could use ioprio since it already has a "hints" field that
is currently only used by command duration limits.
Kanchan Joshi (3):
block, fs: restore kiocb based write hint processing
io_uring: enable per-io hinting capability
nvme: enable FDP support
Keith Busch (3):
block: use generic u16 for write hints
block: introduce max_write_hints queue limit
fs: introduce per-io hint support flag
Documentation/ABI/stable/sysfs-block | 7 +++
block/blk-settings.c | 3 +
block/blk-sysfs.c | 3 +
block/fops.c | 10 ++--
drivers/nvme/host/core.c | 82 ++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 5 ++
fs/aio.c | 1 +
fs/cachefiles/io.c | 1 +
fs/direct-io.c | 2 +-
fs/iomap/direct-io.c | 2 +-
include/linux/blk-mq.h | 3 +-
include/linux/blk_types.h | 2 +-
include/linux/blkdev.h | 12 ++++
include/linux/fs.h | 10 ++++
include/linux/nvme.h | 19 +++++++
include/uapi/linux/io_uring.h | 4 ++
io_uring/rw.c | 10 +++-
17 files changed, 166 insertions(+), 10 deletions(-)
--
2.43.5
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCHv8 1/6] block, fs: restore kiocb based write hint processing
2024-10-17 16:09 [PATCHv8 0/6] write hints for nvme fdp Keith Busch
@ 2024-10-17 16:09 ` Keith Busch
2024-10-18 5:50 ` Christoph Hellwig
2024-10-18 16:11 ` Bart Van Assche
2024-10-17 16:09 ` [PATCHv8 2/6] block: use generic u16 for write hints Keith Busch
` (4 subsequent siblings)
5 siblings, 2 replies; 31+ messages in thread
From: Keith Busch @ 2024-10-17 16:09 UTC (permalink / raw)
To: linux-block, linux-nvme, axboe, hch, io-uring
Cc: linux-fsdevel, joshi.k, javier.gonz, Nitesh Shetty,
Hannes Reinecke, Keith Busch
From: Kanchan Joshi <[email protected]>
struct kiocb has a 2 bytes hole that developed post commit 41d36a9f3e53
("fs: remove kiocb.ki_hint"). But write hint returned with commit
449813515d3e ("block, fs: Restore the per-bio/request data lifetime
fields").
This patch uses the leftover space in kiocb to carve 2 byte field
ki_write_hint. Restore the code that operates on kiocb to use
ki_write_hint instead of inode hint value.
This does not change any behavior, but needed to enable per-io hints.
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
block/fops.c | 6 +++---
fs/aio.c | 1 +
fs/cachefiles/io.c | 1 +
fs/direct-io.c | 2 +-
fs/iomap/direct-io.c | 2 +-
include/linux/fs.h | 8 ++++++++
io_uring/rw.c | 1 +
7 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/block/fops.c b/block/fops.c
index e696ae53bf1e0..85b9b97d372c8 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -74,7 +74,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
}
bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
- bio.bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+ bio.bi_write_hint = iocb->ki_write_hint;
bio.bi_ioprio = iocb->ki_ioprio;
if (iocb->ki_flags & IOCB_ATOMIC)
bio.bi_opf |= REQ_ATOMIC;
@@ -203,7 +203,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
for (;;) {
bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
- bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+ bio->bi_write_hint = iocb->ki_write_hint;
bio->bi_private = dio;
bio->bi_end_io = blkdev_bio_end_io;
bio->bi_ioprio = iocb->ki_ioprio;
@@ -319,7 +319,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
dio->flags = 0;
dio->iocb = iocb;
bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
- bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+ bio->bi_write_hint = iocb->ki_write_hint;
bio->bi_end_io = blkdev_bio_end_io_async;
bio->bi_ioprio = iocb->ki_ioprio;
diff --git a/fs/aio.c b/fs/aio.c
index e8920178b50f7..db618817e670d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1517,6 +1517,7 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb, int rw_type)
req->ki_flags = req->ki_filp->f_iocb_flags | IOCB_AIO_RW;
if (iocb->aio_flags & IOCB_FLAG_RESFD)
req->ki_flags |= IOCB_EVENTFD;
+ req->ki_write_hint = file_write_hint(req->ki_filp);
if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
/*
* If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c
index 6a821a959b59e..c3db102ae64e2 100644
--- a/fs/cachefiles/io.c
+++ b/fs/cachefiles/io.c
@@ -309,6 +309,7 @@ int __cachefiles_write(struct cachefiles_object *object,
ki->iocb.ki_pos = start_pos;
ki->iocb.ki_flags = IOCB_DIRECT | IOCB_WRITE;
ki->iocb.ki_ioprio = get_current_ioprio();
+ ki->iocb.ki_write_hint = file_write_hint(file);
ki->object = object;
ki->start = start_pos;
ki->len = len;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index bbd05f1a21453..73629e26becbe 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -409,7 +409,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
bio->bi_end_io = dio_bio_end_io;
if (dio->is_pinned)
bio_set_flag(bio, BIO_PAGE_PINNED);
- bio->bi_write_hint = file_inode(dio->iocb->ki_filp)->i_write_hint;
+ bio->bi_write_hint = dio->iocb->ki_write_hint;
sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f637aa0706a31..fff43f121ee65 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -397,7 +397,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
GFP_KERNEL);
bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
- bio->bi_write_hint = inode->i_write_hint;
+ bio->bi_write_hint = dio->iocb->ki_write_hint;
bio->bi_ioprio = dio->iocb->ki_ioprio;
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3559446279c15..04e875a37f604 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -370,6 +370,7 @@ struct kiocb {
void *private;
int ki_flags;
u16 ki_ioprio; /* See linux/ioprio.h */
+ u16 ki_write_hint;
union {
/*
* Only used for async buffered reads, where it denotes the
@@ -2337,12 +2338,18 @@ static inline bool HAS_UNMAPPED_ID(struct mnt_idmap *idmap,
!vfsgid_valid(i_gid_into_vfsgid(idmap, inode));
}
+static inline enum rw_hint file_write_hint(struct file *filp)
+{
+ return file_inode(filp)->i_write_hint;
+}
+
static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
{
*kiocb = (struct kiocb) {
.ki_filp = filp,
.ki_flags = filp->f_iocb_flags,
.ki_ioprio = get_current_ioprio(),
+ .ki_write_hint = file_write_hint(filp),
};
}
@@ -2353,6 +2360,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
.ki_filp = filp,
.ki_flags = kiocb_src->ki_flags,
.ki_ioprio = kiocb_src->ki_ioprio,
+ .ki_write_hint = kiocb_src->ki_write_hint,
.ki_pos = kiocb_src->ki_pos,
};
}
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 80ae3c2ebb70c..ffd637ca0bd17 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1027,6 +1027,7 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
if (unlikely(ret))
return ret;
req->cqe.res = iov_iter_count(&io->iter);
+ rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
if (force_nonblock) {
/* If the file doesn't support async, just async punt */
--
2.43.5
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCHv8 2/6] block: use generic u16 for write hints
2024-10-17 16:09 [PATCHv8 0/6] write hints for nvme fdp Keith Busch
2024-10-17 16:09 ` [PATCHv8 1/6] block, fs: restore kiocb based write hint processing Keith Busch
@ 2024-10-17 16:09 ` Keith Busch
2024-10-18 5:46 ` Christoph Hellwig
` (2 more replies)
2024-10-17 16:09 ` [PATCHv8 3/6] block: introduce max_write_hints queue limit Keith Busch
` (3 subsequent siblings)
5 siblings, 3 replies; 31+ messages in thread
From: Keith Busch @ 2024-10-17 16:09 UTC (permalink / raw)
To: linux-block, linux-nvme, axboe, hch, io-uring
Cc: linux-fsdevel, joshi.k, javier.gonz, Keith Busch
From: Keith Busch <[email protected]>
This is still backwards compatible with lifetime hints. It just doesn't
constrain the hints to that definition.
Signed-off-by: Keith Busch <[email protected]>
---
include/linux/blk-mq.h | 3 +--
include/linux/blk_types.h | 2 +-
2 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 59e9adf815a49..bf007a4081d9b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -8,7 +8,6 @@
#include <linux/scatterlist.h>
#include <linux/prefetch.h>
#include <linux/srcu.h>
-#include <linux/rw_hint.h>
struct blk_mq_tags;
struct blk_flush_queue;
@@ -156,7 +155,7 @@ struct request {
struct blk_crypto_keyslot *crypt_keyslot;
#endif
- enum rw_hint write_hint;
+ unsigned short write_hint;
unsigned short ioprio;
enum mq_rq_state state;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index dce7615c35e7e..56b7fb961e0c7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -219,7 +219,7 @@ struct bio {
*/
unsigned short bi_flags; /* BIO_* below */
unsigned short bi_ioprio;
- enum rw_hint bi_write_hint;
+ unsigned short bi_write_hint;
blk_status_t bi_status;
atomic_t __bi_remaining;
--
2.43.5
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCHv8 3/6] block: introduce max_write_hints queue limit
2024-10-17 16:09 [PATCHv8 0/6] write hints for nvme fdp Keith Busch
2024-10-17 16:09 ` [PATCHv8 1/6] block, fs: restore kiocb based write hint processing Keith Busch
2024-10-17 16:09 ` [PATCHv8 2/6] block: use generic u16 for write hints Keith Busch
@ 2024-10-17 16:09 ` Keith Busch
2024-10-18 5:51 ` Christoph Hellwig
` (2 more replies)
2024-10-17 16:09 ` [PATCHv8 4/6] fs: introduce per-io hint support flag Keith Busch
` (2 subsequent siblings)
5 siblings, 3 replies; 31+ messages in thread
From: Keith Busch @ 2024-10-17 16:09 UTC (permalink / raw)
To: linux-block, linux-nvme, axboe, hch, io-uring
Cc: linux-fsdevel, joshi.k, javier.gonz, Keith Busch
From: Keith Busch <[email protected]>
Drivers with hardware that support write hints need a way to export how
many are available so applications can generically query this.
Signed-off-by: Keith Busch <[email protected]>
---
Documentation/ABI/stable/sysfs-block | 7 +++++++
block/blk-settings.c | 3 +++
block/blk-sysfs.c | 3 +++
block/fops.c | 2 ++
include/linux/blkdev.h | 12 ++++++++++++
5 files changed, 27 insertions(+)
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 8353611107154..f2db2cabb8e75 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -506,6 +506,13 @@ Description:
[RO] Maximum size in bytes of a single element in a DMA
scatter/gather list.
+What: /sys/block/<disk>/queue/max_write_hints
+Date: October 2024
+Contact: [email protected]
+Description:
+ [RO] Maximum number of write hints supported, 0 if not
+ supported. If supported, valid values are 1 through
+ max_write_hints, inclusive.
What: /sys/block/<disk>/queue/max_segments
Date: March 2010
diff --git a/block/blk-settings.c b/block/blk-settings.c
index a446654ddee5e..921fb4d334fa4 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -43,6 +43,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
/* Inherit limits from component devices */
+ lim->max_write_hints = USHRT_MAX;
lim->max_segments = USHRT_MAX;
lim->max_discard_segments = USHRT_MAX;
lim->max_hw_sectors = UINT_MAX;
@@ -544,6 +545,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->max_segment_size = min_not_zero(t->max_segment_size,
b->max_segment_size);
+ t->max_write_hints = min(t->max_write_hints, b->max_write_hints);
+
alignment = queue_limit_alignment_offset(b, start);
/* Bottom device has different alignment. Check that it is
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 741b95dfdbf6f..85f48ca461049 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -104,6 +104,7 @@ QUEUE_SYSFS_LIMIT_SHOW(max_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_discard_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_integrity_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_segment_size)
+QUEUE_SYSFS_LIMIT_SHOW(max_write_hints)
QUEUE_SYSFS_LIMIT_SHOW(logical_block_size)
QUEUE_SYSFS_LIMIT_SHOW(physical_block_size)
QUEUE_SYSFS_LIMIT_SHOW(chunk_sectors)
@@ -457,6 +458,7 @@ QUEUE_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb");
QUEUE_RO_ENTRY(queue_max_segments, "max_segments");
QUEUE_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
QUEUE_RO_ENTRY(queue_max_segment_size, "max_segment_size");
+QUEUE_RO_ENTRY(queue_max_write_hints, "max_write_hints");
QUEUE_RW_LOAD_MODULE_ENTRY(elv_iosched, "scheduler");
QUEUE_RO_ENTRY(queue_logical_block_size, "logical_block_size");
@@ -591,6 +593,7 @@ static struct attribute *queue_attrs[] = {
&queue_max_discard_segments_entry.attr,
&queue_max_integrity_segments_entry.attr,
&queue_max_segment_size_entry.attr,
+ &queue_max_write_hints_entry.attr,
&queue_hw_sector_size_entry.attr,
&queue_logical_block_size_entry.attr,
&queue_physical_block_size_entry.attr,
diff --git a/block/fops.c b/block/fops.c
index 85b9b97d372c8..d0b16d3975fd6 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -376,6 +376,8 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
if (blkdev_dio_invalid(bdev, iocb->ki_pos, iter, is_atomic))
return -EINVAL;
+ if (iocb->ki_write_hint > bdev_max_write_hints(bdev))
+ return -EINVAL;
nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1);
if (likely(nr_pages <= BIO_MAX_VECS)) {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6b78a68e0bd9c..01aba0ffeff6e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -393,6 +393,8 @@ struct queue_limits {
unsigned short max_integrity_segments;
unsigned short max_discard_segments;
+ unsigned short max_write_hints;
+
unsigned int max_open_zones;
unsigned int max_active_zones;
@@ -1183,6 +1185,11 @@ static inline unsigned short queue_max_segments(const struct request_queue *q)
return q->limits.max_segments;
}
+static inline unsigned short queue_max_write_hints(struct request_queue *q)
+{
+ return q->limits.max_write_hints;
+}
+
static inline unsigned short queue_max_discard_segments(const struct request_queue *q)
{
return q->limits.max_discard_segments;
@@ -1230,6 +1237,11 @@ static inline unsigned int bdev_max_segments(struct block_device *bdev)
return queue_max_segments(bdev_get_queue(bdev));
}
+static inline unsigned short bdev_max_write_hints(struct block_device *bdev)
+{
+ return queue_max_write_hints(bdev_get_queue(bdev));
+}
+
static inline unsigned queue_logical_block_size(const struct request_queue *q)
{
return q->limits.logical_block_size;
--
2.43.5
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCHv8 4/6] fs: introduce per-io hint support flag
2024-10-17 16:09 [PATCHv8 0/6] write hints for nvme fdp Keith Busch
` (2 preceding siblings ...)
2024-10-17 16:09 ` [PATCHv8 3/6] block: introduce max_write_hints queue limit Keith Busch
@ 2024-10-17 16:09 ` Keith Busch
2024-10-18 5:52 ` Christoph Hellwig
2024-10-18 6:03 ` Hannes Reinecke
2024-10-17 16:09 ` [PATCHv8 5/6] io_uring: enable per-io hinting capability Keith Busch
2024-10-17 16:09 ` [PATCHv8 6/6] nvme: enable FDP support Keith Busch
5 siblings, 2 replies; 31+ messages in thread
From: Keith Busch @ 2024-10-17 16:09 UTC (permalink / raw)
To: linux-block, linux-nvme, axboe, hch, io-uring
Cc: linux-fsdevel, joshi.k, javier.gonz, Keith Busch
From: Keith Busch <[email protected]>
A block device may support write hints on a per-io basis. The raw block
file operations can effectively use these, but real filesystems are not
ready to make use of this. Provide a file_operations flag to indicate
support, and set it for the block file operations.
Signed-off-by: Keith Busch <[email protected]>
---
block/fops.c | 2 +-
include/linux/fs.h | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/block/fops.c b/block/fops.c
index d0b16d3975fd6..15a63e26161ea 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -869,7 +869,7 @@ const struct file_operations def_blk_fops = {
.splice_write = iter_file_splice_write,
.fallocate = blkdev_fallocate,
.uring_cmd = blkdev_uring_cmd,
- .fop_flags = FOP_BUFFER_RASYNC,
+ .fop_flags = FOP_BUFFER_RASYNC | FOP_PER_IO_HINTS,
};
static __init int blkdev_init(void)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 04e875a37f604..026dc9801dc20 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2117,6 +2117,8 @@ struct file_operations {
#define FOP_HUGE_PAGES ((__force fop_flags_t)(1 << 4))
/* Treat loff_t as unsigned (e.g., /dev/mem) */
#define FOP_UNSIGNED_OFFSET ((__force fop_flags_t)(1 << 5))
+/* File system can handle per-io hints */
+#define FOP_PER_IO_HINTS ((__force fop_flags_t)(1 << 6))
/* Wrap a directory iterator that needs exclusive inode access */
int wrap_directory_iterator(struct file *, struct dir_context *,
--
2.43.5
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCHv8 5/6] io_uring: enable per-io hinting capability
2024-10-17 16:09 [PATCHv8 0/6] write hints for nvme fdp Keith Busch
` (3 preceding siblings ...)
2024-10-17 16:09 ` [PATCHv8 4/6] fs: introduce per-io hint support flag Keith Busch
@ 2024-10-17 16:09 ` Keith Busch
2024-10-18 5:53 ` Christoph Hellwig
2024-10-18 6:03 ` Hannes Reinecke
2024-10-17 16:09 ` [PATCHv8 6/6] nvme: enable FDP support Keith Busch
5 siblings, 2 replies; 31+ messages in thread
From: Keith Busch @ 2024-10-17 16:09 UTC (permalink / raw)
To: linux-block, linux-nvme, axboe, hch, io-uring
Cc: linux-fsdevel, joshi.k, javier.gonz, Nitesh Shetty, Keith Busch
From: Kanchan Joshi <[email protected]>
With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and
all the subsequent writes on the file pass that hint value down. This
can be limiting for block device as all the writes can be tagged with
only one lifetime hint value. Concurrent writes (with different hint
values) are hard to manage. Per-IO hinting solves that problem.
Allow userspace to pass additional metadata in the SQE.
__u16 write_hint;
This accepts all hint values that the file allows.
The write handlers (io_prep_rw, io_write) send the hint value to
lower-layer using kiocb. This is good for upporting direct IO, but not
when kiocb is not available (e.g., buffered IO).
When per-io hints are not passed, the per-inode hint values are set in
the kiocb (as before). Otherwise, per-io hints take the precedence over
per-inode hints.
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
include/uapi/linux/io_uring.h | 4 ++++
io_uring/rw.c | 11 +++++++++--
2 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 86cb385fe0b53..bd9acc0053318 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -92,6 +92,10 @@ struct io_uring_sqe {
__u16 addr_len;
__u16 __pad3[1];
};
+ struct {
+ __u16 write_hint;
+ __u16 __pad4[1];
+ };
};
union {
struct {
diff --git a/io_uring/rw.c b/io_uring/rw.c
index ffd637ca0bd17..9a6d3ba76af4f 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -279,7 +279,11 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
rw->kiocb.ki_ioprio = get_current_ioprio();
}
rw->kiocb.dio_complete = NULL;
-
+ if (ddir == ITER_SOURCE &&
+ req->file->f_op->fop_flags & FOP_PER_IO_HINTS)
+ rw->kiocb.ki_write_hint = READ_ONCE(sqe->write_hint);
+ else
+ rw->kiocb.ki_write_hint = WRITE_LIFE_NOT_SET;
rw->addr = READ_ONCE(sqe->addr);
rw->len = READ_ONCE(sqe->len);
rw->flags = READ_ONCE(sqe->rw_flags);
@@ -1027,7 +1031,10 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
if (unlikely(ret))
return ret;
req->cqe.res = iov_iter_count(&io->iter);
- rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
+
+ /* Use per-file hint only if per-io hint is not set. */
+ if (rw->kiocb.ki_write_hint == WRITE_LIFE_NOT_SET)
+ rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
if (force_nonblock) {
/* If the file doesn't support async, just async punt */
--
2.43.5
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCHv8 6/6] nvme: enable FDP support
2024-10-17 16:09 [PATCHv8 0/6] write hints for nvme fdp Keith Busch
` (4 preceding siblings ...)
2024-10-17 16:09 ` [PATCHv8 5/6] io_uring: enable per-io hinting capability Keith Busch
@ 2024-10-17 16:09 ` Keith Busch
2024-10-18 10:48 ` Kanchan Joshi
5 siblings, 1 reply; 31+ messages in thread
From: Keith Busch @ 2024-10-17 16:09 UTC (permalink / raw)
To: linux-block, linux-nvme, axboe, hch, io-uring
Cc: linux-fsdevel, joshi.k, javier.gonz, Hui Qi, Nitesh Shetty,
Hannes Reinecke, Keith Busch
From: Kanchan Joshi <[email protected]>
Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
to control the placement of logical blocks so as to reduce the SSD WAF.
Userspace can send the write hint information using io_uring or fcntl.
Fetch the placement-identifiers if the device supports FDP. The incoming
write-hint is mapped to a placement-identifier, which in turn is set in
the DSPEC field of the write command.
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Hui Qi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Nacked-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
drivers/nvme/host/core.c | 82 ++++++++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 5 +++
include/linux/nvme.h | 19 ++++++++++
3 files changed, 106 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 43d73d31c66f3..02a36032c835f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -44,6 +44,20 @@ struct nvme_ns_info {
bool is_removed;
};
+struct nvme_fdp_ruh_status_desc {
+ u16 pid;
+ u16 ruhid;
+ u32 earutr;
+ u64 ruamw;
+ u8 rsvd16[16];
+};
+
+struct nvme_fdp_ruh_status {
+ u8 rsvd0[14];
+ __le16 nruhsd;
+ struct nvme_fdp_ruh_status_desc ruhsd[];
+};
+
unsigned int admin_timeout = 60;
module_param(admin_timeout, uint, 0644);
MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
@@ -657,6 +671,7 @@ static void nvme_free_ns_head(struct kref *ref)
ida_free(&head->subsys->ns_ida, head->instance);
cleanup_srcu_struct(&head->srcu);
nvme_put_subsystem(head->subsys);
+ kfree(head->plids);
kfree(head);
}
@@ -974,6 +989,13 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
if (req->cmd_flags & REQ_RAHEAD)
dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
+ if (req->write_hint != WRITE_LIFE_NOT_SET && ns->head->nr_plids) {
+ u16 hint = max(req->write_hint, ns->head->nr_plids);
+
+ dsmgmt |= ns->head->plids[hint - 1] << 16;
+ control |= NVME_RW_DTYPE_DPLCMT;
+ }
+
if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req))
return BLK_STS_INVAL;
@@ -2114,6 +2136,52 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
return ret;
}
+static int nvme_fetch_fdp_plids(struct nvme_ns *ns, u32 nsid)
+{
+ struct nvme_fdp_ruh_status_desc *ruhsd;
+ struct nvme_ns_head *head = ns->head;
+ struct nvme_fdp_ruh_status *ruhs;
+ struct nvme_command c = {};
+ int size, ret, i;
+
+ if (head->plids)
+ return 0;
+
+ size = struct_size(ruhs, ruhsd, NVME_MAX_PLIDS);
+ ruhs = kzalloc(size, GFP_KERNEL);
+ if (!ruhs)
+ return -ENOMEM;
+
+ c.imr.opcode = nvme_cmd_io_mgmt_recv;
+ c.imr.nsid = cpu_to_le32(nsid);
+ c.imr.mo = 0x1;
+ c.imr.numd = cpu_to_le32((size >> 2) - 1);
+
+ ret = nvme_submit_sync_cmd(ns->queue, &c, ruhs, size);
+ if (ret)
+ goto out;
+
+ i = le16_to_cpu(ruhs->nruhsd);
+ if (!i)
+ goto out;
+
+ ns->head->nr_plids = min_t(u16, i, NVME_MAX_PLIDS);
+ head->plids = kcalloc(ns->head->nr_plids, sizeof(head->plids),
+ GFP_KERNEL);
+ if (!head->plids) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ for (i = 0; i < ns->head->nr_plids; i++) {
+ ruhsd = &ruhs->ruhsd[i];
+ head->plids[i] = le16_to_cpu(ruhsd->pid);
+ }
+out:
+ kfree(ruhs);
+ return ret;
+}
+
static int nvme_update_ns_info_block(struct nvme_ns *ns,
struct nvme_ns_info *info)
{
@@ -2150,6 +2218,19 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
goto out;
}
+ if (ns->ctrl->ctratt & NVME_CTRL_ATTR_FDPS) {
+ ret = nvme_fetch_fdp_plids(ns, info->nsid);
+ if (ret)
+ dev_warn(ns->ctrl->device,
+ "FDP failure status:0x%x\n", ret);
+ if (ret < 0)
+ goto out;
+ } else {
+ ns->head->nr_plids = 0;
+ kfree(ns->head->plids);
+ ns->head->plids = NULL;
+ }
+
blk_mq_freeze_queue(ns->disk->queue);
ns->head->lba_shift = id->lbaf[lbaf].ds;
ns->head->nuse = le64_to_cpu(id->nuse);
@@ -2180,6 +2261,7 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
if (!nvme_init_integrity(ns->head, &lim, info))
capacity = 0;
+ lim.max_write_hints = ns->head->nr_plids;
ret = queue_limits_commit_update(ns->disk->queue, &lim);
if (ret) {
blk_mq_unfreeze_queue(ns->disk->queue);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 313a4f978a2cf..d6b516c0e502c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -454,6 +454,8 @@ struct nvme_ns_ids {
u8 csi;
};
+#define NVME_MAX_PLIDS (NVME_CTRL_PAGE_SIZE / sizeof(16))
+
/*
* Anchor structure for namespaces. There is one for each namespace in a
* NVMe subsystem that any of our controllers can see, and the namespace
@@ -490,6 +492,9 @@ struct nvme_ns_head {
struct device cdev_device;
struct gendisk *disk;
+
+ u16 nr_plids;
+ u16 *plids;
#ifdef CONFIG_NVME_MULTIPATH
struct bio_list requeue_list;
spinlock_t requeue_lock;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b58d9405d65e0..a954eaee5b0f3 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -275,6 +275,7 @@ enum nvme_ctrl_attr {
NVME_CTRL_ATTR_HID_128_BIT = (1 << 0),
NVME_CTRL_ATTR_TBKAS = (1 << 6),
NVME_CTRL_ATTR_ELBAS = (1 << 15),
+ NVME_CTRL_ATTR_FDPS = (1 << 19),
};
struct nvme_id_ctrl {
@@ -843,6 +844,7 @@ enum nvme_opcode {
nvme_cmd_resv_register = 0x0d,
nvme_cmd_resv_report = 0x0e,
nvme_cmd_resv_acquire = 0x11,
+ nvme_cmd_io_mgmt_recv = 0x12,
nvme_cmd_resv_release = 0x15,
nvme_cmd_zone_mgmt_send = 0x79,
nvme_cmd_zone_mgmt_recv = 0x7a,
@@ -864,6 +866,7 @@ enum nvme_opcode {
nvme_opcode_name(nvme_cmd_resv_register), \
nvme_opcode_name(nvme_cmd_resv_report), \
nvme_opcode_name(nvme_cmd_resv_acquire), \
+ nvme_opcode_name(nvme_cmd_io_mgmt_recv), \
nvme_opcode_name(nvme_cmd_resv_release), \
nvme_opcode_name(nvme_cmd_zone_mgmt_send), \
nvme_opcode_name(nvme_cmd_zone_mgmt_recv), \
@@ -1015,6 +1018,7 @@ enum {
NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12,
NVME_RW_PRINFO_PRACT = 1 << 13,
NVME_RW_DTYPE_STREAMS = 1 << 4,
+ NVME_RW_DTYPE_DPLCMT = 2 << 4,
NVME_WZ_DEAC = 1 << 9,
};
@@ -1102,6 +1106,20 @@ struct nvme_zone_mgmt_recv_cmd {
__le32 cdw14[2];
};
+struct nvme_io_mgmt_recv_cmd {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __le64 rsvd2[2];
+ union nvme_data_ptr dptr;
+ __u8 mo;
+ __u8 rsvd11;
+ __u16 mos;
+ __le32 numd;
+ __le32 cdw12[4];
+};
+
enum {
NVME_ZRA_ZONE_REPORT = 0,
NVME_ZRASF_ZONE_REPORT_ALL = 0,
@@ -1822,6 +1840,7 @@ struct nvme_command {
struct nvmf_auth_receive_command auth_receive;
struct nvme_dbbuf dbbuf;
struct nvme_directive_cmd directive;
+ struct nvme_io_mgmt_recv_cmd imr;
};
};
--
2.43.5
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCHv8 2/6] block: use generic u16 for write hints
2024-10-17 16:09 ` [PATCHv8 2/6] block: use generic u16 for write hints Keith Busch
@ 2024-10-18 5:46 ` Christoph Hellwig
2024-10-24 19:45 ` Keith Busch
2024-10-18 6:00 ` Hannes Reinecke
2024-10-18 16:12 ` Bart Van Assche
2 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2024-10-18 5:46 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, linux-nvme, axboe, hch, io-uring, linux-fsdevel,
joshi.k, javier.gonz, Keith Busch
On Thu, Oct 17, 2024 at 09:09:33AM -0700, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> This is still backwards compatible with lifetime hints. It just doesn't
> constrain the hints to that definition.
So in the end we'll end up with two uses of it - the existing 5
temperature hints and the new stream separation. I think it
would be cleaner to make it a union, but I don't care that
strongly.
But we probably want a way to distinguish which one is supported.
E.g. for SCSI we set a net BLK_FEAT_WRITE_HINTS, for NVMe we'll set
BLK_FEAT_STREAM_SEPARATION.
Either way this should probably be the first patch in the series.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 1/6] block, fs: restore kiocb based write hint processing
2024-10-17 16:09 ` [PATCHv8 1/6] block, fs: restore kiocb based write hint processing Keith Busch
@ 2024-10-18 5:50 ` Christoph Hellwig
2024-10-21 15:47 ` Keith Busch
2024-10-18 16:11 ` Bart Van Assche
1 sibling, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2024-10-18 5:50 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, linux-nvme, axboe, hch, io-uring, linux-fsdevel,
joshi.k, javier.gonz, Nitesh Shetty, Hannes Reinecke, Keith Busch
On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
> From: Kanchan Joshi <[email protected]>
>
> struct kiocb has a 2 bytes hole that developed post commit 41d36a9f3e53
> ("fs: remove kiocb.ki_hint"). But write hint returned with commit
> 449813515d3e ("block, fs: Restore the per-bio/request data lifetime
> fields").
>
> This patch uses the leftover space in kiocb to carve 2 byte field
> ki_write_hint. Restore the code that operates on kiocb to use
> ki_write_hint instead of inode hint value.
>
> This does not change any behavior, but needed to enable per-io hints.
In this version it doesn't really restore anything, but adds a new
write hinting capability. Similarly to the bio patch we'll probably
need to make clear what is in there instead of having it completely
untyped (the exact same appraoch as for the bio should work).
> index bbd05f1a21453..73629e26becbe 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -409,7 +409,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
> bio->bi_end_io = dio_bio_end_io;
> if (dio->is_pinned)
> bio_set_flag(bio, BIO_PAGE_PINNED);
> - bio->bi_write_hint = file_inode(dio->iocb->ki_filp)->i_write_hint;
> + bio->bi_write_hint = dio->iocb->ki_write_hint;
>
> sdio->bio = bio;
> sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f637aa0706a31..fff43f121ee65 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -397,7 +397,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> GFP_KERNEL);
> bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
> - bio->bi_write_hint = inode->i_write_hint;
> + bio->bi_write_hint = dio->iocb->ki_write_hint;
File system (helper) code should not directly apply this limit,
but the file system needs to set it.
> +static inline enum rw_hint file_write_hint(struct file *filp)
> +{
> + return file_inode(filp)->i_write_hint;
> +}
> +
> static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
> {
> *kiocb = (struct kiocb) {
> .ki_filp = filp,
> .ki_flags = filp->f_iocb_flags,
> .ki_ioprio = get_current_ioprio(),
> + .ki_write_hint = file_write_hint(filp),
And we'll need to distinguish between the per-inode and per file
hint. I.e. don't blindly initialize ki_write_hint to the per-inode
one here, but make that conditional in the file operation.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 3/6] block: introduce max_write_hints queue limit
2024-10-17 16:09 ` [PATCHv8 3/6] block: introduce max_write_hints queue limit Keith Busch
@ 2024-10-18 5:51 ` Christoph Hellwig
2024-10-18 6:01 ` Hannes Reinecke
2024-10-18 16:18 ` Bart Van Assche
2 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2024-10-18 5:51 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, linux-nvme, axboe, hch, io-uring, linux-fsdevel,
joshi.k, javier.gonz, Keith Busch
On Thu, Oct 17, 2024 at 09:09:34AM -0700, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> Drivers with hardware that support write hints need a way to export how
> many are available so applications can generically query this.
Calling this write hints vs write streams is very confusing.
Otherwise this looks reasonable.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 4/6] fs: introduce per-io hint support flag
2024-10-17 16:09 ` [PATCHv8 4/6] fs: introduce per-io hint support flag Keith Busch
@ 2024-10-18 5:52 ` Christoph Hellwig
2024-10-18 6:03 ` Hannes Reinecke
1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2024-10-18 5:52 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, linux-nvme, axboe, hch, io-uring, linux-fsdevel,
joshi.k, javier.gonz, Keith Busch
On Thu, Oct 17, 2024 at 09:09:35AM -0700, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> A block device may support write hints on a per-io basis. The raw block
> file operations can effectively use these, but real filesystems are not
> ready to make use of this. Provide a file_operations flag to indicate
> support, and set it for the block file operations.
I'm a little worried about the overly generic "hint" think again. Make
it very explicit about the write streams and this make sense.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 5/6] io_uring: enable per-io hinting capability
2024-10-17 16:09 ` [PATCHv8 5/6] io_uring: enable per-io hinting capability Keith Busch
@ 2024-10-18 5:53 ` Christoph Hellwig
2024-10-18 6:03 ` Hannes Reinecke
1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2024-10-18 5:53 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, linux-nvme, axboe, hch, io-uring, linux-fsdevel,
joshi.k, javier.gonz, Nitesh Shetty, Keith Busch
Same hint vs write stream thing here as well.
> + if (ddir == ITER_SOURCE &&
> + req->file->f_op->fop_flags & FOP_PER_IO_HINTS)
> + rw->kiocb.ki_write_hint = READ_ONCE(sqe->write_hint);
> + else
> + rw->kiocb.ki_write_hint = WRITE_LIFE_NOT_SET;
WRITE_LIFE_NOT_SET is in the wrong namespae vs the separate streams.
Either use 0 directly or add a separate constant for it.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 2/6] block: use generic u16 for write hints
2024-10-17 16:09 ` [PATCHv8 2/6] block: use generic u16 for write hints Keith Busch
2024-10-18 5:46 ` Christoph Hellwig
@ 2024-10-18 6:00 ` Hannes Reinecke
2024-10-18 16:12 ` Bart Van Assche
2 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-10-18 6:00 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme, axboe, hch, io-uring
Cc: linux-fsdevel, joshi.k, javier.gonz, Keith Busch
On 10/17/24 18:09, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> This is still backwards compatible with lifetime hints. It just doesn't
> constrain the hints to that definition.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> include/linux/blk-mq.h | 3 +--
> include/linux/blk_types.h | 2 +-
> 2 files changed, 2 insertions(+), 3 deletions(-)
>
Reviewed-by: Hannes Reinecke <[email protected]>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 3/6] block: introduce max_write_hints queue limit
2024-10-17 16:09 ` [PATCHv8 3/6] block: introduce max_write_hints queue limit Keith Busch
2024-10-18 5:51 ` Christoph Hellwig
@ 2024-10-18 6:01 ` Hannes Reinecke
2024-10-18 16:18 ` Bart Van Assche
2 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-10-18 6:01 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme, axboe, hch, io-uring
Cc: linux-fsdevel, joshi.k, javier.gonz, Keith Busch
On 10/17/24 18:09, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> Drivers with hardware that support write hints need a way to export how
> many are available so applications can generically query this.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> Documentation/ABI/stable/sysfs-block | 7 +++++++
> block/blk-settings.c | 3 +++
> block/blk-sysfs.c | 3 +++
> block/fops.c | 2 ++
> include/linux/blkdev.h | 12 ++++++++++++
> 5 files changed, 27 insertions(+)
>
Reviewed-by: Hannes Reinecke <[email protected]>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 4/6] fs: introduce per-io hint support flag
2024-10-17 16:09 ` [PATCHv8 4/6] fs: introduce per-io hint support flag Keith Busch
2024-10-18 5:52 ` Christoph Hellwig
@ 2024-10-18 6:03 ` Hannes Reinecke
1 sibling, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-10-18 6:03 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme, axboe, hch, io-uring
Cc: linux-fsdevel, joshi.k, javier.gonz, Keith Busch
On 10/17/24 18:09, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> A block device may support write hints on a per-io basis. The raw block
> file operations can effectively use these, but real filesystems are not
> ready to make use of this. Provide a file_operations flag to indicate
> support, and set it for the block file operations.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> block/fops.c | 2 +-
> include/linux/fs.h | 2 ++
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
Reviewed-by: Hannes Reinecke <[email protected]>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 5/6] io_uring: enable per-io hinting capability
2024-10-17 16:09 ` [PATCHv8 5/6] io_uring: enable per-io hinting capability Keith Busch
2024-10-18 5:53 ` Christoph Hellwig
@ 2024-10-18 6:03 ` Hannes Reinecke
1 sibling, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2024-10-18 6:03 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme, axboe, hch, io-uring
Cc: linux-fsdevel, joshi.k, javier.gonz, Nitesh Shetty, Keith Busch
On 10/17/24 18:09, Keith Busch wrote:
> From: Kanchan Joshi <[email protected]>
>
> With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and
> all the subsequent writes on the file pass that hint value down. This
> can be limiting for block device as all the writes can be tagged with
> only one lifetime hint value. Concurrent writes (with different hint
> values) are hard to manage. Per-IO hinting solves that problem.
>
> Allow userspace to pass additional metadata in the SQE.
>
> __u16 write_hint;
>
> This accepts all hint values that the file allows.
>
> The write handlers (io_prep_rw, io_write) send the hint value to
> lower-layer using kiocb. This is good for upporting direct IO, but not
> when kiocb is not available (e.g., buffered IO).
>
> When per-io hints are not passed, the per-inode hint values are set in
> the kiocb (as before). Otherwise, per-io hints take the precedence over
> per-inode hints.
>
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> include/uapi/linux/io_uring.h | 4 ++++
> io_uring/rw.c | 11 +++++++++--
> 2 files changed, 13 insertions(+), 2 deletions(-)
>
Reviewed-by: Hannes Reinecke <[email protected]>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 6/6] nvme: enable FDP support
2024-10-17 16:09 ` [PATCHv8 6/6] nvme: enable FDP support Keith Busch
@ 2024-10-18 10:48 ` Kanchan Joshi
2024-10-21 15:08 ` Keith Busch
0 siblings, 1 reply; 31+ messages in thread
From: Kanchan Joshi @ 2024-10-18 10:48 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme, axboe, hch, io-uring
Cc: linux-fsdevel, javier.gonz, Hui Qi, Nitesh Shetty,
Hannes Reinecke, Keith Busch
On 10/17/2024 9:39 PM, Keith Busch wrote:
>
> +#define NVME_MAX_PLIDS (NVME_CTRL_PAGE_SIZE / sizeof(16))
> +
Seems you intended sizeof(u16).
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 1/6] block, fs: restore kiocb based write hint processing
2024-10-17 16:09 ` [PATCHv8 1/6] block, fs: restore kiocb based write hint processing Keith Busch
2024-10-18 5:50 ` Christoph Hellwig
@ 2024-10-18 16:11 ` Bart Van Assche
1 sibling, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2024-10-18 16:11 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme, axboe, hch, io-uring
Cc: linux-fsdevel, joshi.k, javier.gonz, Nitesh Shetty,
Hannes Reinecke, Keith Busch
On 10/17/24 9:09 AM, Keith Busch wrote:
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3559446279c15..04e875a37f604 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -370,6 +370,7 @@ struct kiocb {
> void *private;
> int ki_flags;
> u16 ki_ioprio; /* See linux/ioprio.h */
> + u16 ki_write_hint;
> union {
> /*
> * Only used for async buffered reads, where it denotes the
Why 'u16'? Shouldn't this patch use the type 'enum rw_hint' for
ki_write_hint and shouldn't this type be changed into u16 in a later patch?
Thanks,
Bart.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 2/6] block: use generic u16 for write hints
2024-10-17 16:09 ` [PATCHv8 2/6] block: use generic u16 for write hints Keith Busch
2024-10-18 5:46 ` Christoph Hellwig
2024-10-18 6:00 ` Hannes Reinecke
@ 2024-10-18 16:12 ` Bart Van Assche
2024-10-21 15:03 ` Keith Busch
2 siblings, 1 reply; 31+ messages in thread
From: Bart Van Assche @ 2024-10-18 16:12 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme, axboe, hch, io-uring
Cc: linux-fsdevel, joshi.k, javier.gonz, Keith Busch
On 10/17/24 9:09 AM, Keith Busch wrote:
> @@ -156,7 +155,7 @@ struct request {
> struct blk_crypto_keyslot *crypt_keyslot;
> #endif
>
> - enum rw_hint write_hint;
> + unsigned short write_hint;
Why 'u16' for ki_write_hint and 'unsigned short' for write_hint? Isn't
that inconsistent?
Thanks,
Bart.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 3/6] block: introduce max_write_hints queue limit
2024-10-17 16:09 ` [PATCHv8 3/6] block: introduce max_write_hints queue limit Keith Busch
2024-10-18 5:51 ` Christoph Hellwig
2024-10-18 6:01 ` Hannes Reinecke
@ 2024-10-18 16:18 ` Bart Van Assche
2024-10-21 15:02 ` Keith Busch
2 siblings, 1 reply; 31+ messages in thread
From: Bart Van Assche @ 2024-10-18 16:18 UTC (permalink / raw)
To: Keith Busch, linux-block, linux-nvme, axboe, hch, io-uring
Cc: linux-fsdevel, joshi.k, javier.gonz, Keith Busch
On 10/17/24 9:09 AM, Keith Busch wrote:
> Drivers with hardware that support write hints need a way to export how
> many are available so applications can generically query this.
Something is missing from this patch, namely a change for the SCSI disk
(sd) driver that sets max_write_hints to sdkp->permanent_stream_count.
> +What: /sys/block/<disk>/queue/max_write_hints
> +Date: October 2024
> +Contact: [email protected]
> +Description:
> + [RO] Maximum number of write hints supported, 0 if not
> + supported. If supported, valid values are 1 through
> + max_write_hints, inclusive.
That's a bit short. I think it would help to add a reference to the
aspects of the standards related to this attribute: permanent streams
for SCSI and FDP for NVMe.
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index a446654ddee5e..921fb4d334fa4 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -43,6 +43,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
> lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
>
> /* Inherit limits from component devices */
> + lim->max_write_hints = USHRT_MAX;
> lim->max_segments = USHRT_MAX;
> lim->max_discard_segments = USHRT_MAX;
> lim->max_hw_sectors = UINT_MAX;
> @@ -544,6 +545,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> t->max_segment_size = min_not_zero(t->max_segment_size,
> b->max_segment_size);
>
> + t->max_write_hints = min(t->max_write_hints, b->max_write_hints);
> +
> alignment = queue_limit_alignment_offset(b, start);
>
I prefer that lim->max_write_hints is initialized to zero in
blk_set_stacking_limits() and that blk_stack_limits() uses
min_not_zero().
Thanks,
Bart.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 3/6] block: introduce max_write_hints queue limit
2024-10-18 16:18 ` Bart Van Assche
@ 2024-10-21 15:02 ` Keith Busch
0 siblings, 0 replies; 31+ messages in thread
From: Keith Busch @ 2024-10-21 15:02 UTC (permalink / raw)
To: Bart Van Assche
Cc: Keith Busch, linux-block, linux-nvme, axboe, hch, io-uring,
linux-fsdevel, joshi.k, javier.gonz
On Fri, Oct 18, 2024 at 09:18:34AM -0700, Bart Van Assche wrote:
> On 10/17/24 9:09 AM, Keith Busch wrote:
> > Drivers with hardware that support write hints need a way to export how
> > many are available so applications can generically query this.
>
> Something is missing from this patch, namely a change for the SCSI disk
> (sd) driver that sets max_write_hints to sdkp->permanent_stream_count.
Shouldn't someone who cares about scsi do that? I certainly don't care,
nor have I been keeping up with what's happening there, so I'm also
unqualified.
> > +What: /sys/block/<disk>/queue/max_write_hints
> > +Date: October 2024
> > +Contact: [email protected]
> > +Description:
> > + [RO] Maximum number of write hints supported, 0 if not
> > + supported. If supported, valid values are 1 through
> > + max_write_hints, inclusive.
>
> That's a bit short. I think it would help to add a reference to the
> aspects of the standards related to this attribute: permanent streams
> for SCSI and FDP for NVMe.
The specs regarding write hints have not historically been stable, so
I'd rather not tie kernel docs to volatile external specifications.
> > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > index a446654ddee5e..921fb4d334fa4 100644
> > --- a/block/blk-settings.c
> > +++ b/block/blk-settings.c
> > @@ -43,6 +43,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
> > lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
> > /* Inherit limits from component devices */
> > + lim->max_write_hints = USHRT_MAX;
> > lim->max_segments = USHRT_MAX;
> > lim->max_discard_segments = USHRT_MAX;
> > lim->max_hw_sectors = UINT_MAX;
> > @@ -544,6 +545,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> > t->max_segment_size = min_not_zero(t->max_segment_size,
> > b->max_segment_size);
> > + t->max_write_hints = min(t->max_write_hints, b->max_write_hints);
> > +
> > alignment = queue_limit_alignment_offset(b, start);
>
> I prefer that lim->max_write_hints is initialized to zero in
> blk_set_stacking_limits() and that blk_stack_limits() uses
> min_not_zero().
How is a device supposed to report it doesn't support a write hint if 0
gets overridden?
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 2/6] block: use generic u16 for write hints
2024-10-18 16:12 ` Bart Van Assche
@ 2024-10-21 15:03 ` Keith Busch
0 siblings, 0 replies; 31+ messages in thread
From: Keith Busch @ 2024-10-21 15:03 UTC (permalink / raw)
To: Bart Van Assche
Cc: Keith Busch, linux-block, linux-nvme, axboe, hch, io-uring,
linux-fsdevel, joshi.k, javier.gonz
On Fri, Oct 18, 2024 at 09:12:20AM -0700, Bart Van Assche wrote:
> On 10/17/24 9:09 AM, Keith Busch wrote:
> > @@ -156,7 +155,7 @@ struct request {
> > struct blk_crypto_keyslot *crypt_keyslot;
> > #endif
> > - enum rw_hint write_hint;
> > + unsigned short write_hint;
>
> Why 'u16' for ki_write_hint and 'unsigned short' for write_hint? Isn't
> that inconsistent?
It's consistent with the local convetion of the existing structs. Some
use unsiged short, others use u16. It looks weird to mix the types
within the same struct.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 6/6] nvme: enable FDP support
2024-10-18 10:48 ` Kanchan Joshi
@ 2024-10-21 15:08 ` Keith Busch
0 siblings, 0 replies; 31+ messages in thread
From: Keith Busch @ 2024-10-21 15:08 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Keith Busch, linux-block, linux-nvme, axboe, hch, io-uring,
linux-fsdevel, javier.gonz, Hui Qi, Nitesh Shetty,
Hannes Reinecke
On Fri, Oct 18, 2024 at 04:18:17PM +0530, Kanchan Joshi wrote:
> On 10/17/2024 9:39 PM, Keith Busch wrote:
> >
> > +#define NVME_MAX_PLIDS (NVME_CTRL_PAGE_SIZE / sizeof(16))
> > +
>
> Seems you intended sizeof(u16).
Ha, yes, that's the intended type. I don't have anything close to 1k
placement hints, so would have never noticed this was the wrong size.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 1/6] block, fs: restore kiocb based write hint processing
2024-10-18 5:50 ` Christoph Hellwig
@ 2024-10-21 15:47 ` Keith Busch
2024-10-21 17:09 ` Bart Van Assche
2024-10-22 6:43 ` Christoph Hellwig
0 siblings, 2 replies; 31+ messages in thread
From: Keith Busch @ 2024-10-21 15:47 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, axboe, io-uring,
linux-fsdevel, joshi.k, javier.gonz, Nitesh Shetty,
Hannes Reinecke
On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote:
> On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
> > {
> > *kiocb = (struct kiocb) {
> > .ki_filp = filp,
> > .ki_flags = filp->f_iocb_flags,
> > .ki_ioprio = get_current_ioprio(),
> > + .ki_write_hint = file_write_hint(filp),
>
> And we'll need to distinguish between the per-inode and per file
> hint. I.e. don't blindly initialize ki_write_hint to the per-inode
> one here, but make that conditional in the file operation.
Maybe someone wants to do direct-io with partions where each partition
has a different default "hint" when not provided a per-io hint? I don't
know of such a case, but it doesn't sound terrible. In any case, I feel
if you're directing writes through these interfaces, you get to keep all
the pieces: user space controls policy, kernel just provides the
mechanisms to do it.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 1/6] block, fs: restore kiocb based write hint processing
2024-10-21 15:47 ` Keith Busch
@ 2024-10-21 17:09 ` Bart Van Assche
2024-10-21 19:35 ` Keith Busch
2024-10-22 6:43 ` Christoph Hellwig
1 sibling, 1 reply; 31+ messages in thread
From: Bart Van Assche @ 2024-10-21 17:09 UTC (permalink / raw)
To: Keith Busch, Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, axboe, io-uring,
linux-fsdevel, joshi.k, javier.gonz, Nitesh Shetty,
Hannes Reinecke
On 10/21/24 8:47 AM, Keith Busch wrote:
> On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote:
>> On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
>>> {
>>> *kiocb = (struct kiocb) {
>>> .ki_filp = filp,
>>> .ki_flags = filp->f_iocb_flags,
>>> .ki_ioprio = get_current_ioprio(),
>>> + .ki_write_hint = file_write_hint(filp),
>>
>> And we'll need to distinguish between the per-inode and per file
>> hint. I.e. don't blindly initialize ki_write_hint to the per-inode
>> one here, but make that conditional in the file operation.
>
> Maybe someone wants to do direct-io with partions where each partition
> has a different default "hint" when not provided a per-io hint? I don't
> know of such a case, but it doesn't sound terrible. In any case, I feel
> if you're directing writes through these interfaces, you get to keep all
> the pieces: user space controls policy, kernel just provides the
> mechanisms to do it.
Is it important to support partitions on top of FDP namespaces? We could
follow the example of zoned block devices and not support partitions on
top of FDP devices. From block/core.c, function add_partition():
/*
* Partitions are not supported on zoned block devices that are used as
* such.
*/
if (bdev_is_zoned(disk->part0)) {
pr_warn("%s: partitions not supported on host managed zoned block
device\n",
disk->disk_name);
return ERR_PTR(-ENXIO);
}
Thanks,
Bart.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 1/6] block, fs: restore kiocb based write hint processing
2024-10-21 17:09 ` Bart Van Assche
@ 2024-10-21 19:35 ` Keith Busch
0 siblings, 0 replies; 31+ messages in thread
From: Keith Busch @ 2024-10-21 19:35 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe,
io-uring, linux-fsdevel, joshi.k, javier.gonz, Nitesh Shetty,
Hannes Reinecke
On Mon, Oct 21, 2024 at 10:09:57AM -0700, Bart Van Assche wrote:
> On 10/21/24 8:47 AM, Keith Busch wrote:
> > On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote:
> > > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
> > > > {
> > > > *kiocb = (struct kiocb) {
> > > > .ki_filp = filp,
> > > > .ki_flags = filp->f_iocb_flags,
> > > > .ki_ioprio = get_current_ioprio(),
> > > > + .ki_write_hint = file_write_hint(filp),
> > >
> > > And we'll need to distinguish between the per-inode and per file
> > > hint. I.e. don't blindly initialize ki_write_hint to the per-inode
> > > one here, but make that conditional in the file operation.
> >
> > Maybe someone wants to do direct-io with partions where each partition
> > has a different default "hint" when not provided a per-io hint? I don't
> > know of such a case, but it doesn't sound terrible. In any case, I feel
> > if you're directing writes through these interfaces, you get to keep all
> > the pieces: user space controls policy, kernel just provides the
> > mechanisms to do it.
>
> Is it important to support partitions on top of FDP namespaces?
It's already used with partitions, so yes, it's important. Breaking that
as a condition to make it work with the block stack is a non-starter.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 1/6] block, fs: restore kiocb based write hint processing
2024-10-21 15:47 ` Keith Busch
2024-10-21 17:09 ` Bart Van Assche
@ 2024-10-22 6:43 ` Christoph Hellwig
2024-10-22 14:37 ` Keith Busch
1 sibling, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2024-10-22 6:43 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe,
io-uring, linux-fsdevel, joshi.k, javier.gonz, Nitesh Shetty,
Hannes Reinecke
On Mon, Oct 21, 2024 at 09:47:47AM -0600, Keith Busch wrote:
> On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote:
> > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
> > > {
> > > *kiocb = (struct kiocb) {
> > > .ki_filp = filp,
> > > .ki_flags = filp->f_iocb_flags,
> > > .ki_ioprio = get_current_ioprio(),
> > > + .ki_write_hint = file_write_hint(filp),
> >
> > And we'll need to distinguish between the per-inode and per file
> > hint. I.e. don't blindly initialize ki_write_hint to the per-inode
> > one here, but make that conditional in the file operation.
>
> Maybe someone wants to do direct-io with partions where each partition
> has a different default "hint" when not provided a per-io hint? I don't
> know of such a case, but it doesn't sound terrible. In any case, I feel
> if you're directing writes through these interfaces, you get to keep all
> the pieces: user space controls policy, kernel just provides the
> mechanisms to do it.
Eww. You actually pointed out a real problem here: if a device
has multiple partitions the write streams as of this series are
shared by them, which breaks their use case as the applications or
file systems in different partitions will get other users of the
write stream randomly overlayed onto theirs.
So either the available streams need to be split into smaller pools
by partitions, or we just assigned them to the first partition to
make these scheme work for partitioned devices.
Either way mixing up the per-inode hint and the dynamic one remains
a bad idea.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 1/6] block, fs: restore kiocb based write hint processing
2024-10-22 6:43 ` Christoph Hellwig
@ 2024-10-22 14:37 ` Keith Busch
2024-10-22 14:40 ` Christoph Hellwig
0 siblings, 1 reply; 31+ messages in thread
From: Keith Busch @ 2024-10-22 14:37 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, axboe, io-uring,
linux-fsdevel, joshi.k, javier.gonz, Nitesh Shetty,
Hannes Reinecke
On Tue, Oct 22, 2024 at 08:43:09AM +0200, Christoph Hellwig wrote:
> On Mon, Oct 21, 2024 at 09:47:47AM -0600, Keith Busch wrote:
> > On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote:
> > > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
> > > > {
> > > > *kiocb = (struct kiocb) {
> > > > .ki_filp = filp,
> > > > .ki_flags = filp->f_iocb_flags,
> > > > .ki_ioprio = get_current_ioprio(),
> > > > + .ki_write_hint = file_write_hint(filp),
> > >
> > > And we'll need to distinguish between the per-inode and per file
> > > hint. I.e. don't blindly initialize ki_write_hint to the per-inode
> > > one here, but make that conditional in the file operation.
> >
> > Maybe someone wants to do direct-io with partions where each partition
> > has a different default "hint" when not provided a per-io hint? I don't
> > know of such a case, but it doesn't sound terrible. In any case, I feel
> > if you're directing writes through these interfaces, you get to keep all
> > the pieces: user space controls policy, kernel just provides the
> > mechanisms to do it.
>
> Eww. You actually pointed out a real problem here: if a device
> has multiple partitions the write streams as of this series are
> shared by them, which breaks their use case as the applications or
> file systems in different partitions will get other users of the
> write stream randomly overlayed onto theirs.
>
> So either the available streams need to be split into smaller pools
> by partitions, or we just assigned them to the first partition to
> make these scheme work for partitioned devices.
>
> Either way mixing up the per-inode hint and the dynamic one remains
> a bad idea.
No doubt it's almost certainly not a good idea to mix different stream
usages, but that's not the kernels problem. It's user space policy. I
don't think the kernel needs to perform any heroic efforts to split
anything here. Just keep it simple.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 1/6] block, fs: restore kiocb based write hint processing
2024-10-22 14:37 ` Keith Busch
@ 2024-10-22 14:40 ` Christoph Hellwig
0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2024-10-22 14:40 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe,
io-uring, linux-fsdevel, joshi.k, javier.gonz, Nitesh Shetty,
Hannes Reinecke
On Tue, Oct 22, 2024 at 08:37:56AM -0600, Keith Busch wrote:
> No doubt it's almost certainly not a good idea to mix different stream
> usages, but that's not the kernels problem. It's user space policy. I
> don't think the kernel needs to perform any heroic efforts to split
> anything here. Just keep it simple.
Unfortunately it's not. This complicated breaks any intelligent
scheme to manage them. You can't write portable code assuming you
can know write streams when you need to cope with someone else using
some or all of the same resources.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 2/6] block: use generic u16 for write hints
2024-10-18 5:46 ` Christoph Hellwig
@ 2024-10-24 19:45 ` Keith Busch
2024-10-25 12:20 ` Christoph Hellwig
0 siblings, 1 reply; 31+ messages in thread
From: Keith Busch @ 2024-10-24 19:45 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-block, linux-nvme, axboe, io-uring,
linux-fsdevel, joshi.k, javier.gonz
On Fri, Oct 18, 2024 at 07:46:44AM +0200, Christoph Hellwig wrote:
> On Thu, Oct 17, 2024 at 09:09:33AM -0700, Keith Busch wrote:
> > From: Keith Busch <[email protected]>
> >
> > This is still backwards compatible with lifetime hints. It just doesn't
> > constrain the hints to that definition.
>
> So in the end we'll end up with two uses of it - the existing 5
> temperature hints and the new stream separation. I think it
> would be cleaner to make it a union, but I don't care that
> strongly.
>
> But we probably want a way to distinguish which one is supported.
>
> E.g. for SCSI we set a net BLK_FEAT_WRITE_HINTS, for NVMe we'll set
> BLK_FEAT_STREAM_SEPARATION.
>
> Either way this should probably be the first patch in the series.
I'm not sure I follow this feedback. The SCSI feature is defined as a
lifetime stream association in SBC-5. So it's still a stream for SCSI,
but you want to call it "WRITE_HINT", which is not a term used in the
SCSI spec for this feature. But, you want to call it STREAM_SEPARATION
for NVMe only, even though the FDP spec doesn't use that term? What's
wrong with just calling it a generic hint support feature?
I also don't see why SCSI couldn't use per-io hints just like this
enables for NVMe. The spec doesn't limit SCSI to just 5 streams, so this
provides a way to access them all through the raw block device.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCHv8 2/6] block: use generic u16 for write hints
2024-10-24 19:45 ` Keith Busch
@ 2024-10-25 12:20 ` Christoph Hellwig
0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2024-10-25 12:20 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe,
io-uring, linux-fsdevel, joshi.k, javier.gonz
On Thu, Oct 24, 2024 at 01:45:02PM -0600, Keith Busch wrote:
> I'm not sure I follow this feedback. The SCSI feature is defined as a
> lifetime stream association in SBC-5. So it's still a stream for SCSI,
> but you want to call it "WRITE_HINT", which is not a term used in the
> SCSI spec for this feature. But, you want to call it STREAM_SEPARATION
> for NVMe only, even though the FDP spec doesn't use that term? What's
> wrong with just calling it a generic hint support feature?
The "Constrained Streams with Data Lifetimes" are called streams for
political reasons but they are not. They are buckets of different data
lifetimes.
> I also don't see why SCSI couldn't use per-io hints just like this
> enables for NVMe. The spec doesn't limit SCSI to just 5 streams, so this
> provides a way to access them all through the raw block device.
I don't mind passing per-I/O temperature hints to SCSI block devices.
But we should not confuse a streams/FDP like streams that are different
context which are assumed to be discarded together and have a concept of
Stream Granularity Size or Reclaim Unit size with the data temperature
hints that are at the storage level fundamentally per-I/O and just bucket
into temperature group without any indication of data locality.
^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2024-10-25 12:20 UTC | newest]
Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-17 16:09 [PATCHv8 0/6] write hints for nvme fdp Keith Busch
2024-10-17 16:09 ` [PATCHv8 1/6] block, fs: restore kiocb based write hint processing Keith Busch
2024-10-18 5:50 ` Christoph Hellwig
2024-10-21 15:47 ` Keith Busch
2024-10-21 17:09 ` Bart Van Assche
2024-10-21 19:35 ` Keith Busch
2024-10-22 6:43 ` Christoph Hellwig
2024-10-22 14:37 ` Keith Busch
2024-10-22 14:40 ` Christoph Hellwig
2024-10-18 16:11 ` Bart Van Assche
2024-10-17 16:09 ` [PATCHv8 2/6] block: use generic u16 for write hints Keith Busch
2024-10-18 5:46 ` Christoph Hellwig
2024-10-24 19:45 ` Keith Busch
2024-10-25 12:20 ` Christoph Hellwig
2024-10-18 6:00 ` Hannes Reinecke
2024-10-18 16:12 ` Bart Van Assche
2024-10-21 15:03 ` Keith Busch
2024-10-17 16:09 ` [PATCHv8 3/6] block: introduce max_write_hints queue limit Keith Busch
2024-10-18 5:51 ` Christoph Hellwig
2024-10-18 6:01 ` Hannes Reinecke
2024-10-18 16:18 ` Bart Van Assche
2024-10-21 15:02 ` Keith Busch
2024-10-17 16:09 ` [PATCHv8 4/6] fs: introduce per-io hint support flag Keith Busch
2024-10-18 5:52 ` Christoph Hellwig
2024-10-18 6:03 ` Hannes Reinecke
2024-10-17 16:09 ` [PATCHv8 5/6] io_uring: enable per-io hinting capability Keith Busch
2024-10-18 5:53 ` Christoph Hellwig
2024-10-18 6:03 ` Hannes Reinecke
2024-10-17 16:09 ` [PATCHv8 6/6] nvme: enable FDP support Keith Busch
2024-10-18 10:48 ` Kanchan Joshi
2024-10-21 15:08 ` Keith Busch
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox