* [PATCHv11 00/10] block write streams with nvme fdp
@ 2024-12-06 1:52 Keith Busch
2024-12-06 1:52 ` [PATCHv11 01/10] fs: add a write stream field to the kiocb Keith Busch
` (11 more replies)
0 siblings, 12 replies; 24+ messages in thread
From: Keith Busch @ 2024-12-06 1:52 UTC (permalink / raw)
To: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring
Cc: sagi, asml.silence, Keith Busch
From: Keith Busch <[email protected]>
Changes from v10:
Merged up to block for-6.14/io_uring, which required some
new attribute handling.
No longer mixing write hint usage with write streams. This
effectively abandons any attempt to use the existing fcntl API
with filesystems in this series.
Exporting the stream's reclaim unit nominal size.
Christoph Hellwig (5):
fs: add a write stream field to the kiocb
block: add a bi_write_stream field
block: introduce a write_stream_granularity queue limit
block: expose write streams for block device nodes
nvme: add a nvme_get_log_lsi helper
Keith Busch (5):
io_uring: protection information enhancements
io_uring: add write stream attribute
block: introduce max_write_streams queue limit
nvme: register fdp queue limits
nvme: use fdp streams if write stream is provided
Documentation/ABI/stable/sysfs-block | 15 +++
block/bdev.c | 6 +
block/bio.c | 2 +
block/blk-crypto-fallback.c | 1 +
block/blk-merge.c | 4 +
block/blk-sysfs.c | 6 +
block/bounce.c | 1 +
block/fops.c | 23 ++++
drivers/nvme/host/core.c | 160 ++++++++++++++++++++++++++-
drivers/nvme/host/nvme.h | 5 +
include/linux/blk_types.h | 1 +
include/linux/blkdev.h | 16 +++
include/linux/fs.h | 1 +
include/linux/nvme.h | 73 ++++++++++++
include/uapi/linux/io_uring.h | 21 +++-
io_uring/rw.c | 38 ++++++-
16 files changed, 359 insertions(+), 14 deletions(-)
--
2.43.5
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCHv11 01/10] fs: add a write stream field to the kiocb
2024-12-06 1:52 [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
@ 2024-12-06 1:52 ` Keith Busch
2024-12-06 1:53 ` [PATCHv11 02/10] io_uring: protection information enhancements Keith Busch
` (10 subsequent siblings)
11 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2024-12-06 1:52 UTC (permalink / raw)
To: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring
Cc: sagi, asml.silence, Keith Busch
From: Christoph Hellwig <[email protected]>
Prepare for io_uring passthrough of write streams. The write stream
field in the kiocb structure fits into an existing 2-byte hole, so its
size is not changed.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
include/linux/fs.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2cc3d45da7b01..26940c451f319 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -373,6 +373,7 @@ struct kiocb {
void *private;
int ki_flags;
u16 ki_ioprio; /* See linux/ioprio.h */
+ u8 ki_write_stream;
union {
/*
* Only used for async buffered reads, where it denotes the
--
2.43.5
* [PATCHv11 02/10] io_uring: protection information enhancements
2024-12-06 1:52 [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
2024-12-06 1:52 ` [PATCHv11 01/10] fs: add a write stream field to the kiocb Keith Busch
@ 2024-12-06 1:53 ` Keith Busch
[not found] ` <CGME20241206095739epcas5p1ee968cb92c9d4ceb25a79ad80521601f@epcas5p1.samsung.com>
2024-12-06 1:53 ` [PATCHv11 03/10] io_uring: add write stream attribute Keith Busch
` (9 subsequent siblings)
11 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2024-12-06 1:53 UTC (permalink / raw)
To: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring
Cc: sagi, asml.silence, Keith Busch
From: Keith Busch <[email protected]>
Just fixing up some formatting, removing unused parameters, and paving
the way to allow chaining additional arbitrary attributes.
Signed-off-by: Keith Busch <[email protected]>
---
include/uapi/linux/io_uring.h | 14 ++++++++------
io_uring/rw.c | 10 +++++-----
2 files changed, 13 insertions(+), 11 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 38f0d6b10eaf7..5fa38467d6070 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -115,14 +115,16 @@ struct io_uring_sqe {
#define IORING_RW_ATTR_FLAG_PI (1U << 0)
/* PI attribute information */
struct io_uring_attr_pi {
- __u16 flags;
- __u16 app_tag;
- __u32 len;
- __u64 addr;
- __u64 seed;
- __u64 rsvd;
+ __u16 flags;
+ __u16 app_tag;
+ __u32 len;
+ __u64 addr;
+ __u64 seed;
+ __u64 rsvd;
};
+#define IORING_RW_ATTR_FLAGS_SUPPORTED (IORING_RW_ATTR_FLAG_PI)
+
/*
* If sqe->file_index is set to this for opcodes that instantiate a new
* direct descriptor (like openat/openat2/accept), then io_uring will allocate
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 04e4467ab0ee8..a2987aefb2cec 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -272,14 +272,14 @@ static inline void io_meta_restore(struct io_async_rw *io, struct kiocb *kiocb)
}
static int io_prep_rw_pi(struct io_kiocb *req, struct io_rw *rw, int ddir,
- u64 attr_ptr, u64 attr_type_mask)
+ u64 *attr_ptr)
{
struct io_uring_attr_pi pi_attr;
struct io_async_rw *io;
int ret;
- if (copy_from_user(&pi_attr, u64_to_user_ptr(attr_ptr),
- sizeof(pi_attr)))
+ if (copy_from_user(&pi_attr, u64_to_user_ptr(*attr_ptr),
+ sizeof(pi_attr)))
return -EFAULT;
if (pi_attr.rsvd)
@@ -295,6 +295,7 @@ static int io_prep_rw_pi(struct io_kiocb *req, struct io_rw *rw, int ddir,
return ret;
rw->kiocb.ki_flags |= IOCB_HAS_METADATA;
io_meta_save_state(io);
+ *attr_ptr += sizeof(pi_attr);
return ret;
}
@@ -335,8 +336,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
if (attr_type_mask) {
u64 attr_ptr;
- /* only PI attribute is supported currently */
- if (attr_type_mask != IORING_RW_ATTR_FLAG_PI)
+ if (attr_type_mask & ~IORING_RW_ATTR_FLAGS_SUPPORTED)
return -EINVAL;
attr_ptr = READ_ONCE(sqe->attr_ptr);
--
2.43.5
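The chaining convention this patch prepares for can be sketched from userspace: attributes are packed back to back at sqe->attr_ptr, and each parser advances the cursor by the size of the attribute it consumed, the way io_prep_rw_pi() now bumps *attr_ptr. The struct below mirrors struct io_uring_attr_pi from the uapi hunk above; the packing helper is purely illustrative, not liburing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors struct io_uring_attr_pi from the uapi hunk above. */
struct attr_pi {
	uint16_t flags;
	uint16_t app_tag;
	uint32_t len;
	uint64_t addr;
	uint64_t seed;
	uint64_t rsvd;
};

/*
 * Pack one attribute at the current offset and return the advanced
 * offset -- the userspace mirror image of the kernel consuming an
 * attribute and advancing *attr_ptr by its size.
 */
static size_t pack_attr(unsigned char *buf, size_t off,
			const void *attr, size_t attr_size)
{
	memcpy(buf + off, attr, attr_size);
	return off + attr_size;
}
```

With this layout, a second attribute type simply packs after the first, and the kernel walks them in ascending flag-bit order.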
* [PATCHv11 03/10] io_uring: add write stream attribute
2024-12-06 1:52 [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
2024-12-06 1:52 ` [PATCHv11 01/10] fs: add a write stream field to the kiocb Keith Busch
2024-12-06 1:53 ` [PATCHv11 02/10] io_uring: protection information enhancements Keith Busch
@ 2024-12-06 1:53 ` Keith Busch
[not found] ` <CGME20241206100326epcas5p17d4dad663ccc6c6f40cfab98437e63f3@epcas5p1.samsung.com>
2024-12-06 12:44 ` Kanchan Joshi
2024-12-06 1:53 ` [PATCHv11 04/10] block: add a bi_write_stream field Keith Busch
` (8 subsequent siblings)
11 siblings, 2 replies; 24+ messages in thread
From: Keith Busch @ 2024-12-06 1:53 UTC (permalink / raw)
To: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring
Cc: sagi, asml.silence, Keith Busch
From: Keith Busch <[email protected]>
Adds a new attribute type to specify a write stream per-IO.
Signed-off-by: Keith Busch <[email protected]>
---
include/uapi/linux/io_uring.h | 9 ++++++++-
io_uring/rw.c | 28 +++++++++++++++++++++++++++-
2 files changed, 35 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 5fa38467d6070..263cd57aae72d 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -123,7 +123,14 @@ struct io_uring_attr_pi {
__u64 rsvd;
};
-#define IORING_RW_ATTR_FLAGS_SUPPORTED (IORING_RW_ATTR_FLAG_PI)
+#define IORING_RW_ATTR_FLAG_WRITE_STREAM (1U << 1)
+struct io_uring_write_stream {
+ __u16 write_stream;
+ __u8 rsvd[6];
+};
+
+#define IORING_RW_ATTR_FLAGS_SUPPORTED (IORING_RW_ATTR_FLAG_PI | \
+ IORING_RW_ATTR_FLAG_WRITE_STREAM)
/*
* If sqe->file_index is set to this for opcodes that instantiate a new
diff --git a/io_uring/rw.c b/io_uring/rw.c
index a2987aefb2cec..69b566e296f6d 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -299,6 +299,22 @@ static int io_prep_rw_pi(struct io_kiocb *req, struct io_rw *rw, int ddir,
return ret;
}
+static int io_prep_rw_write_stream(struct io_rw *rw, u64 *attr_ptr)
+{
+ struct io_uring_write_stream write_stream;
+
+ if (copy_from_user(&write_stream, u64_to_user_ptr(*attr_ptr),
+ sizeof(write_stream)))
+ return -EFAULT;
+
+ if (memchr_inv(write_stream.rsvd, 0, sizeof(write_stream.rsvd)))
+ return -EINVAL;
+
+ rw->kiocb.ki_write_stream = write_stream.write_stream;
+ *attr_ptr += sizeof(write_stream);
+ return 0;
+}
+
static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
int ddir, bool do_import)
{
@@ -340,7 +356,17 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
return -EINVAL;
attr_ptr = READ_ONCE(sqe->attr_ptr);
- ret = io_prep_rw_pi(req, rw, ddir, attr_ptr, attr_type_mask);
+ if (attr_type_mask & IORING_RW_ATTR_FLAG_PI) {
+ ret = io_prep_rw_pi(req, rw, ddir, &attr_ptr);
+ if (ret)
+ return ret;
+ }
+
+ if (attr_type_mask & IORING_RW_ATTR_FLAG_WRITE_STREAM) {
+ ret = io_prep_rw_write_stream(rw, &attr_ptr);
+ if (ret)
+ return ret;
+ }
}
return ret;
}
--
2.43.5
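A userspace view of the new attribute: the struct mirrors io_uring_write_stream from the hunk above, and the kernel only accepts it when every reserved byte is zero (the memchr_inv() check in io_prep_rw_write_stream(), returning -EINVAL otherwise). The validation helper here is an illustrative stand-in for that check, not kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirrors struct io_uring_write_stream from the uapi hunk above. */
struct write_stream_attr {
	uint16_t write_stream;
	uint8_t  rsvd[6];
};

/*
 * Userspace equivalent of the kernel's reserved-field check: the
 * attribute is only valid when all rsvd bytes are zero.
 */
static int write_stream_attr_valid(const struct write_stream_attr *a)
{
	static const uint8_t zero[6];

	return memcmp(a->rsvd, zero, sizeof(zero)) == 0;
}
```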
* [PATCHv11 04/10] block: add a bi_write_stream field
2024-12-06 1:52 [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
` (2 preceding siblings ...)
2024-12-06 1:53 ` [PATCHv11 03/10] io_uring: add write stream attribute Keith Busch
@ 2024-12-06 1:53 ` Keith Busch
2024-12-06 1:53 ` [PATCHv11 05/10] block: introduce max_write_streams queue limit Keith Busch
` (7 subsequent siblings)
11 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2024-12-06 1:53 UTC (permalink / raw)
To: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring
Cc: sagi, asml.silence, Keith Busch
From: Christoph Hellwig <[email protected]>
Add the ability to pass a write stream for placement control in the bio.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
block/bio.c | 2 ++
block/blk-crypto-fallback.c | 1 +
block/blk-merge.c | 4 ++++
block/bounce.c | 1 +
include/linux/blk_types.h | 1 +
5 files changed, 9 insertions(+)
diff --git a/block/bio.c b/block/bio.c
index 699a78c85c756..2aa86edc7cd6f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -251,6 +251,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
bio->bi_flags = 0;
bio->bi_ioprio = 0;
bio->bi_write_hint = 0;
+ bio->bi_write_stream = 0;
bio->bi_status = 0;
bio->bi_iter.bi_sector = 0;
bio->bi_iter.bi_size = 0;
@@ -827,6 +828,7 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
bio_set_flag(bio, BIO_CLONED);
bio->bi_ioprio = bio_src->bi_ioprio;
bio->bi_write_hint = bio_src->bi_write_hint;
+ bio->bi_write_stream = bio_src->bi_write_stream;
bio->bi_iter = bio_src->bi_iter;
if (bio->bi_bdev) {
diff --git a/block/blk-crypto-fallback.c b/block/blk-crypto-fallback.c
index 29a205482617c..66762243a886b 100644
--- a/block/blk-crypto-fallback.c
+++ b/block/blk-crypto-fallback.c
@@ -173,6 +173,7 @@ static struct bio *blk_crypto_fallback_clone_bio(struct bio *bio_src)
bio_set_flag(bio, BIO_REMAPPED);
bio->bi_ioprio = bio_src->bi_ioprio;
bio->bi_write_hint = bio_src->bi_write_hint;
+ bio->bi_write_stream = bio_src->bi_write_stream;
bio->bi_iter.bi_sector = bio_src->bi_iter.bi_sector;
bio->bi_iter.bi_size = bio_src->bi_iter.bi_size;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index e01383c6e534b..1e5327fb6c45b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -866,6 +866,8 @@ static struct request *attempt_merge(struct request_queue *q,
if (req->bio->bi_write_hint != next->bio->bi_write_hint)
return NULL;
+ if (req->bio->bi_write_stream != next->bio->bi_write_stream)
+ return NULL;
if (req->bio->bi_ioprio != next->bio->bi_ioprio)
return NULL;
if (!blk_atomic_write_mergeable_rqs(req, next))
@@ -987,6 +989,8 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
return false;
if (rq->bio->bi_write_hint != bio->bi_write_hint)
return false;
+ if (rq->bio->bi_write_stream != bio->bi_write_stream)
+ return false;
if (rq->bio->bi_ioprio != bio->bi_ioprio)
return false;
if (blk_atomic_write_mergeable_rq_bio(rq, bio) == false)
diff --git a/block/bounce.c b/block/bounce.c
index 0d898cd5ec497..fb8f60f114d7d 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -170,6 +170,7 @@ static struct bio *bounce_clone_bio(struct bio *bio_src)
bio_set_flag(bio, BIO_REMAPPED);
bio->bi_ioprio = bio_src->bi_ioprio;
bio->bi_write_hint = bio_src->bi_write_hint;
+ bio->bi_write_stream = bio_src->bi_write_stream;
bio->bi_iter.bi_sector = bio_src->bi_iter.bi_sector;
bio->bi_iter.bi_size = bio_src->bi_iter.bi_size;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index dce7615c35e7e..4ca3449ce9c95 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -220,6 +220,7 @@ struct bio {
unsigned short bi_flags; /* BIO_* below */
unsigned short bi_ioprio;
enum rw_hint bi_write_hint;
+ u8 bi_write_stream;
blk_status_t bi_status;
atomic_t __bi_remaining;
--
2.43.5
* [PATCHv11 05/10] block: introduce max_write_streams queue limit
2024-12-06 1:52 [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
` (3 preceding siblings ...)
2024-12-06 1:53 ` [PATCHv11 04/10] block: add a bi_write_stream field Keith Busch
@ 2024-12-06 1:53 ` Keith Busch
2024-12-06 1:53 ` [PATCHv11 06/10] block: introduce a write_stream_granularity " Keith Busch
` (6 subsequent siblings)
11 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2024-12-06 1:53 UTC (permalink / raw)
To: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring
Cc: sagi, asml.silence, Keith Busch
From: Keith Busch <[email protected]>
Drivers with hardware that supports write streams need a way to export
how many are available so that applications can query this generically.
Signed-off-by: Keith Busch <[email protected]>
[hch: renamed hints to streams, removed stacking]
Signed-off-by: Christoph Hellwig <[email protected]>
---
Documentation/ABI/stable/sysfs-block | 7 +++++++
block/blk-sysfs.c | 3 +++
include/linux/blkdev.h | 9 +++++++++
3 files changed, 19 insertions(+)
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 0cceb2badc836..f67139b8b8eff 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -506,6 +506,13 @@ Description:
[RO] Maximum size in bytes of a single element in a DMA
scatter/gather list.
+What: /sys/block/<disk>/queue/max_write_streams
+Date: November 2024
+Contact: [email protected]
+Description:
+ [RO] Maximum number of write streams supported, 0 if not
+ supported. If supported, valid values are 1 through
+ max_write_streams, inclusive.
What: /sys/block/<disk>/queue/max_segments
Date: March 2010
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 4241aea84161c..c514c0cb5e93c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -104,6 +104,7 @@ QUEUE_SYSFS_LIMIT_SHOW(max_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_discard_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_integrity_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_segment_size)
+QUEUE_SYSFS_LIMIT_SHOW(max_write_streams)
QUEUE_SYSFS_LIMIT_SHOW(logical_block_size)
QUEUE_SYSFS_LIMIT_SHOW(physical_block_size)
QUEUE_SYSFS_LIMIT_SHOW(chunk_sectors)
@@ -446,6 +447,7 @@ QUEUE_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb");
QUEUE_RO_ENTRY(queue_max_segments, "max_segments");
QUEUE_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
QUEUE_RO_ENTRY(queue_max_segment_size, "max_segment_size");
+QUEUE_RO_ENTRY(queue_max_write_streams, "max_write_streams");
QUEUE_RW_LOAD_MODULE_ENTRY(elv_iosched, "scheduler");
QUEUE_RO_ENTRY(queue_logical_block_size, "logical_block_size");
@@ -580,6 +582,7 @@ static struct attribute *queue_attrs[] = {
&queue_max_discard_segments_entry.attr,
&queue_max_integrity_segments_entry.attr,
&queue_max_segment_size_entry.attr,
+ &queue_max_write_streams_entry.attr,
&queue_hw_sector_size_entry.attr,
&queue_logical_block_size_entry.attr,
&queue_physical_block_size_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 08a727b408164..ce2c3ddda2411 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -399,6 +399,8 @@ struct queue_limits {
unsigned short max_integrity_segments;
unsigned short max_discard_segments;
+ unsigned short max_write_streams;
+
unsigned int max_open_zones;
unsigned int max_active_zones;
@@ -1240,6 +1242,13 @@ static inline unsigned int bdev_max_segments(struct block_device *bdev)
return queue_max_segments(bdev_get_queue(bdev));
}
+static inline unsigned short bdev_max_write_streams(struct block_device *bdev)
+{
+ if (bdev_is_partition(bdev))
+ return 0;
+ return bdev_limits(bdev)->max_write_streams;
+}
+
static inline unsigned queue_logical_block_size(const struct request_queue *q)
{
return q->limits.logical_block_size;
--
2.43.5
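Applications discover the limit by reading the new sysfs attribute. A minimal sketch of consuming the value read from /sys/block/&lt;disk&gt;/queue/max_write_streams (decimal text, newline terminated, "0" meaning no support), together with the validity rule from the ABI text above; both helpers are illustrative userspace code, not kernel API:

```c
#include <assert.h>
#include <stdlib.h>

/* Parse the text read from queue/max_write_streams. */
static unsigned long parse_max_write_streams(const char *sysfs_text)
{
	/* strtoul stops at the trailing newline. */
	return strtoul(sysfs_text, NULL, 10);
}

/*
 * Per the ABI description above: 0 means unsupported, otherwise
 * valid per-I/O streams are 1 through max_write_streams, inclusive.
 */
static int stream_valid(unsigned long max_streams, unsigned long stream)
{
	return max_streams != 0 && stream >= 1 && stream <= max_streams;
}
```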
* [PATCHv11 06/10] block: introduce a write_stream_granularity queue limit
2024-12-06 1:52 [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
` (4 preceding siblings ...)
2024-12-06 1:53 ` [PATCHv11 05/10] block: introduce max_write_streams queue limit Keith Busch
@ 2024-12-06 1:53 ` Keith Busch
2024-12-06 1:53 ` [PATCHv11 07/10] block: expose write streams for block device nodes Keith Busch
` (5 subsequent siblings)
11 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2024-12-06 1:53 UTC (permalink / raw)
To: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring
Cc: sagi, asml.silence, Keith Busch
From: Christoph Hellwig <[email protected]>
Export the granularity that write streams should be discarded with,
as it is essential for making good use of them.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
Documentation/ABI/stable/sysfs-block | 8 ++++++++
block/blk-sysfs.c | 3 +++
include/linux/blkdev.h | 7 +++++++
3 files changed, 18 insertions(+)
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index f67139b8b8eff..c454c68b68fe6 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -514,6 +514,14 @@ Description:
supported. If supported, valid values are 1 through
max_write_streams, inclusive.
+What: /sys/block/<disk>/queue/write_stream_granularity
+Date: November 2024
+Contact: [email protected]
+Description:
+ [RO] Granularity of a write stream in bytes. The granularity
+ of a write stream is the size that should be discarded or
+ overwritten together to avoid write amplification in the device.
+
What: /sys/block/<disk>/queue/max_segments
Date: March 2010
Contact: [email protected]
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index c514c0cb5e93c..525f4fa132cd3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -105,6 +105,7 @@ QUEUE_SYSFS_LIMIT_SHOW(max_discard_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_integrity_segments)
QUEUE_SYSFS_LIMIT_SHOW(max_segment_size)
QUEUE_SYSFS_LIMIT_SHOW(max_write_streams)
+QUEUE_SYSFS_LIMIT_SHOW(write_stream_granularity)
QUEUE_SYSFS_LIMIT_SHOW(logical_block_size)
QUEUE_SYSFS_LIMIT_SHOW(physical_block_size)
QUEUE_SYSFS_LIMIT_SHOW(chunk_sectors)
@@ -448,6 +449,7 @@ QUEUE_RO_ENTRY(queue_max_segments, "max_segments");
QUEUE_RO_ENTRY(queue_max_integrity_segments, "max_integrity_segments");
QUEUE_RO_ENTRY(queue_max_segment_size, "max_segment_size");
QUEUE_RO_ENTRY(queue_max_write_streams, "max_write_streams");
+QUEUE_RO_ENTRY(queue_write_stream_granularity, "write_stream_granularity");
QUEUE_RW_LOAD_MODULE_ENTRY(elv_iosched, "scheduler");
QUEUE_RO_ENTRY(queue_logical_block_size, "logical_block_size");
@@ -583,6 +585,7 @@ static struct attribute *queue_attrs[] = {
&queue_max_integrity_segments_entry.attr,
&queue_max_segment_size_entry.attr,
&queue_max_write_streams_entry.attr,
+ &queue_write_stream_granularity_entry.attr,
&queue_hw_sector_size_entry.attr,
&queue_logical_block_size_entry.attr,
&queue_physical_block_size_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ce2c3ddda2411..7be8cc57561a1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -400,6 +400,7 @@ struct queue_limits {
unsigned short max_discard_segments;
unsigned short max_write_streams;
+ unsigned int write_stream_granularity;
unsigned int max_open_zones;
unsigned int max_active_zones;
@@ -1249,6 +1250,12 @@ static inline unsigned short bdev_max_write_streams(struct block_device *bdev)
return bdev_limits(bdev)->max_write_streams;
}
+static inline unsigned int
+bdev_write_stream_granularity(struct block_device *bdev)
+{
+ return bdev_limits(bdev)->write_stream_granularity;
+}
+
static inline unsigned queue_logical_block_size(const struct request_queue *q)
{
return q->limits.logical_block_size;
--
2.43.5
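The granularity is meant to guide deallocation: per the ABI text above, data should be discarded or overwritten in whole granules to avoid write amplification in the device. That alignment rule can be sketched as plain arithmetic (userspace illustration only):

```c
#include <assert.h>
#include <stdint.h>

/*
 * True when [off, off + len) covers only whole write-stream granules,
 * i.e. the range can be reclaimed by the device without relocating
 * neighboring data.
 */
static int covers_whole_granules(uint64_t off, uint64_t len, uint64_t gran)
{
	return gran != 0 && (off % gran) == 0 && (len % gran) == 0;
}
```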
* [PATCHv11 07/10] block: expose write streams for block device nodes
2024-12-06 1:52 [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
` (5 preceding siblings ...)
2024-12-06 1:53 ` [PATCHv11 06/10] block: introduce a write_stream_granularity " Keith Busch
@ 2024-12-06 1:53 ` Keith Busch
[not found] ` <CGME20241206091949epcas5p14a01e4cfe614ddd04e23b84f8f1036d5@epcas5p1.samsung.com>
2024-12-06 1:53 ` [PATCHv11 08/10] nvme: add a nvme_get_log_lsi helper Keith Busch
` (4 subsequent siblings)
11 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2024-12-06 1:53 UTC (permalink / raw)
To: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring
Cc: sagi, asml.silence, Keith Busch
From: Christoph Hellwig <[email protected]>
Export statx information about the number and granularity of write
streams, use the per-kiocb write stream if one is set, and otherwise map
temperature hints to write streams (which is a bit questionable, but
shows how it is done).
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
block/bdev.c | 6 ++++++
block/fops.c | 23 +++++++++++++++++++++++
2 files changed, 29 insertions(+)
diff --git a/block/bdev.c b/block/bdev.c
index 738e3c8457e7f..c23245f1fdfe3 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1296,6 +1296,12 @@ void bdev_statx(struct path *path, struct kstat *stat,
stat->result_mask |= STATX_DIOALIGN;
}
+ if ((request_mask & STATX_WRITE_STREAM) &&
+ bdev_max_write_streams(bdev)) {
+ stat->write_stream_max = bdev_max_write_streams(bdev);
+ stat->result_mask |= STATX_WRITE_STREAM;
+ }
+
if (request_mask & STATX_WRITE_ATOMIC && bdev_can_atomic_write(bdev)) {
struct request_queue *bd_queue = bdev->bd_queue;
diff --git a/block/fops.c b/block/fops.c
index 6d5c4fc5a2168..f16aa39bf5bad 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -73,6 +73,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
}
bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
bio.bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+ bio.bi_write_stream = iocb->ki_write_stream;
bio.bi_ioprio = iocb->ki_ioprio;
if (iocb->ki_flags & IOCB_ATOMIC)
bio.bi_opf |= REQ_ATOMIC;
@@ -206,6 +207,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
for (;;) {
bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+ bio->bi_write_stream = iocb->ki_write_stream;
bio->bi_private = dio;
bio->bi_end_io = blkdev_bio_end_io;
bio->bi_ioprio = iocb->ki_ioprio;
@@ -333,6 +335,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
dio->iocb = iocb;
bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+ bio->bi_write_stream = iocb->ki_write_stream;
bio->bi_end_io = blkdev_bio_end_io_async;
bio->bi_ioprio = iocb->ki_ioprio;
@@ -398,6 +401,26 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
if (blkdev_dio_invalid(bdev, iocb, iter))
return -EINVAL;
+ if (iov_iter_rw(iter) == WRITE) {
+ u16 max_write_streams = bdev_max_write_streams(bdev);
+
+ if (iocb->ki_write_stream) {
+ if (iocb->ki_write_stream > max_write_streams)
+ return -EINVAL;
+ } else if (max_write_streams) {
+ enum rw_hint write_hint =
+ file_inode(iocb->ki_filp)->i_write_hint;
+
+ /*
+ * Just use the write hint as write stream for block
+ * device writes. This assumes no file system is
+ * mounted that would use the streams differently.
+ */
+ if (write_hint <= max_write_streams)
+ iocb->ki_write_stream = write_hint;
+ }
+ }
+
nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1);
if (likely(nr_pages <= BIO_MAX_VECS)) {
if (is_sync_kiocb(iocb))
--
2.43.5
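The selection logic added to blkdev_direct_IO() above can be modeled as a pure function: an explicit per-I/O stream must not exceed the device maximum, and when no stream is given the inode's temperature hint is reused as the stream if it fits. This is a userspace model of the patch's behavior for illustration, not shared kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns the write stream a block device write would use, or -1
 * where the kernel would fail the I/O with -EINVAL.
 */
static int select_write_stream(uint16_t explicit_stream,
			       uint16_t max_write_streams,
			       unsigned int write_hint)
{
	if (explicit_stream) {
		if (explicit_stream > max_write_streams)
			return -1;	/* kernel returns -EINVAL */
		return explicit_stream;
	}
	/* No explicit stream: reuse the temperature hint if it fits. */
	if (max_write_streams && write_hint <= max_write_streams)
		return write_hint;
	return 0;			/* no stream assigned */
}
```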
* [PATCHv11 08/10] nvme: add a nvme_get_log_lsi helper
2024-12-06 1:52 [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
` (6 preceding siblings ...)
2024-12-06 1:53 ` [PATCHv11 07/10] block: expose write streams for block device nodes Keith Busch
@ 2024-12-06 1:53 ` Keith Busch
2024-12-06 1:53 ` [PATCHv11 09/10] nvme: register fdp queue limits Keith Busch
` (3 subsequent siblings)
11 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2024-12-06 1:53 UTC (permalink / raw)
To: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring
Cc: sagi, asml.silence, Keith Busch
From: Christoph Hellwig <[email protected]>
Add a helper for log pages that need to pass in an LSI value, while at
the same time not touching all the existing nvme_get_log callers.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
drivers/nvme/host/core.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 571d4106d256d..36c44be98e38c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -150,6 +150,8 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
unsigned nsid);
static void nvme_update_keep_alive(struct nvme_ctrl *ctrl,
struct nvme_command *cmd);
+static int nvme_get_log_lsi(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page,
+ u8 lsp, u8 csi, void *log, size_t size, u64 offset, u16 lsi);
void nvme_queue_scan(struct nvme_ctrl *ctrl)
{
@@ -3074,8 +3076,8 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
return ret;
}
-int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp, u8 csi,
- void *log, size_t size, u64 offset)
+static int nvme_get_log_lsi(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page,
+ u8 lsp, u8 csi, void *log, size_t size, u64 offset, u16 lsi)
{
struct nvme_command c = { };
u32 dwlen = nvme_bytes_to_numd(size);
@@ -3089,10 +3091,18 @@ int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp, u8 csi,
c.get_log_page.lpol = cpu_to_le32(lower_32_bits(offset));
c.get_log_page.lpou = cpu_to_le32(upper_32_bits(offset));
c.get_log_page.csi = csi;
+ c.get_log_page.lsi = cpu_to_le16(lsi);
return nvme_submit_sync_cmd(ctrl->admin_q, &c, log, size);
}
+int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp, u8 csi,
+ void *log, size_t size, u64 offset)
+{
+ return nvme_get_log_lsi(ctrl, nsid, log_page, lsp, csi, log, size,
+ offset, 0);
+}
+
static int nvme_get_effects_log(struct nvme_ctrl *ctrl, u8 csi,
struct nvme_effects_log **log)
{
--
2.43.5
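The helper fills the same Get Log Page dword fields as nvme_get_log(), plus LSI. The length and offset encoding it relies on reduces to plain arithmetic, sketched below (mirroring the kernel's nvme_bytes_to_numd() and the lpol/lpou split; the NVMe specification defines NUMD as a zero-based dword count):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Zero-based dword count for a transfer of len bytes (len must be a
 * multiple of 4), as nvme_bytes_to_numd() computes it. */
static uint32_t bytes_to_numd(size_t len)
{
	return (uint32_t)(len >> 2) - 1;
}

/* Split a byte offset into the LPOL/LPOU command dwords. */
static void split_log_offset(uint64_t off, uint32_t *lpol, uint32_t *lpou)
{
	*lpol = (uint32_t)off;
	*lpou = (uint32_t)(off >> 32);
}
```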
* [PATCHv11 09/10] nvme: register fdp queue limits
2024-12-06 1:52 [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
` (7 preceding siblings ...)
2024-12-06 1:53 ` [PATCHv11 08/10] nvme: add a nvme_get_log_lsi helper Keith Busch
@ 2024-12-06 1:53 ` Keith Busch
2024-12-06 5:26 ` kernel test robot
2024-12-06 1:53 ` [PATCHv11 10/10] nvme: use fdp streams if write stream is provided Keith Busch
` (2 subsequent siblings)
11 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2024-12-06 1:53 UTC (permalink / raw)
To: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring
Cc: sagi, asml.silence, Keith Busch, Keith Busch
From: Keith Busch <[email protected]>
Register the device data placement limits if supported. This is just
registering the limits with the block layer. Nothing beyond reporting
these attributes is happening in this patch.
Signed-off-by: Keith Busch <[email protected]>
---
drivers/nvme/host/core.c | 116 +++++++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 4 ++
include/linux/nvme.h | 73 ++++++++++++++++++++++++
3 files changed, 193 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 36c44be98e38c..410a77de92f88 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -38,6 +38,8 @@ struct nvme_ns_info {
u32 nsid;
__le32 anagrpid;
u8 pi_offset;
+ u16 endgid;
+ u64 runs;
bool is_shared;
bool is_readonly;
bool is_ready;
@@ -1613,6 +1615,7 @@ static int nvme_ns_info_from_identify(struct nvme_ctrl *ctrl,
info->is_shared = id->nmic & NVME_NS_NMIC_SHARED;
info->is_readonly = id->nsattr & NVME_NS_ATTR_RO;
info->is_ready = true;
+ info->endgid = le16_to_cpu(id->endgid);
if (ctrl->quirks & NVME_QUIRK_BOGUS_NID) {
dev_info(ctrl->device,
"Ignoring bogus Namespace Identifiers\n");
@@ -1653,6 +1656,7 @@ static int nvme_ns_info_from_id_cs_indep(struct nvme_ctrl *ctrl,
info->is_ready = id->nstat & NVME_NSTAT_NRDY;
info->is_rotational = id->nsfeat & NVME_NS_ROTATIONAL;
info->no_vwc = id->nsfeat & NVME_NS_VWC_NOT_PRESENT;
+ info->endgid = le16_to_cpu(id->endgid);
}
kfree(id);
return ret;
@@ -2147,6 +2151,101 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
return ret;
}
+static int nvme_check_fdp(struct nvme_ns *ns, struct nvme_ns_info *info,
+ u8 fdp_idx)
+{
+ struct nvme_fdp_config_log hdr, *h;
+ size_t size = sizeof(hdr);
+ int i, n, ret;
+ void *log;
+
+ info->runs = 0;
+ ret = nvme_get_log_lsi(ns->ctrl, 0, NVME_LOG_FDP_CONFIG, 0, NVME_CSI_NVM,
+ (void *)&hdr, size, 0, info->endgid);
+ if (ret)
+ return ret;
+
+ size = le32_to_cpu(hdr.sze);
+ h = kzalloc(size, GFP_KERNEL);
+ if (!h)
+ return 0;
+
+ ret = nvme_get_log_lsi(ns->ctrl, 0, NVME_LOG_FDP_CONFIG, 0, NVME_CSI_NVM,
+ h, size, 0, info->endgid);
+ if (ret)
+ goto out;
+
+ n = le16_to_cpu(h->numfdpc) + 1;
+ if (fdp_idx > n)
+ goto out;
+
+ log = h->configs;
+
+ for (i = 0; i < n; i++) {
+ struct nvme_fdp_config_desc *config = log;
+
+ if (i == fdp_idx) {
+ info->runs = le64_to_cpu(config->runs);
+ break;
+ }
+ log += le16_to_cpu(config->size);
+ }
+out:
+ kfree(h);
+ return ret;
+}
+
+static int nvme_query_fdp_info(struct nvme_ns *ns, struct nvme_ns_info *info)
+{
+ struct nvme_ns_head *head = ns->head;
+ struct nvme_fdp_ruh_status *ruhs;
+ struct nvme_command c = {};
+ u32 fdp, fdp_idx;
+ int size, ret;
+
+ ret = nvme_get_features(ns->ctrl, NVME_FEAT_FDP, info->endgid, NULL, 0,
+ &fdp);
+ if (ret)
+ goto err;
+
+ if (!(fdp & NVME_FDP_FDPE))
+ goto err;
+
+ fdp_idx = (fdp >> NVME_FDP_FDPCIDX_SHIFT) & NVME_FDP_FDPCIDX_MASK;
+ ret = nvme_check_fdp(ns, info, fdp_idx);
+ if (ret || !info->runs)
+ goto err;
+
+ size = struct_size(ruhs, ruhsd, NVME_MAX_PLIDS);
+ ruhs = kzalloc(size, GFP_KERNEL);
+ if (!ruhs) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ c.imr.opcode = nvme_cmd_io_mgmt_recv;
+ c.imr.nsid = cpu_to_le32(head->ns_id);
+ c.imr.mo = NVME_IO_MGMT_RECV_MO_RUHS;
+ c.imr.numd = cpu_to_le32(nvme_bytes_to_numd(size));
+ ret = nvme_submit_sync_cmd(ns->queue, &c, ruhs, size);
+ if (ret)
+ goto free;
+
+ head->nr_plids = min_t(u16, le16_to_cpu(ruhs->nruhsd), NVME_MAX_PLIDS);
+ if (!head->nr_plids)
+ goto free;
+
+ kfree(ruhs);
+ return 0;
+
+free:
+ kfree(ruhs);
+err:
+ head->nr_plids = 0;
+ info->runs = 0;
+ return ret;
+}
+
static int nvme_update_ns_info_block(struct nvme_ns *ns,
struct nvme_ns_info *info)
{
@@ -2183,6 +2282,15 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
goto out;
}
+ if (ns->ctrl->ctratt & NVME_CTRL_ATTR_FDPS) {
+ ret = nvme_query_fdp_info(ns, info);
+ if (ret)
+ dev_warn(ns->ctrl->device,
+ "FDP failure status:0x%x\n", ret);
+ if (ret < 0)
+ goto out;
+ }
+
blk_mq_freeze_queue(ns->disk->queue);
ns->head->lba_shift = id->lbaf[lbaf].ds;
ns->head->nuse = le64_to_cpu(id->nuse);
@@ -2216,6 +2324,12 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
if (!nvme_init_integrity(ns->head, &lim, info))
capacity = 0;
+ lim.max_write_streams = ns->head->nr_plids;
+ if (lim.max_write_streams)
+ lim.write_stream_granularity = info->runs;
+ else
+ lim.write_stream_granularity = 0;
+
ret = queue_limits_commit_update(ns->disk->queue, &lim);
if (ret) {
blk_mq_unfreeze_queue(ns->disk->queue);
@@ -2318,6 +2432,8 @@ static int nvme_update_ns_info(struct nvme_ns *ns, struct nvme_ns_info *info)
ns->head->disk->flags |= GENHD_FL_HIDDEN;
else
nvme_init_integrity(ns->head, &lim, info);
+ lim.max_write_streams = ns_lim->max_write_streams;
+ lim.write_stream_granularity = ns_lim->write_stream_granularity;
ret = queue_limits_commit_update(ns->head->disk->queue, &lim);
set_capacity_and_notify(ns->head->disk, get_capacity(ns->disk));
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 611b02c8a8b37..5c8bdaa2c8824 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -454,6 +454,8 @@ struct nvme_ns_ids {
u8 csi;
};
+#define NVME_MAX_PLIDS (S8_MAX - 1)
+
/*
* Anchor structure for namespaces. There is one for each namespace in a
* NVMe subsystem that any of our controllers can see, and the namespace
@@ -491,6 +493,8 @@ struct nvme_ns_head {
struct device cdev_device;
struct gendisk *disk;
+
+ u16 nr_plids;
#ifdef CONFIG_NVME_MULTIPATH
struct bio_list requeue_list;
spinlock_t requeue_lock;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 13377dde4527b..78657a8e39561 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -275,6 +275,7 @@ enum nvme_ctrl_attr {
NVME_CTRL_ATTR_HID_128_BIT = (1 << 0),
NVME_CTRL_ATTR_TBKAS = (1 << 6),
NVME_CTRL_ATTR_ELBAS = (1 << 15),
+ NVME_CTRL_ATTR_FDPS = (1 << 19),
};
struct nvme_id_ctrl {
@@ -761,6 +762,34 @@ struct nvme_zone_report {
struct nvme_zone_descriptor entries[];
};
+struct nvme_fdp_ruh_desc {
+ __u8 ruht;
+ __u8 rsvd1[3];
+};
+
+struct nvme_fdp_config_desc {
+ __le16 size;
+ __u8 fdpa;
+ __u8 vss;
+ __le32 nrg;
+ __le16 nruh;
+ __le16 maxpids;
+ __le32 nnss;
+ __le64 runs;
+ __le32 erutl;
+ __u8 rsvd28[36];
+ struct nvme_fdp_ruh_desc ruhs[];
+};
+
+struct nvme_fdp_config_log {
+ __le16 numfdpc;
+ __u8 ver;
+ __u8 rsvd3;
+ __le32 sze;
+ __u8 rsvd8[8];
+ struct nvme_fdp_config_desc configs[];
+};
+
enum {
NVME_SMART_CRIT_SPARE = 1 << 0,
NVME_SMART_CRIT_TEMPERATURE = 1 << 1,
@@ -887,6 +916,7 @@ enum nvme_opcode {
nvme_cmd_resv_register = 0x0d,
nvme_cmd_resv_report = 0x0e,
nvme_cmd_resv_acquire = 0x11,
+ nvme_cmd_io_mgmt_recv = 0x12,
nvme_cmd_resv_release = 0x15,
nvme_cmd_zone_mgmt_send = 0x79,
nvme_cmd_zone_mgmt_recv = 0x7a,
@@ -908,6 +938,7 @@ enum nvme_opcode {
nvme_opcode_name(nvme_cmd_resv_register), \
nvme_opcode_name(nvme_cmd_resv_report), \
nvme_opcode_name(nvme_cmd_resv_acquire), \
+ nvme_opcode_name(nvme_cmd_io_mgmt_recv), \
nvme_opcode_name(nvme_cmd_resv_release), \
nvme_opcode_name(nvme_cmd_zone_mgmt_send), \
nvme_opcode_name(nvme_cmd_zone_mgmt_recv), \
@@ -1059,6 +1090,7 @@ enum {
NVME_RW_PRINFO_PRCHK_GUARD = 1 << 12,
NVME_RW_PRINFO_PRACT = 1 << 13,
NVME_RW_DTYPE_STREAMS = 1 << 4,
+ NVME_RW_DTYPE_DPLCMT = 2 << 4,
NVME_WZ_DEAC = 1 << 9,
};
@@ -1146,6 +1178,38 @@ struct nvme_zone_mgmt_recv_cmd {
__le32 cdw14[2];
};
+struct nvme_io_mgmt_recv_cmd {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __le64 rsvd2[2];
+ union nvme_data_ptr dptr;
+ __u8 mo;
+ __u8 rsvd11;
+ __u16 mos;
+ __le32 numd;
+ __le32 cdw12[4];
+};
+
+enum {
+ NVME_IO_MGMT_RECV_MO_RUHS = 1,
+};
+
+struct nvme_fdp_ruh_status_desc {
+ u16 pid;
+ u16 ruhid;
+ u32 earutr;
+ u64 ruamw;
+ u8 rsvd16[16];
+};
+
+struct nvme_fdp_ruh_status {
+ u8 rsvd0[14];
+ __le16 nruhsd;
+ struct nvme_fdp_ruh_status_desc ruhsd[];
+};
+
enum {
NVME_ZRA_ZONE_REPORT = 0,
NVME_ZRASF_ZONE_REPORT_ALL = 0,
@@ -1281,6 +1345,7 @@ enum {
NVME_FEAT_PLM_WINDOW = 0x14,
NVME_FEAT_HOST_BEHAVIOR = 0x16,
NVME_FEAT_SANITIZE = 0x17,
+ NVME_FEAT_FDP = 0x1d,
NVME_FEAT_SW_PROGRESS = 0x80,
NVME_FEAT_HOST_ID = 0x81,
NVME_FEAT_RESV_MASK = 0x82,
@@ -1301,6 +1366,7 @@ enum {
NVME_LOG_ANA = 0x0c,
NVME_LOG_FEATURES = 0x12,
NVME_LOG_RMI = 0x16,
+ NVME_LOG_FDP_CONFIG = 0x20,
NVME_LOG_DISC = 0x70,
NVME_LOG_RESERVATION = 0x80,
NVME_FWACT_REPL = (0 << 3),
@@ -1326,6 +1392,12 @@ enum {
NVME_FIS_CSCPE = 1 << 21,
};
+enum {
+ NVME_FDP_FDPE = 1 << 0,
+ NVME_FDP_FDPCIDX_SHIFT = 8,
+ NVME_FDP_FDPCIDX_MASK = 0xff,
+};
+
/* NVMe Namespace Write Protect State */
enum {
NVME_NS_NO_WRITE_PROTECT = 0,
@@ -1888,6 +1960,7 @@ struct nvme_command {
struct nvmf_auth_receive_command auth_receive;
struct nvme_dbbuf dbbuf;
struct nvme_directive_cmd directive;
+ struct nvme_io_mgmt_recv_cmd imr;
};
};
--
2.43.5
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCHv11 10/10] nvme: use fdp streams if write stream is provided
2024-12-06 1:52 [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
` (8 preceding siblings ...)
2024-12-06 1:53 ` [PATCHv11 09/10] nvme: register fdp queue limits Keith Busch
@ 2024-12-06 1:53 ` Keith Busch
2024-12-06 13:18 ` kernel test robot
2024-12-06 2:18 ` [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
2024-12-09 12:51 ` Christoph Hellwig
11 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2024-12-06 1:53 UTC (permalink / raw)
To: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring
Cc: sagi, asml.silence, Keith Busch
From: Keith Busch <[email protected]>
Maps a user-requested write stream to an FDP placement ID if possible.
Signed-off-by: Keith Busch <[email protected]>
---
drivers/nvme/host/core.c | 32 +++++++++++++++++++++++++++++++-
drivers/nvme/host/nvme.h | 1 +
2 files changed, 32 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 410a77de92f88..c6f48403fc51c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -997,6 +997,18 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
if (req->cmd_flags & REQ_RAHEAD)
dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
+ if (op == nvme_cmd_write && ns->head->nr_plids) {
+ u16 write_stream = req->bio->bi_write_stream;
+
+ if (WARN_ON_ONCE(write_stream > ns->head->nr_plids))
+ return BLK_STS_INVAL;
+
+ if (write_stream) {
+ dsmgmt |= ns->head->plids[write_stream - 1] << 16;
+ control |= NVME_RW_DTYPE_DPLCMT;
+ }
+ }
+
if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req))
return BLK_STS_INVAL;
@@ -2197,11 +2209,12 @@ static int nvme_check_fdp(struct nvme_ns *ns, struct nvme_ns_info *info,
static int nvme_query_fdp_info(struct nvme_ns *ns, struct nvme_ns_info *info)
{
+ struct nvme_fdp_ruh_status_desc *ruhsd;
struct nvme_ns_head *head = ns->head;
struct nvme_fdp_ruh_status *ruhs;
struct nvme_command c = {};
u32 fdp, fdp_idx;
- int size, ret;
+ int size, ret, i;
ret = nvme_get_features(ns->ctrl, NVME_FEAT_FDP, info->endgid, NULL, 0,
&fdp);
@@ -2235,6 +2248,19 @@ static int nvme_query_fdp_info(struct nvme_ns *ns, struct nvme_ns_info *info)
if (!head->nr_plids)
goto free;
+ head->nr_plids = min(head->nr_plids, NVME_MAX_PLIDS);
+ head->plids = kcalloc(head->nr_plids, sizeof(head->plids),
+ GFP_KERNEL);
+ if (!head->plids) {
+ ret = -ENOMEM;
+ goto free;
+ }
+
+ for (i = 0; i < head->nr_plids; i++) {
+ ruhsd = &ruhs->ruhsd[i];
+ head->plids[i] = le16_to_cpu(ruhsd->pid);
+ }
+
kfree(ruhs);
return 0;
@@ -2289,6 +2315,10 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
"FDP failure status:0x%x\n", ret);
if (ret < 0)
goto out;
+ } else {
+ ns->head->nr_plids = 0;
+ kfree(ns->head->plids);
+ ns->head->plids = NULL;
}
blk_mq_freeze_queue(ns->disk->queue);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 5c8bdaa2c8824..4c12d35b3e39e 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -495,6 +495,7 @@ struct nvme_ns_head {
struct gendisk *disk;
u16 nr_plids;
+ u16 *plids;
#ifdef CONFIG_NVME_MULTIPATH
struct bio_list requeue_list;
spinlock_t requeue_lock;
--
2.43.5
* Re: [PATCHv11 00/10] block write streams with nvme fdp
2024-12-06 1:52 [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
` (9 preceding siblings ...)
2024-12-06 1:53 ` [PATCHv11 10/10] nvme: use fdp streams if write stream is provided Keith Busch
@ 2024-12-06 2:18 ` Keith Busch
2024-12-09 12:51 ` Christoph Hellwig
11 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2024-12-06 2:18 UTC (permalink / raw)
To: Keith Busch
Cc: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring,
sagi, asml.silence
On Thu, Dec 05, 2024 at 05:52:58PM -0800, Keith Busch wrote:
> Changes from v10:
>
> Merged up to block for-6.14/io_uring, which required some
> new attribute handling.
>
> Not mixing write hints usage with write streams. This effectively
> abandons any attempts to use the existing fcntl API for use with
> filesystems in this series.
>
> Exporting the stream's reclaim unit nominal size.
>
> Christoph Hellwig (5):
> fs: add a write stream field to the kiocb
> block: add a bi_write_stream field
> block: introduce a write_stream_granularity queue limit
> block: expose write streams for block device nodes
> nvme: add a nvme_get_log_lsi helper
>
> Keith Busch (5):
> io_uring: protection information enhancements
> io_uring: add write stream attribute
> block: introduce max_write_streams queue limit
> nvme: register fdp queue limits
> nvme: use fdp streams if write stream is provided
I fucked up the format-patch command by omitting a single patch. The
following should have been "PATCH 1/11", but I don't want to resend for
just this:
commit 9e40f4a4da6d0cef871d1c5daf55cc0497fd9c39
Author: Keith Busch <[email protected]>
Date: Tue Nov 19 13:16:15 2024 +0100
fs: add write stream information to statx
Add new statx field to report the maximum number of write streams
supported and the granularity for them.
Signed-off-by: Keith Busch <[email protected]>
[hch: renamed hints to streams, add granularity]
Signed-off-by: Christoph Hellwig <[email protected]>
diff --git a/fs/stat.c b/fs/stat.c
index 0870e969a8a0b..00e4598b1ff25 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -729,6 +729,8 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
tmp.stx_atomic_write_unit_min = stat->atomic_write_unit_min;
tmp.stx_atomic_write_unit_max = stat->atomic_write_unit_max;
tmp.stx_atomic_write_segments_max = stat->atomic_write_segments_max;
+ tmp.stx_write_stream_granularity = stat->write_stream_granularity;
+ tmp.stx_write_stream_max = stat->write_stream_max;
return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
}
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 3d900c86981c5..36d4dfb291abd 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -57,6 +57,8 @@ struct kstat {
u32 atomic_write_unit_min;
u32 atomic_write_unit_max;
u32 atomic_write_segments_max;
+ u32 write_stream_granularity;
+ u16 write_stream_max;
};
/* These definitions are internal to the kernel for now. Mainly used by nfsd. */
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 887a252864416..547c62a1a3a7c 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -132,9 +132,11 @@ struct statx {
__u32 stx_atomic_write_unit_max; /* Max atomic write unit in bytes */
/* 0xb0 */
__u32 stx_atomic_write_segments_max; /* Max atomic write segment count */
- __u32 __spare1[1];
+ __u32 stx_write_stream_granularity;
/* 0xb8 */
- __u64 __spare3[9]; /* Spare space for future expansion */
+ __u16 stx_write_stream_max;
+ __u16 __sparse2[3];
+ __u64 __spare3[8]; /* Spare space for future expansion */
/* 0x100 */
};
@@ -164,6 +166,7 @@ struct statx {
#define STATX_MNT_ID_UNIQUE 0x00004000U /* Want/got extended stx_mount_id */
#define STATX_SUBVOL 0x00008000U /* Want/got stx_subvol */
#define STATX_WRITE_ATOMIC 0x00010000U /* Want/got atomic_write_* fields */
+#define STATX_WRITE_STREAM 0x00020000U /* Want/got write_stream_* */
#define STATX__RESERVED 0x80000000U /* Reserved for future struct statx expansion */
* Re: [PATCHv11 09/10] nvme: register fdp queue limits
2024-12-06 1:53 ` [PATCHv11 09/10] nvme: register fdp queue limits Keith Busch
@ 2024-12-06 5:26 ` kernel test robot
0 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2024-12-06 5:26 UTC (permalink / raw)
To: Keith Busch, axboe, hch, linux-block, linux-nvme, linux-fsdevel,
io-uring
Cc: llvm, oe-kbuild-all, sagi, asml.silence, Keith Busch
Hi Keith,
kernel test robot noticed the following build warnings:
[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on next-20241205]
[cannot apply to brauner-vfs/vfs.all hch-configfs/for-next linus/master v6.13-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Keith-Busch/fs-add-a-write-stream-field-to-the-kiocb/20241206-095707
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20241206015308.3342386-10-kbusch%40meta.com
patch subject: [PATCHv11 09/10] nvme: register fdp queue limits
config: i386-buildonly-randconfig-003 (https://download.01.org/0day-ci/archive/20241206/[email protected]/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241206/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
In file included from drivers/nvme/host/core.c:8:
In file included from include/linux/blkdev.h:9:
In file included from include/linux/blk_types.h:10:
In file included from include/linux/bvec.h:10:
In file included from include/linux/highmem.h:8:
In file included from include/linux/cacheflush.h:5:
In file included from arch/x86/include/asm/cacheflush.h:5:
In file included from include/linux/mm.h:2223:
include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
518 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
| ~~~~~~~~~~~ ^ ~~~
>> drivers/nvme/host/core.c:2178:18: warning: variable 'h' is uninitialized when used here [-Wuninitialized]
2178 | n = le16_to_cpu(h->numfdpc) + 1;
| ^
include/linux/byteorder/generic.h:91:21: note: expanded from macro 'le16_to_cpu'
91 | #define le16_to_cpu __le16_to_cpu
| ^
include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
| ^
drivers/nvme/host/core.c:2157:36: note: initialize the variable 'h' to silence this warning
2157 | struct nvme_fdp_config_log hdr, *h;
| ^
| = NULL
2 warnings generated.
vim +/h +2178 drivers/nvme/host/core.c
2153
2154 static int nvme_check_fdp(struct nvme_ns *ns, struct nvme_ns_info *info,
2155 u8 fdp_idx)
2156 {
2157 struct nvme_fdp_config_log hdr, *h;
2158 size_t size = sizeof(hdr);
2159 int i, n, ret;
2160 void *log;
2161
2162 info->runs = 0;
2163 ret = nvme_get_log_lsi(ns->ctrl, 0, NVME_LOG_FDP_CONFIG, 0, NVME_CSI_NVM,
2164 (void *)&hdr, size, 0, info->endgid);
2165 if (ret)
2166 return ret;
2167
2168 size = le32_to_cpu(hdr.sze);
2169 log = kzalloc(size, GFP_KERNEL);
2170 if (!log)
2171 return 0;
2172
2173 ret = nvme_get_log_lsi(ns->ctrl, 0, NVME_LOG_FDP_CONFIG, 0, NVME_CSI_NVM,
2174 log, size, 0, info->endgid);
2175 if (ret)
2176 goto out;
2177
> 2178 n = le16_to_cpu(h->numfdpc) + 1;
2179 if (fdp_idx > n)
2180 goto out;
2181
2182 h = log;
2183 log = h->configs;
2184 for (i = 0; i < n; i++) {
2185 struct nvme_fdp_config_desc *config = log;
2186
2187 if (i == fdp_idx) {
2188 info->runs = le64_to_cpu(config->runs);
2189 break;
2190 }
2191 log += le16_to_cpu(config->size);
2192 }
2193 out:
2194 kfree(h);
2195 return ret;
2196 }
2197
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCHv11 07/10] block: expose write streams for block device nodes
[not found] ` <CGME20241206091949epcas5p14a01e4cfe614ddd04e23b84f8f1036d5@epcas5p1.samsung.com>
@ 2024-12-06 9:11 ` Nitesh Shetty
0 siblings, 0 replies; 24+ messages in thread
From: Nitesh Shetty @ 2024-12-06 9:11 UTC (permalink / raw)
To: Keith Busch
Cc: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring,
sagi, asml.silence, Keith Busch
On 05/12/24 05:53PM, Keith Busch wrote:
>From: Christoph Hellwig <[email protected]>
>
>Export statx information about the number and granularity of write
>streams, use the per-kiocb write hint and map temperature hints to write
>streams (which is a bit questionable, but this shows how it is done).
>
>Signed-off-by: Christoph Hellwig <[email protected]>
>Signed-off-by: Keith Busch <[email protected]>
>---
> block/bdev.c | 6 ++++++
> block/fops.c | 23 +++++++++++++++++++++++
> 2 files changed, 29 insertions(+)
>
>diff --git a/block/bdev.c b/block/bdev.c
>index 738e3c8457e7f..c23245f1fdfe3 100644
>--- a/block/bdev.c
>+++ b/block/bdev.c
>@@ -1296,6 +1296,12 @@ void bdev_statx(struct path *path, struct kstat *stat,
> stat->result_mask |= STATX_DIOALIGN;
> }
>
>+ if ((request_mask & STATX_WRITE_STREAM) &&
A check at the start of the function needs to be relaxed for this to
work, something like this:
- if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC)))
+ if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC |
+ STATX_WRITE_STREAM)))
return;
>+ bdev_max_write_streams(bdev)) {
>+ stat->write_stream_max = bdev_max_write_streams(bdev);
I think write_stream_granularity needs to be added.
stat->write_stream_granularity = bdev_write_stream_granularity(bdev);
Otherwise, patch looks good to me.
--Nitesh Shetty
* Re: [PATCHv11 02/10] io_uring: protection information enhancements
[not found] ` <CGME20241206095739epcas5p1ee968cb92c9d4ceb25a79ad80521601f@epcas5p1.samsung.com>
@ 2024-12-06 9:49 ` Anuj Gupta
0 siblings, 0 replies; 24+ messages in thread
From: Anuj Gupta @ 2024-12-06 9:49 UTC (permalink / raw)
To: Keith Busch
Cc: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring,
sagi, asml.silence, Keith Busch
On 05/12/24 05:53PM, Keith Busch wrote:
>From: Keith Busch <[email protected]>
>
>diff --git a/io_uring/rw.c b/io_uring/rw.c
>index 04e4467ab0ee8..a2987aefb2cec 100644
>--- a/io_uring/rw.c
>+++ b/io_uring/rw.c
>@@ -272,14 +272,14 @@ static inline void io_meta_restore(struct io_async_rw *io, struct kiocb *kiocb)
> }
>
> static int io_prep_rw_pi(struct io_kiocb *req, struct io_rw *rw, int ddir,
>- u64 attr_ptr, u64 attr_type_mask)
>+ u64 *attr_ptr)
> {
> struct io_uring_attr_pi pi_attr;
> struct io_async_rw *io;
> int ret;
>
>- if (copy_from_user(&pi_attr, u64_to_user_ptr(attr_ptr),
>- sizeof(pi_attr)))
>+ if (copy_from_user(&pi_attr, u64_to_user_ptr(*attr_ptr),
>+ sizeof(pi_attr)))
> return -EFAULT;
>
> if (pi_attr.rsvd)
>@@ -295,6 +295,7 @@ static int io_prep_rw_pi(struct io_kiocb *req, struct io_rw *rw, int ddir,
> return ret;
> rw->kiocb.ki_flags |= IOCB_HAS_METADATA;
> io_meta_save_state(io);
>+ *attr_ptr += sizeof(pi_attr);
> return ret;
> }
>
>@@ -335,8 +336,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> if (attr_type_mask) {
> u64 attr_ptr;
>
>- /* only PI attribute is supported currently */
>- if (attr_type_mask != IORING_RW_ATTR_FLAG_PI)
>+ if (attr_type_mask & ~IORING_RW_ATTR_FLAGS_SUPPORTED)
> return -EINVAL;
>
> attr_ptr = READ_ONCE(sqe->attr_ptr);
>--
Nit:
Although the next patch does it, the call to io_prep_rw_pi should
pass a u64 pointer in this patch itself.
* Re: [PATCHv11 03/10] io_uring: add write stream attribute
[not found] ` <CGME20241206100326epcas5p17d4dad663ccc6c6f40cfab98437e63f3@epcas5p1.samsung.com>
@ 2024-12-06 9:55 ` Anuj Gupta
0 siblings, 0 replies; 24+ messages in thread
From: Anuj Gupta @ 2024-12-06 9:55 UTC (permalink / raw)
To: Keith Busch
Cc: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring,
sagi, asml.silence, Keith Busch
On 05/12/24 05:53PM, Keith Busch wrote:
>From: Keith Busch <[email protected]>
>
>Adds a new attribute type to specify a write stream per-IO.
>
>Signed-off-by: Keith Busch <[email protected]>
>---
> include/uapi/linux/io_uring.h | 9 ++++++++-
> io_uring/rw.c | 28 +++++++++++++++++++++++++++-
> 2 files changed, 35 insertions(+), 2 deletions(-)
>
>diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>index 5fa38467d6070..263cd57aae72d 100644
>--- a/include/uapi/linux/io_uring.h
>+++ b/include/uapi/linux/io_uring.h
>@@ -123,7 +123,14 @@ struct io_uring_attr_pi {
> __u64 rsvd;
> };
>
>-#define IORING_RW_ATTR_FLAGS_SUPPORTED (IORING_RW_ATTR_FLAG_PI)
>+#define IORING_RW_ATTR_FLAG_WRITE_STREAM (1U << 1)
>+struct io_uring_write_stream {
Nit:
You can consider keeping a io_uring_attr_* prefix here, so that it aligns
with current attribute naming style.
s/io_uring_write_stream/io_uring_attr_write_stream
>+ __u16 write_stream;
>+ __u8 rsvd[6];
>+};
>+
>+#define IORING_RW_ATTR_FLAGS_SUPPORTED (IORING_RW_ATTR_FLAG_PI | \
>+ IORING_RW_ATTR_FLAG_WRITE_STREAM)
>
> /*
> * If sqe->file_index is set to this for opcodes that instantiate a new
>diff --git a/io_uring/rw.c b/io_uring/rw.c
>index a2987aefb2cec..69b566e296f6d 100644
>--- a/io_uring/rw.c
>+++ b/io_uring/rw.c
>@@ -299,6 +299,22 @@ static int io_prep_rw_pi(struct io_kiocb *req, struct io_rw *rw, int ddir,
> return ret;
> }
>
>+static int io_prep_rw_write_stream(struct io_rw *rw, u64 *attr_ptr)
>+{
>+ struct io_uring_write_stream write_stream;
>+
>+ if (copy_from_user(&write_stream, u64_to_user_ptr(*attr_ptr),
>+ sizeof(write_stream)))
>+ return -EFAULT;
>+
>+ if (!memchr_inv(write_stream.rsvd, 0, sizeof(write_stream.rsvd)))
This should be:
if (memchr_inv(write_stream.rsvd, 0, sizeof(write_stream.rsvd)))
* Re: [PATCHv11 03/10] io_uring: add write stream attribute
2024-12-06 1:53 ` [PATCHv11 03/10] io_uring: add write stream attribute Keith Busch
[not found] ` <CGME20241206100326epcas5p17d4dad663ccc6c6f40cfab98437e63f3@epcas5p1.samsung.com>
@ 2024-12-06 12:44 ` Kanchan Joshi
2024-12-06 16:53 ` Keith Busch
1 sibling, 1 reply; 24+ messages in thread
From: Kanchan Joshi @ 2024-12-06 12:44 UTC (permalink / raw)
To: Keith Busch, axboe, hch, linux-block, linux-nvme, linux-fsdevel,
io-uring
Cc: sagi, asml.silence, Keith Busch
On 12/6/2024 7:23 AM, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> Adds a new attribute type to specify a write stream per-IO.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> include/uapi/linux/io_uring.h | 9 ++++++++-
> io_uring/rw.c | 28 +++++++++++++++++++++++++++-
> 2 files changed, 35 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 5fa38467d6070..263cd57aae72d 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -123,7 +123,14 @@ struct io_uring_attr_pi {
> __u64 rsvd;
> };
>
> -#define IORING_RW_ATTR_FLAGS_SUPPORTED (IORING_RW_ATTR_FLAG_PI)
> +#define IORING_RW_ATTR_FLAG_WRITE_STREAM (1U << 1)
> +struct io_uring_write_stream {
> + __u16 write_stream;
> + __u8 rsvd[6];
> +};
So this needs 8 bytes. Maybe passing just 'u16 write_stream' is better?
Or do you expect future additions here (to keep rsvd).
Optimization is possible (now or in future) if it's 4 bytes or smaller,
as that can be placed in SQE along with a new RW attribute flag that
says it's placed inline. Like this -
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -92,6 +92,10 @@ struct io_uring_sqe {
__u16 addr_len;
__u16 __pad3[1];
};
+ struct {
+ __u16 write_hint;
+ __u16 __rsvd[1];
+ };
};
union {
struct {
@@ -113,6 +117,7 @@ struct io_uring_sqe {
/* sqe->attr_type_mask flags */
#define IORING_RW_ATTR_FLAG_PI (1U << 0)
+#define IORING_RW_ATTR_FLAG_WRITE_STREAM_INLINE (1U << 1)
* Re: [PATCHv11 10/10] nvme: use fdp streams if write stream is provided
2024-12-06 1:53 ` [PATCHv11 10/10] nvme: use fdp streams if write stream is provided Keith Busch
@ 2024-12-06 13:18 ` kernel test robot
0 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2024-12-06 13:18 UTC (permalink / raw)
To: Keith Busch, axboe, hch, linux-block, linux-nvme, linux-fsdevel,
io-uring
Cc: oe-kbuild-all, sagi, asml.silence, Keith Busch
Hi Keith,
kernel test robot noticed the following build warnings:
[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on next-20241205]
[cannot apply to brauner-vfs/vfs.all hch-configfs/for-next linus/master v6.13-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Keith-Busch/fs-add-a-write-stream-field-to-the-kiocb/20241206-095707
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20241206015308.3342386-11-kbusch%40meta.com
patch subject: [PATCHv11 10/10] nvme: use fdp streams if write stream is provided
config: i386-randconfig-061 (https://download.01.org/0day-ci/archive/20241206/[email protected]/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241206/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
sparse warnings: (new ones prefixed by >>)
drivers/nvme/host/core.c: note: in included file (through drivers/nvme/host/nvme.h):
include/linux/nvme.h:790:44: sparse: sparse: array of flexible structures
>> drivers/nvme/host/core.c:2261:34: sparse: sparse: cast to restricted __le16
drivers/nvme/host/core.c: note: in included file (through include/linux/async.h):
include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
vim +2261 drivers/nvme/host/core.c
2209
2210 static int nvme_query_fdp_info(struct nvme_ns *ns, struct nvme_ns_info *info)
2211 {
2212 struct nvme_fdp_ruh_status_desc *ruhsd;
2213 struct nvme_ns_head *head = ns->head;
2214 struct nvme_fdp_ruh_status *ruhs;
2215 struct nvme_command c = {};
2216 u32 fdp, fdp_idx;
2217 int size, ret, i;
2218
2219 ret = nvme_get_features(ns->ctrl, NVME_FEAT_FDP, info->endgid, NULL, 0,
2220 &fdp);
2221 if (ret)
2222 goto err;
2223
2224 if (!(fdp & NVME_FDP_FDPE))
2225 goto err;
2226
2227 fdp_idx = (fdp >> NVME_FDP_FDPCIDX_SHIFT) & NVME_FDP_FDPCIDX_MASK;
2228 ret = nvme_check_fdp(ns, info, fdp_idx);
2229 if (ret || !info->runs)
2230 goto err;
2231
2232 size = struct_size(ruhs, ruhsd, NVME_MAX_PLIDS);
2233 ruhs = kzalloc(size, GFP_KERNEL);
2234 if (!ruhs) {
2235 ret = -ENOMEM;
2236 goto err;
2237 }
2238
2239 c.imr.opcode = nvme_cmd_io_mgmt_recv;
2240 c.imr.nsid = cpu_to_le32(head->ns_id);
2241 c.imr.mo = NVME_IO_MGMT_RECV_MO_RUHS;
2242 c.imr.numd = cpu_to_le32(nvme_bytes_to_numd(size));
2243 ret = nvme_submit_sync_cmd(ns->queue, &c, ruhs, size);
2244 if (ret)
2245 goto free;
2246
2247 head->nr_plids = le16_to_cpu(ruhs->nruhsd);
2248 if (!head->nr_plids)
2249 goto free;
2250
2251 head->nr_plids = min(head->nr_plids, NVME_MAX_PLIDS);
2252 head->plids = kcalloc(head->nr_plids, sizeof(head->plids),
2253 GFP_KERNEL);
2254 if (!head->plids) {
2255 ret = -ENOMEM;
2256 goto free;
2257 }
2258
2259 for (i = 0; i < head->nr_plids; i++) {
2260 ruhsd = &ruhs->ruhsd[i];
> 2261 head->plids[i] = le16_to_cpu(ruhsd->pid);
2262 }
2263
2264 kfree(ruhs);
2265 return 0;
2266
2267 free:
2268 kfree(ruhs);
2269 err:
2270 head->nr_plids = 0;
2271 info->runs = 0;
2272 return ret;
2273 }
2274
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCHv11 03/10] io_uring: add write stream attribute
2024-12-06 12:44 ` Kanchan Joshi
@ 2024-12-06 16:53 ` Keith Busch
0 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2024-12-06 16:53 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Keith Busch, axboe, hch, linux-block, linux-nvme, linux-fsdevel,
io-uring, sagi, asml.silence
On Fri, Dec 06, 2024 at 06:14:29PM +0530, Kanchan Joshi wrote:
> On 12/6/2024 7:23 AM, Keith Busch wrote:
> > From: Keith Busch <[email protected]>
> >
> > Adds a new attribute type to specify a write stream per-IO.
> >
> > Signed-off-by: Keith Busch <[email protected]>
> > ---
> > include/uapi/linux/io_uring.h | 9 ++++++++-
> > io_uring/rw.c | 28 +++++++++++++++++++++++++++-
> > 2 files changed, 35 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> > index 5fa38467d6070..263cd57aae72d 100644
> > --- a/include/uapi/linux/io_uring.h
> > +++ b/include/uapi/linux/io_uring.h
> > @@ -123,7 +123,14 @@ struct io_uring_attr_pi {
> > __u64 rsvd;
> > };
> >
> > -#define IORING_RW_ATTR_FLAGS_SUPPORTED (IORING_RW_ATTR_FLAG_PI)
> > +#define IORING_RW_ATTR_FLAG_WRITE_STREAM (1U << 1)
> > +struct io_uring_write_stream {
> > + __u16 write_stream;
> > + __u8 rsvd[6];
> > +};
>
> So this needs 8 bytes. Maybe passing just 'u16 write_stream' is better?
> Or do you expect future additions here (to keep rsvd).
I don't have any plans to use it. It's just padded for alignment. I am
not sure what future attributes might be proposed, but I don't want to
force them to align to a 2-byte boundary.
> Optimization is possible (now or in future) if it's 4 bytes or smaller,
> as that can be placed in SQE along with a new RW attribute flag that
> says it's placed inline. Like this -
Oh, that's definitely preferred IMO, because it is that much easier to
reach the capability. Previous versions of this proposal had the field
in the next union, so for some reason I thought this union you're showing
here was unavailable for new fields, but it looks like it's unused for
read/write. So, yeah, let's put it in the sqe if there's no conflict
here.
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -92,6 +92,10 @@ struct io_uring_sqe {
> __u16 addr_len;
> __u16 __pad3[1];
> };
> + struct {
> + __u16 write_hint;
> + __u16 __rsvd[1];
> + };
> };
> union {
> struct {
* Re: [PATCHv11 00/10] block write streams with nvme fdp
2024-12-06 1:52 [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
` (10 preceding siblings ...)
2024-12-06 2:18 ` [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
@ 2024-12-09 12:51 ` Christoph Hellwig
2024-12-09 15:57 ` Keith Busch
2024-12-09 17:14 ` [EXT] " Pierre Labat
11 siblings, 2 replies; 24+ messages in thread
From: Christoph Hellwig @ 2024-12-09 12:51 UTC (permalink / raw)
To: Keith Busch
Cc: axboe, hch, linux-block, linux-nvme, linux-fsdevel, io-uring,
sagi, asml.silence, Keith Busch
Note: I skipped back to this because v12 only had changelog changes vs v11.
On Thu, Dec 05, 2024 at 05:52:58PM -0800, Keith Busch wrote:
>
> Not mixing write hints usage with write streams. This effectively
> abandons any attempts to use the existing fcntl API for use with
> filesystems in this series.
That's not true as far as I can tell given that this is basically the
architecture from my previous posting. The block code still maps the
rw hints into write streams, and file systems can do exactly the same.
You just need to talk to the fs maintainers and convince them it's a
good thing for their particular file system. Especially for simple
file systems that task should not be too hard, even if they might want
to set a stream or two aside for fs usage. Similarly a file system
can implement the stream based API.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCHv11 00/10] block write streams with nvme fdp
2024-12-09 12:51 ` Christoph Hellwig
@ 2024-12-09 15:57 ` Keith Busch
2024-12-09 17:14 ` [EXT] " Pierre Labat
1 sibling, 0 replies; 24+ messages in thread
From: Keith Busch @ 2024-12-09 15:57 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, axboe, linux-block, linux-nvme, linux-fsdevel,
io-uring, sagi, asml.silence
On Mon, Dec 09, 2024 at 01:51:32PM +0100, Christoph Hellwig wrote:
> On Thu, Dec 05, 2024 at 05:52:58PM -0800, Keith Busch wrote:
> >
> > Not mixing write hints usage with write streams. This effectively
> > abandons any attempts to use the existing fcntl API for use with
> > filesystems in this series.
>
> That's not true as far as I can tell given that this is basically the
> architecture from my previous posting. The block code still maps the
> rw hints into write streams, and file systems can do exactly the same.
> You just need to talk to the fs maintainers and convince them it's a
> good thing for their particular file system. Especially for simple
> file systems that task should not be too hard, even if they might want
> to set a stream or two aside for fs usage. Similarly a file system
> can implement the stream based API.
Sorry for my confusing message here. I meant *this series* doesn't
attempt to use streams with filesystems (I wasn't considering raw block
in the same category as a traditional filesystem).
I am not abandoning follow-on efforts to make use of these elsewhere. I
just don't want the open topics to distract from the less controversial
parts, and this series doesn't prevent or harm future innovations there,
so I think we're pretty well aligned up to this point.
^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: [EXT] Re: [PATCHv11 00/10] block write streams with nvme fdp
2024-12-09 12:51 ` Christoph Hellwig
2024-12-09 15:57 ` Keith Busch
@ 2024-12-09 17:14 ` Pierre Labat
2024-12-09 17:25 ` Keith Busch
1 sibling, 1 reply; 24+ messages in thread
From: Pierre Labat @ 2024-12-09 17:14 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch
Cc: [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], Keith Busch
Micron Confidential
Hi,
I was under the impression that passing write hints via fcntl() on any legacy filesystem stays as-is. The hint is attached to the inode, and the fs simply picks it up from there when sending down writes related to that inode.
Aka a per-file write hint.
Am I right?
Pierre
> -----Original Message-----
> From: Christoph Hellwig <[email protected]>
> Sent: Monday, December 9, 2024 4:52 AM
> To: Keith Busch <[email protected]>
> Cc: [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; io-
> [email protected]; [email protected]; [email protected]; Keith
> Busch <[email protected]>
> Subject: [EXT] Re: [PATCHv11 00/10] block write streams with nvme fdp
>
>
>
> Note: I skipped back to this because v12 only had the log vs v11.
>
> On Thu, Dec 05, 2024 at 05:52:58PM -0800, Keith Busch wrote:
> >
> > Not mixing write hints usage with write streams. This effectively
> > abandons any attempts to use the existing fcntl API for use with
> > filesystems in this series.
>
> That's not true as far as I can tell given that this is basically the architecture
> from my previous posting. The block code still maps the rw hints into write
> streams, and file systems can do exactly the same.
> You just need to talk to the fs maintainers and convince them it's a good thing
> for their particular file system. Especially for simple file systems that task
> should not be too hard, even if they might want to set a stream or two aside
> for fs usage. Similarly a file system can implement the stream based API.
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [EXT] Re: [PATCHv11 00/10] block write streams with nvme fdp
2024-12-09 17:14 ` [EXT] " Pierre Labat
@ 2024-12-09 17:25 ` Keith Busch
2024-12-09 17:35 ` Pierre Labat
0 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2024-12-09 17:25 UTC (permalink / raw)
To: Pierre Labat
Cc: Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On Mon, Dec 09, 2024 at 05:14:16PM +0000, Pierre Labat wrote:
> I was under the impression that passing write hints via fcntl() on any
> legacy filesystem stays. The hint is attached to the inode, and the fs
> simply picks it up from there when sending it down with write related
> to that inode.
> Aka per file write hint.
>
> I am right?
Nothing is changing with respect to those write hints as a result of
this series, if that's what you mean. The driver hadn't been checking
the write hint before, and this patch set continues that pre-existing
behavior. For this series, the driver utilizes a new field:
"write_stream".
Mapping the inode write hint to an FDP stream for other filesystems
remains an open topic to follow on later.
^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: [EXT] Re: [PATCHv11 00/10] block write streams with nvme fdp
2024-12-09 17:25 ` Keith Busch
@ 2024-12-09 17:35 ` Pierre Labat
0 siblings, 0 replies; 24+ messages in thread
From: Pierre Labat @ 2024-12-09 17:35 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Keith Busch, [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
Thanks Keith for the clarification.
If I got it right, it will be decided later by the filesystem maintainers whether they want to convert the write hint assigned to a file via fcntl() into the write_stream used by the block drivers (FDP for nvme).
Regards,
Pierre
> -----Original Message-----
> From: Keith Busch <[email protected]>
> Sent: Monday, December 9, 2024 9:25 AM
> To: Pierre Labat <[email protected]>
> Cc: Christoph Hellwig <[email protected]>; Keith Busch <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; io-
> [email protected]; [email protected]; [email protected]
> Subject: Re: [EXT] Re: [PATCHv11 00/10] block write streams with nvme fdp
>
>
>
> On Mon, Dec 09, 2024 at 05:14:16PM +0000, Pierre Labat wrote:
> > I was under the impression that passing write hints via fcntl() on any
> > legacy filesystem stays. The hint is attached to the inode, and the fs
> > simply picks it up from there when sending it down with write related
> > to that inode.
> > Aka per file write hint.
> >
> > I am right?
>
> Nothing is changing with respect to those write hints as a result of this series,
> if that's what you mean. The driver hadn't been checking the write hint before,
> and this patch set continues that pre-existing behavior. For this series, the
> driver utilizes a new field:
> "write_stream".
>
> Mapping the inode write hint to an FDP stream for other filesystems remains
> an open topic to follow on later.
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2024-12-09 17:35 UTC | newest]
Thread overview: 24+ messages
2024-12-06 1:52 [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
2024-12-06 1:52 ` [PATCHv11 01/10] fs: add a write stream field to the kiocb Keith Busch
2024-12-06 1:53 ` [PATCHv11 02/10] io_uring: protection information enhancements Keith Busch
[not found] ` <CGME20241206095739epcas5p1ee968cb92c9d4ceb25a79ad80521601f@epcas5p1.samsung.com>
2024-12-06 9:49 ` Anuj Gupta
2024-12-06 1:53 ` [PATCHv11 03/10] io_uring: add write stream attribute Keith Busch
[not found] ` <CGME20241206100326epcas5p17d4dad663ccc6c6f40cfab98437e63f3@epcas5p1.samsung.com>
2024-12-06 9:55 ` Anuj Gupta
2024-12-06 12:44 ` Kanchan Joshi
2024-12-06 16:53 ` Keith Busch
2024-12-06 1:53 ` [PATCHv11 04/10] block: add a bi_write_stream field Keith Busch
2024-12-06 1:53 ` [PATCHv11 05/10] block: introduce max_write_streams queue limit Keith Busch
2024-12-06 1:53 ` [PATCHv11 06/10] block: introduce a write_stream_granularity " Keith Busch
2024-12-06 1:53 ` [PATCHv11 07/10] block: expose write streams for block device nodes Keith Busch
[not found] ` <CGME20241206091949epcas5p14a01e4cfe614ddd04e23b84f8f1036d5@epcas5p1.samsung.com>
2024-12-06 9:11 ` Nitesh Shetty
2024-12-06 1:53 ` [PATCHv11 08/10] nvme: add a nvme_get_log_lsi helper Keith Busch
2024-12-06 1:53 ` [PATCHv11 09/10] nvme: register fdp queue limits Keith Busch
2024-12-06 5:26 ` kernel test robot
2024-12-06 1:53 ` [PATCHv11 10/10] nvme: use fdp streams if write stream is provided Keith Busch
2024-12-06 13:18 ` kernel test robot
2024-12-06 2:18 ` [PATCHv11 00/10] block write streams with nvme fdp Keith Busch
2024-12-09 12:51 ` Christoph Hellwig
2024-12-09 15:57 ` Keith Busch
2024-12-09 17:14 ` [EXT] " Pierre Labat
2024-12-09 17:25 ` Keith Busch
2024-12-09 17:35 ` Pierre Labat