public inbox for [email protected]
* [PATCH v7 0/3] FDP and per-io hints
       [not found] <CGME20240930182052epcas5p37edefa7556b87c3fbb543275756ac736@epcas5p3.samsung.com>
@ 2024-09-30 18:13 ` Kanchan Joshi
       [not found]   ` <CGME20240930182056epcas5p33f823c00caadf9388b509bafcad86f3d@epcas5p3.samsung.com>
                     ` (4 more replies)
  0 siblings, 5 replies; 44+ messages in thread
From: Kanchan Joshi @ 2024-09-30 18:13 UTC (permalink / raw)
  To: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi

Another spin to incorporate the feedback from LPC and previous
iterations. The series adds two capabilities:
- FDP support at NVMe level (patch #1)
- Per-io hinting via io_uring (patch #3)
Patch #2 is needed to do per-io hints.

The motivation and interface details are present in the commit
descriptions.

Testing:
Done with fcntl- and liburing-based custom applications.
On raw block devices, ext4, xfs, btrfs and F2FS.
Checked that no regression occurs for applications that use per-inode
hints.
Checked that per-io hints, when passed, take precedence over per-inode
hints.
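
For reference, a minimal sketch of the per-inode path exercised by the
fcntl-based tests (illustrative test code only, not part of this series)
could look like:

	#define _GNU_SOURCE
	#include <stdint.h>
	#include <fcntl.h>	/* F_SET_RW_HINT, RWH_WRITE_LIFE_*; may need <linux/fcntl.h> */

	/* Per-inode hint: all subsequent writes to fd inherit this value. */
	static int set_inode_hint(int fd)
	{
		uint64_t hint = RWH_WRITE_LIFE_SHORT;

		return fcntl(fd, F_SET_RW_HINT, &hint);
	}

A per-io hint passed through the io_uring interface of patch #3 is
expected to override this per-inode value for that write only.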

Changes since v6:
- Change io_uring interface to pass hints as SQE metadata (Pavel, Hannes)

Changes since v5:
- Drop placement hints
- Add per-io hint interface

Changes since v4:
- Retain the size/type checking on the enum (Bart)
- Use the name "*_lifetime_hint" rather than "*_life_hint" (Bart)

Changes since v3:
- 4 new patches to introduce placement hints
- Make nvme patch use the placement hints rather than lifetime hints

Changes since v2:
- Base it on nvme-6.11 and resolve a merge conflict

Changes since v1:
- Reduce the fetched plids from 128 to 6 (Keith)
- Use struct_size for a calculation (Keith)
- Handle robot/sparse warning

Kanchan Joshi (3):
  nvme: enable FDP support
  block, fs: restore kiocb based write hint processing
  io_uring: enable per-io hinting capability

 block/fops.c                  |  6 +--
 drivers/nvme/host/core.c      | 70 +++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h      |  4 ++
 fs/aio.c                      |  1 +
 fs/cachefiles/io.c            |  1 +
 fs/direct-io.c                |  2 +-
 fs/fcntl.c                    | 22 -----------
 fs/iomap/direct-io.c          |  2 +-
 include/linux/fs.h            |  8 ++++
 include/linux/nvme.h          | 19 ++++++++++
 include/linux/rw_hint.h       | 24 ++++++++++++
 include/uapi/linux/io_uring.h | 19 ++++++++++
 io_uring/rw.c                 | 24 ++++++++++++
 13 files changed, 175 insertions(+), 27 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v7 1/3] nvme: enable FDP support
       [not found]   ` <CGME20240930182056epcas5p33f823c00caadf9388b509bafcad86f3d@epcas5p3.samsung.com>
@ 2024-09-30 18:13     ` Kanchan Joshi
  2024-10-02 18:37       ` Bart Van Assche
  0 siblings, 1 reply; 44+ messages in thread
From: Kanchan Joshi @ 2024-09-30 18:13 UTC (permalink / raw)
  To: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi, Hui Qi,
	Nitesh Shetty

Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
to control the placement of logical blocks so as to reduce the SSD WAF.

Userspace can send the data lifetime information using the write hints.
The SCSI driver (sd) can already pass this information to the SCSI
devices. This patch does the same for NVMe.

Fetch the placement-identifiers if the device supports FDP.
The incoming write-hint is mapped to a placement-identifier, which in
turn is set in the DSPEC field of the write command.

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Hui Qi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Nacked-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
---
 drivers/nvme/host/core.c | 70 ++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  4 +++
 include/linux/nvme.h     | 19 +++++++++++
 3 files changed, 93 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ba6508455e18..ad5cc1ec8c4f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -44,6 +44,20 @@ struct nvme_ns_info {
 	bool is_removed;
 };
 
+struct nvme_fdp_ruh_status_desc {
+	u16 pid;
+	u16 ruhid;
+	u32 earutr;
+	u64 ruamw;
+	u8  rsvd16[16];
+};
+
+struct nvme_fdp_ruh_status {
+	u8  rsvd0[14];
+	__le16 nruhsd;
+	struct nvme_fdp_ruh_status_desc ruhsd[];
+};
+
 unsigned int admin_timeout = 60;
 module_param(admin_timeout, uint, 0644);
 MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
@@ -959,6 +973,19 @@ static bool nvme_valid_atomic_write(struct request *req)
 	return true;
 }
 
+static inline void nvme_assign_placement_id(struct nvme_ns *ns,
+					struct request *req,
+					struct nvme_command *cmd)
+{
+	enum rw_hint h = req->write_hint;
+
+	if (h >= ns->head->nr_plids)
+		return;
+
+	cmd->rw.control |= cpu_to_le16(NVME_RW_DTYPE_DPLCMT);
+	cmd->rw.dsmgmt |= cpu_to_le32(ns->head->plids[h] << 16);
+}
+
 static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 		struct request *req, struct nvme_command *cmnd,
 		enum nvme_opcode op)
@@ -1078,6 +1105,8 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
 		break;
 	case REQ_OP_WRITE:
 		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
+		if (!ret && ns->head->nr_plids)
+			nvme_assign_placement_id(ns, req, cmd);
 		break;
 	case REQ_OP_ZONE_APPEND:
 		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
@@ -2114,6 +2143,40 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
 	return ret;
 }
 
+static int nvme_fetch_fdp_plids(struct nvme_ns *ns, u32 nsid)
+{
+	struct nvme_command c = {};
+	struct nvme_fdp_ruh_status *ruhs;
+	struct nvme_fdp_ruh_status_desc *ruhsd;
+	int size, ret, i;
+
+	size = struct_size(ruhs, ruhsd, NVME_MAX_PLIDS);
+	ruhs = kzalloc(size, GFP_KERNEL);
+	if (!ruhs)
+		return -ENOMEM;
+
+	c.imr.opcode = nvme_cmd_io_mgmt_recv;
+	c.imr.nsid = cpu_to_le32(nsid);
+	c.imr.mo = 0x1;
+	c.imr.numd =  cpu_to_le32((size >> 2) - 1);
+
+	ret = nvme_submit_sync_cmd(ns->queue, &c, ruhs, size);
+	if (ret)
+		goto out;
+
+	ns->head->nr_plids = le16_to_cpu(ruhs->nruhsd);
+	ns->head->nr_plids =
+		min_t(u16, ns->head->nr_plids, NVME_MAX_PLIDS);
+
+	for (i = 0; i < ns->head->nr_plids; i++) {
+		ruhsd = &ruhs->ruhsd[i];
+		ns->head->plids[i] = le16_to_cpu(ruhsd->pid);
+	}
+out:
+	kfree(ruhs);
+	return ret;
+}
+
 static int nvme_update_ns_info_block(struct nvme_ns *ns,
 		struct nvme_ns_info *info)
 {
@@ -2205,6 +2268,13 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 		if (ret && !nvme_first_scan(ns->disk))
 			goto out;
 	}
+	if (ns->ctrl->ctratt & NVME_CTRL_ATTR_FDPS) {
+		ret = nvme_fetch_fdp_plids(ns, info->nsid);
+		if (ret)
+			dev_warn(ns->ctrl->device,
+				"FDP failure status:0x%x\n", ret);
+	}
+
 
 	ret = 0;
 out:
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 313a4f978a2c..a959a9859e8b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -454,6 +454,8 @@ struct nvme_ns_ids {
 	u8	csi;
 };
 
+#define NVME_MAX_PLIDS   (WRITE_LIFE_EXTREME + 1)
+
 /*
  * Anchor structure for namespaces.  There is one for each namespace in a
  * NVMe subsystem that any of our controllers can see, and the namespace
@@ -490,6 +492,8 @@ struct nvme_ns_head {
 	struct device		cdev_device;
 
 	struct gendisk		*disk;
+	u16			nr_plids;
+	u16			plids[NVME_MAX_PLIDS];
 #ifdef CONFIG_NVME_MULTIPATH
 	struct bio_list		requeue_list;
 	spinlock_t		requeue_lock;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b58d9405d65e..a954eaee5b0f 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -275,6 +275,7 @@ enum nvme_ctrl_attr {
 	NVME_CTRL_ATTR_HID_128_BIT	= (1 << 0),
 	NVME_CTRL_ATTR_TBKAS		= (1 << 6),
 	NVME_CTRL_ATTR_ELBAS		= (1 << 15),
+	NVME_CTRL_ATTR_FDPS		= (1 << 19),
 };
 
 struct nvme_id_ctrl {
@@ -843,6 +844,7 @@ enum nvme_opcode {
 	nvme_cmd_resv_register	= 0x0d,
 	nvme_cmd_resv_report	= 0x0e,
 	nvme_cmd_resv_acquire	= 0x11,
+	nvme_cmd_io_mgmt_recv	= 0x12,
 	nvme_cmd_resv_release	= 0x15,
 	nvme_cmd_zone_mgmt_send	= 0x79,
 	nvme_cmd_zone_mgmt_recv	= 0x7a,
@@ -864,6 +866,7 @@ enum nvme_opcode {
 		nvme_opcode_name(nvme_cmd_resv_register),	\
 		nvme_opcode_name(nvme_cmd_resv_report),		\
 		nvme_opcode_name(nvme_cmd_resv_acquire),	\
+		nvme_opcode_name(nvme_cmd_io_mgmt_recv),	\
 		nvme_opcode_name(nvme_cmd_resv_release),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_send),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_recv),	\
@@ -1015,6 +1018,7 @@ enum {
 	NVME_RW_PRINFO_PRCHK_GUARD	= 1 << 12,
 	NVME_RW_PRINFO_PRACT		= 1 << 13,
 	NVME_RW_DTYPE_STREAMS		= 1 << 4,
+	NVME_RW_DTYPE_DPLCMT		= 2 << 4,
 	NVME_WZ_DEAC			= 1 << 9,
 };
 
@@ -1102,6 +1106,20 @@ struct nvme_zone_mgmt_recv_cmd {
 	__le32			cdw14[2];
 };
 
+struct nvme_io_mgmt_recv_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__le64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__u8			mo;
+	__u8			rsvd11;
+	__u16			mos;
+	__le32			numd;
+	__le32			cdw12[4];
+};
+
 enum {
 	NVME_ZRA_ZONE_REPORT		= 0,
 	NVME_ZRASF_ZONE_REPORT_ALL	= 0,
@@ -1822,6 +1840,7 @@ struct nvme_command {
 		struct nvmf_auth_receive_command auth_receive;
 		struct nvme_dbbuf dbbuf;
 		struct nvme_directive_cmd directive;
+		struct nvme_io_mgmt_recv_cmd imr;
 	};
 };
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v7 2/3] block, fs: restore kiocb based write hint processing
       [not found]   ` <CGME20240930182100epcas5p31a010c225f3c76aa4dc54fced32abd2a@epcas5p3.samsung.com>
@ 2024-09-30 18:13     ` Kanchan Joshi
  0 siblings, 0 replies; 44+ messages in thread
From: Kanchan Joshi @ 2024-09-30 18:13 UTC (permalink / raw)
  To: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi, Nitesh Shetty

struct kiocb has a 2-byte hole that developed after commit 41d36a9f3e53
("fs: remove kiocb.ki_hint").
But write hints have made a comeback with commit 449813515d3e ("block, fs:
Restore the per-bio/request data lifetime fields").

This patch uses the leftover space in kiocb to carve out a 1-byte field,
ki_write_hint.
Restore the code that operates on kiocb to use ki_write_hint instead of
the inode hint value.

This does not bring any behavior change, but it is needed to enable per-io
hints (added by another patch).

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
---
 block/fops.c         | 6 +++---
 fs/aio.c             | 1 +
 fs/cachefiles/io.c   | 1 +
 fs/direct-io.c       | 2 +-
 fs/iomap/direct-io.c | 2 +-
 include/linux/fs.h   | 8 ++++++++
 io_uring/rw.c        | 1 +
 7 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index e696ae53bf1e..85b9b97d372c 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -74,7 +74,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 		bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
 	}
 	bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio.bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio.bi_write_hint = iocb->ki_write_hint;
 	bio.bi_ioprio = iocb->ki_ioprio;
 	if (iocb->ki_flags & IOCB_ATOMIC)
 		bio.bi_opf |= REQ_ATOMIC;
@@ -203,7 +203,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 
 	for (;;) {
 		bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-		bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+		bio->bi_write_hint = iocb->ki_write_hint;
 		bio->bi_private = dio;
 		bio->bi_end_io = blkdev_bio_end_io;
 		bio->bi_ioprio = iocb->ki_ioprio;
@@ -319,7 +319,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	dio->flags = 0;
 	dio->iocb = iocb;
 	bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio->bi_write_hint = iocb->ki_write_hint;
 	bio->bi_end_io = blkdev_bio_end_io_async;
 	bio->bi_ioprio = iocb->ki_ioprio;
 
diff --git a/fs/aio.c b/fs/aio.c
index e8920178b50f..db618817e670 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1517,6 +1517,7 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb, int rw_type)
 	req->ki_flags = req->ki_filp->f_iocb_flags | IOCB_AIO_RW;
 	if (iocb->aio_flags & IOCB_FLAG_RESFD)
 		req->ki_flags |= IOCB_EVENTFD;
+	req->ki_write_hint = file_write_hint(req->ki_filp);
 	if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
 		/*
 		 * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c
index 6a821a959b59..c3db102ae64e 100644
--- a/fs/cachefiles/io.c
+++ b/fs/cachefiles/io.c
@@ -309,6 +309,7 @@ int __cachefiles_write(struct cachefiles_object *object,
 	ki->iocb.ki_pos		= start_pos;
 	ki->iocb.ki_flags	= IOCB_DIRECT | IOCB_WRITE;
 	ki->iocb.ki_ioprio	= get_current_ioprio();
+	ki->iocb.ki_write_hint  = file_write_hint(file);
 	ki->object		= object;
 	ki->start		= start_pos;
 	ki->len			= len;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index bbd05f1a2145..73629e26becb 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -409,7 +409,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 		bio->bi_end_io = dio_bio_end_io;
 	if (dio->is_pinned)
 		bio_set_flag(bio, BIO_PAGE_PINNED);
-	bio->bi_write_hint = file_inode(dio->iocb->ki_filp)->i_write_hint;
+	bio->bi_write_hint = dio->iocb->ki_write_hint;
 
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f637aa0706a3..fff43f121ee6 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -397,7 +397,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 					  GFP_KERNEL);
 		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
-		bio->bi_write_hint = inode->i_write_hint;
+		bio->bi_write_hint = dio->iocb->ki_write_hint;
 		bio->bi_ioprio = dio->iocb->ki_ioprio;
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3c603d01337..3dfe6de7b611 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -370,6 +370,7 @@ struct kiocb {
 	void			*private;
 	int			ki_flags;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	enum rw_hint		ki_write_hint;
 	union {
 		/*
 		 * Only used for async buffered reads, where it denotes the
@@ -2337,12 +2338,18 @@ static inline bool HAS_UNMAPPED_ID(struct mnt_idmap *idmap,
 	       !vfsgid_valid(i_gid_into_vfsgid(idmap, inode));
 }
 
+static inline enum rw_hint file_write_hint(struct file *filp)
+{
+	return file_inode(filp)->i_write_hint;
+}
+
 static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 {
 	*kiocb = (struct kiocb) {
 		.ki_filp = filp,
 		.ki_flags = filp->f_iocb_flags,
 		.ki_ioprio = get_current_ioprio(),
+		.ki_write_hint = file_write_hint(filp),
 	};
 }
 
@@ -2353,6 +2360,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 		.ki_filp = filp,
 		.ki_flags = kiocb_src->ki_flags,
 		.ki_ioprio = kiocb_src->ki_ioprio,
+		.ki_write_hint = kiocb_src->ki_write_hint,
 		.ki_pos = kiocb_src->ki_pos,
 	};
 }
diff --git a/io_uring/rw.c b/io_uring/rw.c
index f023ff49c688..510123d3d837 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1023,6 +1023,7 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	if (unlikely(ret))
 		return ret;
 	req->cqe.res = iov_iter_count(&io->iter);
+	rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
 
 	if (force_nonblock) {
 		/* If the file doesn't support async, just async punt */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v7 3/3] io_uring: enable per-io hinting capability
       [not found]   ` <CGME20240930182103epcas5p4c9e91ca3cdf20e900b1425ae45fef81d@epcas5p4.samsung.com>
@ 2024-09-30 18:13     ` Kanchan Joshi
  2024-10-02 14:26       ` Pavel Begunkov
  2024-10-02 18:29       ` Bart Van Assche
  0 siblings, 2 replies; 44+ messages in thread
From: Kanchan Joshi @ 2024-09-30 18:13 UTC (permalink / raw)
  To: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi, Nitesh Shetty

With the F_SET_RW_HINT fcntl, the user can set a hint on the file inode, and
all subsequent writes on the file pass that hint value down.
This can be limiting for large files (and for block devices) as all the
writes can be tagged with only one lifetime hint value.
Concurrent writes (with different hint values) are hard to manage.
Per-IO hinting solves that problem.

Allow userspace to pass additional metadata in the SQE.
The type of passed metadata is expressed by a new field

	__u16 meta_type;

At this point one type, META_TYPE_LIFETIME_HINT, is supported.
With this type, the user can pass lifetime hint values in the new field

	__u64 lifetime_val;

This accepts all lifetime hint values that are possible with
F_SET_RW_HINT fcntl.

The write handlers (io_prep_rw, io_write) send the hint value to the
lower layer using the kiocb. This is good for supporting direct IO,
but not when a kiocb is not available (e.g., buffered IO).

When per-io hints are not passed, the per-inode hint values are set in
the kiocb (as before). Otherwise, the per-io hints take precedence over
the per-inode hints.
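
For illustration only (not part of the patch), a userspace sketch of the
interface, assuming liburing built against the uapi header from this
series, could look like:

	#include <errno.h>
	#include <liburing.h>
	#include <linux/fcntl.h>	/* RWH_WRITE_LIFE_* */

	static int queue_hinted_write(struct io_uring *ring, int fd,
				      const void *buf, unsigned len, off_t off)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		if (!sqe)
			return -EBUSY;
		io_uring_prep_write(sqe, fd, buf, len, off);
		/* Per-io hint; overrides any per-inode F_SET_RW_HINT value */
		sqe->meta_type = META_TYPE_LIFETIME_HINT;
		sqe->lifetime_val = RWH_WRITE_LIFE_SHORT;
		return io_uring_submit(ring);
	}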

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
---
 fs/fcntl.c                    | 22 ----------------------
 include/linux/rw_hint.h       | 24 ++++++++++++++++++++++++
 include/uapi/linux/io_uring.h | 19 +++++++++++++++++++
 io_uring/rw.c                 | 25 ++++++++++++++++++++++++-
 4 files changed, 67 insertions(+), 23 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 22dd9dcce7ec..a390a05f4ef8 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -334,28 +334,6 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
 }
 #endif
 
-static bool rw_hint_valid(u64 hint)
-{
-	BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET);
-	BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE);
-	BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT);
-	BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM);
-	BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG);
-	BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME);
-
-	switch (hint) {
-	case RWH_WRITE_LIFE_NOT_SET:
-	case RWH_WRITE_LIFE_NONE:
-	case RWH_WRITE_LIFE_SHORT:
-	case RWH_WRITE_LIFE_MEDIUM:
-	case RWH_WRITE_LIFE_LONG:
-	case RWH_WRITE_LIFE_EXTREME:
-		return true;
-	default:
-		return false;
-	}
-}
-
 static long fcntl_get_rw_hint(struct file *file, unsigned int cmd,
 			      unsigned long arg)
 {
diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h
index 309ca72f2dfb..f4373a71ffed 100644
--- a/include/linux/rw_hint.h
+++ b/include/linux/rw_hint.h
@@ -21,4 +21,28 @@ enum rw_hint {
 static_assert(sizeof(enum rw_hint) == 1);
 #endif
 
+#define	WRITE_LIFE_INVALID	(RWH_WRITE_LIFE_EXTREME + 1)
+
+static inline bool rw_hint_valid(u64 hint)
+{
+	BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET);
+	BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE);
+	BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT);
+	BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM);
+	BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG);
+	BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME);
+
+	switch (hint) {
+	case RWH_WRITE_LIFE_NOT_SET:
+	case RWH_WRITE_LIFE_NONE:
+	case RWH_WRITE_LIFE_SHORT:
+	case RWH_WRITE_LIFE_MEDIUM:
+	case RWH_WRITE_LIFE_LONG:
+	case RWH_WRITE_LIFE_EXTREME:
+		return true;
+	default:
+		return false;
+	}
+}
+
 #endif /* _LINUX_RW_HINT_H */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 86cb385fe0b5..951e35226229 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -92,12 +92,23 @@ struct io_uring_sqe {
 			__u16	addr_len;
 			__u16	__pad3[1];
 		};
+		struct {
+			/* Bit field to express 16 meta types */
+			__u16	meta_type;
+			__u16	__pad4[1];
+		};
 	};
 	union {
 		struct {
 			__u64	addr3;
 			__u64	__pad2[1];
 		};
+		struct {
+			/* First meta type specific fields */
+			__u64	lifetime_val;
+			/* For future use */
+			__u64	__pad5[1];
+		};
 		__u64	optval;
 		/*
 		 * If the ring is initialized with IORING_SETUP_SQE128, then
@@ -107,6 +118,14 @@ struct io_uring_sqe {
 	};
 };
 
+enum io_uring_sqe_meta_type_bits {
+	META_TYPE_LIFETIME_HINT_BIT
+};
+
+/* this meta type covers write hint values supported by F_SET_RW_HINT fcntl */
+#define META_TYPE_LIFETIME_HINT	(1U << META_TYPE_LIFETIME_HINT_BIT)
+
+
 /*
  * If sqe->file_index is set to this for opcodes that instantiate a new
  * direct descriptor (like openat/openat2/accept), then io_uring will allocate
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 510123d3d837..bf45ee8904a4 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -269,6 +269,24 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		rw->kiocb.ki_ioprio = get_current_ioprio();
 	}
 	rw->kiocb.dio_complete = NULL;
+	if (ddir == ITER_SOURCE) {
+		u16 mtype = READ_ONCE(sqe->meta_type);
+
+		rw->kiocb.ki_write_hint = WRITE_LIFE_INVALID;
+		if (mtype) {
+			u64 lhint = READ_ONCE(sqe->lifetime_val);
+
+			if (READ_ONCE(sqe->__pad4[0]) ||
+			    READ_ONCE(sqe->__pad5[0]))
+				return -EINVAL;
+
+			if (mtype != META_TYPE_LIFETIME_HINT ||
+			    !rw_hint_valid(lhint))
+				return -EINVAL;
+
+			rw->kiocb.ki_write_hint = lhint;
+		}
+	}
 
 	rw->addr = READ_ONCE(sqe->addr);
 	rw->len = READ_ONCE(sqe->len);
@@ -1023,7 +1041,12 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	if (unlikely(ret))
 		return ret;
 	req->cqe.res = iov_iter_count(&io->iter);
-	rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
+	/*
+	 * Use per-file hint only if per-io hint is not set.
+	 * We need per-io hint to get precedence.
+	 */
+	if (rw->kiocb.ki_write_hint == WRITE_LIFE_INVALID)
+		rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
 
 	if (force_nonblock) {
 		/* If the file doesn't support async, just async punt */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-09-30 18:13 ` [PATCH v7 0/3] FDP and per-io hints Kanchan Joshi
                     ` (2 preceding siblings ...)
       [not found]   ` <CGME20240930182103epcas5p4c9e91ca3cdf20e900b1425ae45fef81d@epcas5p4.samsung.com>
@ 2024-10-01  9:20   ` Christoph Hellwig
  2024-10-01 15:58     ` James R. Bergsten
                       ` (2 more replies)
  2024-10-03  0:20   ` Bart Van Assche
  4 siblings, 3 replies; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-01  9:20 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

Any reason you completely ignored my feedback on the last version
and did not even answer?

That's not a very productive way to work.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH v7 0/3] FDP and per-io hints
  2024-10-01  9:20   ` [PATCH v7 0/3] FDP and per-io hints Christoph Hellwig
@ 2024-10-01 15:58     ` James R. Bergsten
  2024-10-01 16:18     ` Jens Axboe
  2024-10-01 16:23     ` Keith Busch
  2 siblings, 0 replies; 44+ messages in thread
From: James R. Bergsten @ 2024-10-01 15:58 UTC (permalink / raw)
  To: 'Christoph Hellwig', 'Kanchan Joshi'
  Cc: axboe, kbusch, hare, sagi, martin.petersen, brauner, viro, jack,
	jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz

Now THIS is the NVMe I came to know and love.  😊



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-01  9:20   ` [PATCH v7 0/3] FDP and per-io hints Christoph Hellwig
  2024-10-01 15:58     ` James R. Bergsten
@ 2024-10-01 16:18     ` Jens Axboe
  2024-10-02  7:51       ` Christoph Hellwig
  2024-10-01 16:23     ` Keith Busch
  2 siblings, 1 reply; 44+ messages in thread
From: Jens Axboe @ 2024-10-01 16:18 UTC (permalink / raw)
  To: Christoph Hellwig, Kanchan Joshi
  Cc: kbusch, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk,
	bcrl, dhowells, bvanassche, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz

On 10/1/24 3:20 AM, Christoph Hellwig wrote:
> Any reason you completely ignored my feedback on the last version
> and did not even answer?

Probably because he got a little tired of dealing with the bullshit
related to this topic.

> That's not a very productive way to work.

Have to say, that neither have your responses been. Can't really fault
people for just giving up at some point, when no productive end seems to
be in sight.

As far as I'm concerned, this looks fine to me. There are customers
wanting to use us, and folks making drives that support it. It's not
right to continually gatekeep this feature just because you don't like
it.

I need to review the io_uring bits, but that should be the least of it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-01  9:20   ` [PATCH v7 0/3] FDP and per-io hints Christoph Hellwig
  2024-10-01 15:58     ` James R. Bergsten
  2024-10-01 16:18     ` Jens Axboe
@ 2024-10-01 16:23     ` Keith Busch
  2024-10-02  7:49       ` Christoph Hellwig
  2 siblings, 1 reply; 44+ messages in thread
From: Keith Busch @ 2024-10-01 16:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On Tue, Oct 01, 2024 at 11:20:48AM +0200, Christoph Hellwig wrote:
> Any reason you completely ignored my feedback on the last version
> and did not even answer?
> 
> That's not a very productive way to work.

I think because he's getting conflicting feedback. The arguments against
it being that it's not a good match, however it's the same match created
for streams, and no one complained then, and it's an existing user ABI.
FDP is the new "streams", so I really don't see a big deal here. I
understand you have use cases that have filesystems intercept the hints
for different types of devices, but I don't see these use cases as
mutually exclusive. Supporting this one doesn't prevent other uses.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-01 16:23     ` Keith Busch
@ 2024-10-02  7:49       ` Christoph Hellwig
  2024-10-02 14:56         ` Keith Busch
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-02  7:49 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Kanchan Joshi, axboe, hare, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Tue, Oct 01, 2024 at 10:23:43AM -0600, Keith Busch wrote:
> I think because he's getting conflicting feedback. The arguments against
> it being that it's not a good match, however it's the same match created
> for streams, and no one complained then, and it's an existing user ABI.

People complained, and the streams code got removed pretty quickly
because it did not work as expected.  I don't think that counts as
a big success.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-01 16:18     ` Jens Axboe
@ 2024-10-02  7:51       ` Christoph Hellwig
  2024-10-02 15:03         ` Jens Axboe
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-02  7:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Kanchan Joshi, kbusch, hare, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Tue, Oct 01, 2024 at 10:18:41AM -0600, Jens Axboe wrote:
> Have to say, that neither have your responses been. Can't really fault
> people for just giving up at some point, when no productive end seems to
> be in sight.

I heavily disagree and take offence on that.

The previous stream separation approach made total sense, but just
needed a fair amount of work.  But it closely matches how things work
at the hardware and file system level, so it was the right approach.

Suddenly dropping that and ignoring all comments really feels like
someone urgently needs to pull a marketing stunt here.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 3/3] io_uring: enable per-io hinting capability
  2024-09-30 18:13     ` [PATCH v7 3/3] io_uring: enable per-io hinting capability Kanchan Joshi
@ 2024-10-02 14:26       ` Pavel Begunkov
  2024-10-02 18:29       ` Bart Van Assche
  1 sibling, 0 replies; 44+ messages in thread
From: Pavel Begunkov @ 2024-10-02 14:26 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Nitesh Shetty

On 9/30/24 19:13, Kanchan Joshi wrote:
> With the F_SET_RW_HINT fcntl, the user can set a hint on the file inode, and
> all subsequent writes on the file pass that hint value down.
> This can be limiting for large files (and for block devices) as all the
> writes can be tagged with only one lifetime hint value.
> Concurrent writes (with different hint values) are hard to manage.
> Per-IO hinting solves that problem.
> 
> Allow userspace to pass additional metadata in the SQE.
> The type of passed metadata is expressed by a new field
> 
> 	__u16 meta_type;

The new layout looks nicer, but let me elaborate on the previous
comment. I don't believe we should be restricting to only one
attribute per IO. What if someone wants to pass a lifetime hint
together with integrity information?

Instead, we might need something more extensible like an ability
to pass a list / array of typed attributes / meta information / hints
etc. An example from networking I gave last time was control messages,
i.e. cmsg. In a basic oversimplified form the API from the user
perspective could look like:

struct meta_attr {
	u16 type;
	u64 data;
};

struct meta_attr attr[] = {{HINT, hint_value}, {INTEGRITY, ptr}};
sqe->meta_attrs = attr;
sqe->meta_nr = 2;

I'm pretty sure there will be a bunch of considerations people
will name that the API should account for, like sizing the struct
so it can fit integrity bits, or even making it variable sized.


> At this point one type, META_TYPE_LIFETIME_HINT, is supported.
> With this type, the user can pass lifetime hint values in the new field
> 
> 	__u64 lifetime_val;
> 
> This accepts all lifetime hint values that are possible with
> F_SET_RW_HINT fcntl.
> 
> The write handlers (io_prep_rw, io_write) send the hint value to the
> lower layer using the kiocb. This is good for supporting direct IO,
> but not when a kiocb is not available (e.g., buffered IO).
> 
> When per-io hints are not passed, the per-inode hint values are set in
> the kiocb (as before). Otherwise, the per-io hints take precedence over
> the per-inode hints.
-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02  7:49       ` Christoph Hellwig
@ 2024-10-02 14:56         ` Keith Busch
  2024-10-02 15:00           ` Jens Axboe
  0 siblings, 1 reply; 44+ messages in thread
From: Keith Busch @ 2024-10-02 14:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 09:49:26AM +0200, Christoph Hellwig wrote:
> On Tue, Oct 01, 2024 at 10:23:43AM -0600, Keith Busch wrote:
> > I think because he's getting conflicting feedback. The arguments against
> > it being that it's not a good match, however it's the same match created
> > for streams, and no one complained then, and it's an existing user ABI.
> 
> People complained, and the streams code got removed pretty quickly
> because it did not work as expected.  I don't think that counts as
> a big success.

I don't think the kernel API was the problem. Capable devices never
materialized, so the code wasn't doing anything useful.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 14:56         ` Keith Busch
@ 2024-10-02 15:00           ` Jens Axboe
  0 siblings, 0 replies; 44+ messages in thread
From: Jens Axboe @ 2024-10-02 15:00 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: Kanchan Joshi, hare, sagi, martin.petersen, brauner, viro, jack,
	jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz

On 10/2/24 8:56 AM, Keith Busch wrote:
> On Wed, Oct 02, 2024 at 09:49:26AM +0200, Christoph Hellwig wrote:
>> On Tue, Oct 01, 2024 at 10:23:43AM -0600, Keith Busch wrote:
>>> I think because he's getting conflicting feedback. The arguments against
>>> it being that it's not a good match, however it's the same match created
>>> for streams, and no one complained then, and it's an existing user ABI.
>>
>> People complained, and the streams code got removed pretty quickly
>> because it did not work as expected.  I don't think that counts as
>> a big success.
> 
> I don't think the kernel API was the problem. Capable devices never
> materialized, so the code wasn't doing anything useful.

Exactly. I never saw ones that kept the stream persistent across GC,
and hence they ended up being pretty useless in practice.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02  7:51       ` Christoph Hellwig
@ 2024-10-02 15:03         ` Jens Axboe
  2024-10-02 15:13           ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Jens Axboe @ 2024-10-02 15:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, kbusch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On 10/2/24 1:51 AM, Christoph Hellwig wrote:
> On Tue, Oct 01, 2024 at 10:18:41AM -0600, Jens Axboe wrote:
>> Have to say, that neither have your responses been. Can't really fault
>> people for just giving up at some point, when no productive end seems to
>> be in sight.
> 
> I heavily disagree and take offence on that.
> 
> The previous stream separation approach made total sense, but just
> needed a fair amount of work.  But it closely matches how things work
> at the hardware and file system level, so it was the right approach.

What am I missing that makes this effort that different from streams?
Both are a lifetime hint.

> Suddenly dropping that and ignoring all comments really feels like
> someone urgently needs to pull a marketing stunt here.

I think someone is just trying to finally get this feature in, so it
can get used by customers, after many months of fairly unproductive
discussion.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:03         ` Jens Axboe
@ 2024-10-02 15:13           ` Christoph Hellwig
  2024-10-02 15:17             ` Keith Busch
  2024-10-02 15:22             ` Jens Axboe
  0 siblings, 2 replies; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-02 15:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Kanchan Joshi, kbusch, hare, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 09:03:22AM -0600, Jens Axboe wrote:
> > The previous stream separation approach made total sense, but just
> > needed a fair amount of work.  But it closely matches how things work
> > at the hardware and file system level, so it was the right approach.
> 
> What am I missing that makes this effort that different from streams?
> Both are a lifetime hint.

A stream has a lot less strings attached.  But hey, don't make me
argue for streams - you pushed the API in despite reservations back
then, and we've learned a lot since.

> > Suddenly dropping that and ignoring all comments really feels like
> > someone urgently needs to pull a marketing stunt here.
> 
> I think someone is just trying to finally get this feature in, so it
> can get used by customers, after many months of fairly unproductive
> discussion.

Well, he finally started on the right approach and gave it up after the
first round of feedback.  That's just a bit weird.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:13           ` Christoph Hellwig
@ 2024-10-02 15:17             ` Keith Busch
  2024-10-02 15:19               ` Christoph Hellwig
  2024-10-02 15:22             ` Jens Axboe
  1 sibling, 1 reply; 44+ messages in thread
From: Keith Busch @ 2024-10-02 15:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Kanchan Joshi, hare, sagi, martin.petersen, brauner,
	viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 05:13:44PM +0200, Christoph Hellwig wrote:
> 
> Well, he finally started on the right approach and gave it up after the
> first round of feedback.  That's just a bit weird.

Nothing prevents future improvements in that direction. It just seems
out of scope for what Kanchan is trying to enable for his customer use
cases. This patch looks harmless.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:17             ` Keith Busch
@ 2024-10-02 15:19               ` Christoph Hellwig
  2024-10-02 15:33                 ` Keith Busch
  2024-10-02 15:47                 ` Martin K. Petersen
  0 siblings, 2 replies; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-02 15:19 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Kanchan Joshi, hare, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 09:17:35AM -0600, Keith Busch wrote:
> On Wed, Oct 02, 2024 at 05:13:44PM +0200, Christoph Hellwig wrote:
> > 
> > Well, he finally started on the right approach and gave it up after the
> > first round of feedback.  That's just a bit weird.
> 
> Nothing prevents future improvements in that direction. It just seems
> out of scope for what Kanchan is trying to enable for his customer use
> cases. This patch looks harmless.

It's not really.  Once we wire it up like this we mess up the ability
to use the feature in other ways.  Additionally the per-I/O hints are
simply broken if you want a file system in the way as last time,
and if I don't misremember you acknowledged in person last week.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:13           ` Christoph Hellwig
  2024-10-02 15:17             ` Keith Busch
@ 2024-10-02 15:22             ` Jens Axboe
  1 sibling, 0 replies; 44+ messages in thread
From: Jens Axboe @ 2024-10-02 15:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, kbusch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On 10/2/24 9:13 AM, Christoph Hellwig wrote:
> On Wed, Oct 02, 2024 at 09:03:22AM -0600, Jens Axboe wrote:
>>> The previous stream separation approach made total sense, but just
>>> needed a fair amount of work.  But it closely matches how things work
>>> at the hardware and file system level, so it was the right approach.
>>
>> What am I missing that makes this effort that different from streams?
>> Both are a lifetime hint.
> 
> A stream has a lot less strings attached.  But hey, don't make me
> argue for streams - you pushed the API in despite reservations back
> then, and we've learned a lot since.

Hah, I remember arguing for it back then! It was just a pain to use, but
thankfully we could kill it when devices didn't materialize with a
quality of implementation that made them useful. And then we just yanked
it. Nothing is forever, and hints in particular are something that can
always be ignored. Same for this.

>>> Suddenly dropping that and ignoring all comments really feels like
>>> someone urgently needs to pull a marketing stunt here.
>>
>> I think someone is just trying to finally get this feature in, so it
>> can get used by customers, after many months of fairly unproductive
>> discussion.
> 
> Well, he finally started on the right approach and gave it up after the
> first round of feedback.  That's just a bit weird.

I'm a big fan of keep-it-simple-stupid, and then you can improve it down
the line. I think the simple approach stands just fine alone.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:19               ` Christoph Hellwig
@ 2024-10-02 15:33                 ` Keith Busch
  2024-10-03 12:51                   ` Christoph Hellwig
  2024-10-02 15:47                 ` Martin K. Petersen
  1 sibling, 1 reply; 44+ messages in thread
From: Keith Busch @ 2024-10-02 15:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Kanchan Joshi, hare, sagi, martin.petersen, brauner,
	viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 05:19:49PM +0200, Christoph Hellwig wrote:
> > Nothing prevents future improvements in that direction. It just seems
> > out of scope for what Kanchan is trying to enable for his customer use
> > cases. This patch looks harmless.
> 
> It's not really.  Once we wire it up like this we mess up the ability
> to use the feature in other ways.  Additionally the per-I/O hints are
> simply broken if you want a file system in the way as last time,
> and if I don't misremember you acknowledged in person last week.

You remember right. I also explained that the use cases for per-io hints
are to replace current passthrough users with generic read/write paths that
have all the memory guard rails, read/write stats, and other features
that you don't get with passthrough. This just lets you accomplish the
same placement hints with the nvme raw block device that you get with
the nvme char device.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:19               ` Christoph Hellwig
  2024-10-02 15:33                 ` Keith Busch
@ 2024-10-02 15:47                 ` Martin K. Petersen
  2024-10-02 18:34                   ` Bart Van Assche
  2024-10-03 12:54                   ` Christoph Hellwig
  1 sibling, 2 replies; 44+ messages in thread
From: Martin K. Petersen @ 2024-10-02 15:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz


Christoph,

>> Nothing prevents future improvements in that direction. It just seems
>> out of scope for what Kanchan is trying to enable for his customer use
>> cases. This patch looks harmless.
>
> It's not really.  Once we wire it up like this we mess up the ability
> to use the feature in other ways.  Additionally the per-I/O hints are
> simply broken if you want a file system

Here is my take:

It is the kernel's job to manage the system's hardware resources and
arbitrate and share these resources optimally and fairly between all
running applications.

What irks me is defining application interfaces which fundamentally tell
the kernel that "these blocks are part of the same file".

The kernel already knows this. It is the very entity which provides that
abstraction. Why do we need an explicit interface to inform the kernel
that concurrent writes to the same file should have the same
"temperature" or need to go to the same "bin" on the storage device?
Shouldn't that just happen automatically?

Whether it's SCSI groups, streams, UFS hints, or NVMe FDP, it seems like
we are consistently failing to deliver something that actually works for
anything but a few specialized corner cases. I think that is a shame.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 3/3] io_uring: enable per-io hinting capability
  2024-09-30 18:13     ` [PATCH v7 3/3] io_uring: enable per-io hinting capability Kanchan Joshi
  2024-10-02 14:26       ` Pavel Begunkov
@ 2024-10-02 18:29       ` Bart Van Assche
  1 sibling, 0 replies; 44+ messages in thread
From: Bart Van Assche @ 2024-10-02 18:29 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Nitesh Shetty

On 9/30/24 11:13 AM, Kanchan Joshi wrote:
> diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h
> index 309ca72f2dfb..f4373a71ffed 100644
> --- a/include/linux/rw_hint.h
> +++ b/include/linux/rw_hint.h
> @@ -21,4 +21,28 @@ enum rw_hint {
>   static_assert(sizeof(enum rw_hint) == 1);
>   #endif
>   
> +#define	WRITE_LIFE_INVALID	(RWH_WRITE_LIFE_EXTREME + 1)
> +
> +static inline bool rw_hint_valid(u64 hint)
> +{
> +	BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET);
> +	BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE);
> +	BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT);
> +	BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM);
> +	BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG);
> +	BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME);
> +
> +	switch (hint) {
> +	case RWH_WRITE_LIFE_NOT_SET:
> +	case RWH_WRITE_LIFE_NONE:
> +	case RWH_WRITE_LIFE_SHORT:
> +	case RWH_WRITE_LIFE_MEDIUM:
> +	case RWH_WRITE_LIFE_LONG:
> +	case RWH_WRITE_LIFE_EXTREME:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}

Moving a function that is not in the hot path from a .c file to a .h
file is wrong. This increases the kernel size, slows down compilation
and makes implementation details visible to callers that should not have
access to these implementation details.

Bart.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:47                 ` Martin K. Petersen
@ 2024-10-02 18:34                   ` Bart Van Assche
  2024-10-03 12:55                     ` Christoph Hellwig
  2024-10-03 12:54                   ` Christoph Hellwig
  1 sibling, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2024-10-02 18:34 UTC (permalink / raw)
  To: Martin K. Petersen, Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz

On 10/2/24 8:47 AM, Martin K. Petersen wrote:
> What irks me is defining application interfaces which fundamentally tell
> the kernel that "these blocks are part of the same file".

Isn't FDP about communicating much more than only this information to
the block device, e.g. information about reclaim units? Although I'm
personally not interested in FDP, my colleagues were involved in the
standardization of FDP.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 1/3] nvme: enable FDP support
  2024-09-30 18:13     ` [PATCH v7 1/3] nvme: enable FDP support Kanchan Joshi
@ 2024-10-02 18:37       ` Bart Van Assche
  2024-10-03 12:55         ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2024-10-02 18:37 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Hui Qi, Nitesh Shetty

On 9/30/24 11:13 AM, Kanchan Joshi wrote:
> Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
> to control the placement of logical blocks so as to reduce the SSD WAF.
> 
> Userspace can send the data lifetime information using the write hints.
> The SCSI driver (sd) can already pass this information to the SCSI
> devices. This patch does the same for NVMe.
> 
> Fetch the placement-identifiers if the device supports FDP.
> The incoming write-hint is mapped to a placement-identifier, which in
> turn is set in the DSPEC field of the write command.

Is the description of this patch correct? The above description suggests
that FDP information is similar in nature to SCSI data lifetime
information. I don't think that's correct.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-09-30 18:13 ` [PATCH v7 0/3] FDP and per-io hints Kanchan Joshi
                     ` (3 preceding siblings ...)
  2024-10-01  9:20   ` [PATCH v7 0/3] FDP and per-io hints Christoph Hellwig
@ 2024-10-03  0:20   ` Bart Van Assche
  4 siblings, 0 replies; 44+ messages in thread
From: Bart Van Assche @ 2024-10-03  0:20 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On 9/30/24 11:13 AM, Kanchan Joshi wrote:
> Another spin to incorporate the feedback from LPC and previous
> iterations. The series adds two capabilities:
> - FDP support at NVMe level (patch #1)
> - Per-io hinting via io_uring (patch #3)
> Patch #2 is needed to do per-io hints.
> 
> The motivation and interface details are present in the commit
> descriptions.

Something is missing from the patch descriptions. What is the goal
of this patch series? Is the goal to support FDP in filesystems in
the kernel, is the goal to support filesystems implemented in user
space or perhaps something else? I think this should be mentioned in
the cover letter.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:33                 ` Keith Busch
@ 2024-10-03 12:51                   ` Christoph Hellwig
  0 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:51 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Kanchan Joshi, hare, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 09:33:16AM -0600, Keith Busch wrote:
> You remember right. I also explained that the use cases for per-io hints
> are to replace current passthrough users with generic read/write paths that
> have all the memory guard rails, read/write stats, and other features
> that you don't get with passthrough. This just lets you accomplish the
> same placement hints with the nvme raw block device that you get with
> the nvme char device.

Yes, and as I said before we need an opt-in from the file_ops, because
file systems can't support this.  And I'm speaking from a position
of working on a file system that made the existing non-ideal hints
work (in all fairness Hans did the hard work, but I was part of it).

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:47                 ` Martin K. Petersen
  2024-10-02 18:34                   ` Bart Van Assche
@ 2024-10-03 12:54                   ` Christoph Hellwig
  2024-10-03 22:14                     ` Jens Axboe
  1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:54 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Kanchan Joshi, hare,
	sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 11:47:32AM -0400, Martin K. Petersen wrote:
> It is the kernel's job to manage the system's hardware resources and
> arbitrate and share these resources optimally and fairly between all
> running applications.

Exactly.

> What irks me is defining application interfaces which fundamentally tell
> the kernel that "these blocks are part of the same file".
> 
> The kernel already knows this. It is the very entity which provides that
> abstraction. Why do we need an explicit interface to inform the kernel
> that concurrent writes to the same file should have the same
> "temperature" or need to go to the same "bin" on the storage device?
> Shouldn't that just happen automatically?

For a file: yes.  The problem is when you have more files than buckets on
the device or file systems.  Typical enterprise SSDs support somewhere
between 8 and 16 write streams, and there typically is more data than
that.  So trying to group them somehow is a good idea, as not all files can
have their own bucket.

Allowing this inside a file like done in this patch set on the other
hand is pretty crazy.

> Whether it's SCSI groups, streams, UFS hints, or NVMe FDP, it seems like
> we are consistently failing to deliver something that actually works for
> anything but a few specialized corner cases. I think that is a shame.

Yes, it is.  And as someone who has been sitting in this group, that is
because it's always someone in a place of power forcing down their version
because they got a promotion, a best-of-show award or whatever.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 18:34                   ` Bart Van Assche
@ 2024-10-03 12:55                     ` Christoph Hellwig
  2024-10-03 21:48                       ` Keith Busch
  2024-10-04  6:21                       ` Javier González
  0 siblings, 2 replies; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:55 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Martin K. Petersen, Christoph Hellwig, Keith Busch, Jens Axboe,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 11:34:47AM -0700, Bart Van Assche wrote:
> Isn't FDP about communicating much more than only this information to
> the block device, e.g. information about reclaim units? Although I'm
> personally not interested in FDP, my colleagues were involved in the
> standardization of FDP.

Yes, it is.  And when I explained how properly exporting this kind of
information can be implemented on top of the version Kanchan sent, everyone
suddenly stopped discussing technical points and went either silent or
all political.

So I think some people's bonuses depend on not understanding the problem,
I fear :(


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 1/3] nvme: enable FDP support
  2024-10-02 18:37       ` Bart Van Assche
@ 2024-10-03 12:55         ` Christoph Hellwig
  0 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:55 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Kanchan Joshi, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Hui Qi, Nitesh Shetty

On Wed, Oct 02, 2024 at 11:37:11AM -0700, Bart Van Assche wrote:
> Is the description of this patch correct? The above description suggests
> that FDP information is similar in nature to SCSI data lifetime
> information. I don't think that's correct.

As you correctly point out, it is entirely incorrect.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 12:55                     ` Christoph Hellwig
@ 2024-10-03 21:48                       ` Keith Busch
  2024-10-03 22:00                         ` Bart Van Assche
  2024-10-04  6:21                       ` Javier González
  1 sibling, 1 reply; 44+ messages in thread
From: Keith Busch @ 2024-10-03 21:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Martin K. Petersen, Jens Axboe, Kanchan Joshi,
	hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g, javier.gonz

On Thu, Oct 03, 2024 at 02:55:16PM +0200, Christoph Hellwig wrote:
> On Wed, Oct 02, 2024 at 11:34:47AM -0700, Bart Van Assche wrote:
> > Isn't FDP about communicating much more than only this information to
> > the block device, e.g. information about reclaim units? Although I'm
> > personally not interested in FDP, my colleagues were involved in the
> > standardization of FDP.
> 
> Yes, it is.  And when I explained how properly exporting this kind of
> information can be implemented on top of the version Kanchan sent, everyone
> suddenly stopped discussing technical points and went either silent or
> all political.

The nominal values can mean whatever you want. If you want them to mean
"temperature", then that's what they mean. If you want them to mean
something else, then don't use this.

These are hints at the end of the day. Nothing locks the kernel into
this if a better solution develops. As you know, it was ripped out
before.

> So I think some people's bonuses depend on not understanding the problem,
> I fear :(

The only "bonus" I have is not having to repeatedly explain why people
can't use h/w features the way they want.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 21:48                       ` Keith Busch
@ 2024-10-03 22:00                         ` Bart Van Assche
  2024-10-03 22:12                           ` Jens Axboe
  2024-10-03 22:17                           ` Keith Busch
  0 siblings, 2 replies; 44+ messages in thread
From: Bart Van Assche @ 2024-10-03 22:00 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: Martin K. Petersen, Jens Axboe, Kanchan Joshi, hare, sagi,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On 10/3/24 2:48 PM, Keith Busch wrote:
> The only "bonus" I have is not having to repeatedly explain why people
> can't use h/w features the way they want.

Hi Keith,

Although that's a fair argument, what are the use cases for this patch
series? Filesystems in the kernel? Filesystems implemented in user
space? Perhaps something else?

This patch series adds a new user space interface for passing hints
to storage devices (in io_uring). As we all know, such interfaces are
hard to remove once they have been added.

We don't need new user space interfaces to support FDP for filesystems
in the kernel.

For filesystems implemented in user space, would using NVMe pass-through
be a viable approach? With this approach, no new user space interfaces
have to be added.

I'm wondering how to unblock FDP users without adding a new
controversial mechanism in the kernel.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 22:00                         ` Bart Van Assche
@ 2024-10-03 22:12                           ` Jens Axboe
  2024-10-03 22:17                           ` Keith Busch
  1 sibling, 0 replies; 44+ messages in thread
From: Jens Axboe @ 2024-10-03 22:12 UTC (permalink / raw)
  To: Bart Van Assche, Keith Busch, Christoph Hellwig
  Cc: Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz

On 10/3/24 4:00 PM, Bart Van Assche wrote:
> On 10/3/24 2:48 PM, Keith Busch wrote:
>> The only "bonus" I have is not having to repeatedly explain why people
>> can't use h/w features the way they want.
> 
> Hi Keith,
> 
> Although that's a fair argument, what are the use cases for this patch
> series? Filesystems in the kernel? Filesystems implemented in user
> space? Perhaps something else?
> 
> This patch series adds a new user space interface for passing hints
> to storage devices (in io_uring). As we all know, such interfaces are
> hard to remove once they have been added.

It's a _hint_; I'm not sure how many times that has to be stated. The
kernel is free to ignore it, and in the future, it may very well do
that. We already had fcntl interfaces for streams, in the same vein, and
we yanked those in the sense that they ended up doing _nothing_. Did
things break because of that? Of course not. This is no different.
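
For reference, the fcntl side of this still exists as uapi; a minimal sketch
of how an application sets a per-inode hint with it (assuming a libc that
exposes F_SET_RW_HINT; older ones may need the constants from <linux/fcntl.h>):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>

	/*
	 * Set a per-inode write lifetime hint; subsequent writes through this
	 * inode inherit it.  The per-io interface discussed in this thread
	 * would complement this, not replace it.
	 */
	static int set_inode_hint(int fd)
	{
		uint64_t hint = RWH_WRITE_LIFE_SHORT;

		if (fcntl(fd, F_SET_RW_HINT, &hint) < 0) {
			perror("F_SET_RW_HINT");
			return -1;
		}
		return 0;
	}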

> We don't need new user space interfaces to support FDP for filesystems
> in the kernel.
> 
> For filesystems implemented in user space, would using NVMe pass-through
> be a viable approach? With this approach, no new user space interfaces
> have to be added.
> 
> I'm wondering how to unblock FDP users without adding a new
> controversial mechanism in the kernel.

pass-through already works, obviously - this is more about adding a
viable real interface for it. If it's a feature that is in devices AND
that customers want to use, then pointing them at pass-through is a cop-out.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 12:54                   ` Christoph Hellwig
@ 2024-10-03 22:14                     ` Jens Axboe
  2024-10-04  5:31                       ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Jens Axboe @ 2024-10-03 22:14 UTC (permalink / raw)
  To: Christoph Hellwig, Martin K. Petersen
  Cc: Keith Busch, Kanchan Joshi, hare, sagi, brauner, viro, jack,
	jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz

On 10/3/24 6:54 AM, Christoph Hellwig wrote:
> For a file: yes.  The problem is when you have more files than buckets on
> the device or file system.  Typical enterprise SSDs support somewhere
> between 8 and 16 write streams, and there typically is more data than
> that.  So trying to group it somehow is a good idea, as not all files can
> have their own bucket.
> 
> Allowing this inside a file, as done in this patch set, on the other
> hand is pretty crazy.

I do agree that per-file hints are not ideal. In the spirit of making
some progress, how about we just retain per-io hints initially? We can
certainly make that work over dio. Yes buffered IO won't work initially,
but at least we're getting somewhere.
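
To make "per-io over dio" concrete from the application side, here is a rough
liburing-based sketch: the liburing calls below are the standard ones, but the
way the hint is attached to the SQE (the commented-out line) is only a
placeholder -- the actual SQE encoding is whatever this series defines, not
something in released liburing:

	#include <fcntl.h>
	#include <liburing.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		static char buf[4096] __attribute__((aligned(4096)));
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;
		int fd;

		/* raw nvme block device, opened for direct I/O */
		fd = open("/dev/nvme0n1", O_WRONLY | O_DIRECT);
		if (fd < 0)
			return 1;
		if (io_uring_queue_init(8, &ring, 0) < 0)
			return 1;

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
		/*
		 * Hypothetical: attach a lifetime/placement hint to this one
		 * write.  The field name is illustrative only:
		 *
		 *	sqe->write_hint = 3;
		 */
		if (io_uring_submit(&ring) < 0)
			return 1;
		if (io_uring_wait_cqe(&ring, &cqe) == 0)
			io_uring_cqe_seen(&ring, cqe);
		io_uring_queue_exit(&ring);
		close(fd);
		return 0;
	}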

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 22:00                         ` Bart Van Assche
  2024-10-03 22:12                           ` Jens Axboe
@ 2024-10-03 22:17                           ` Keith Busch
  1 sibling, 0 replies; 44+ messages in thread
From: Keith Busch @ 2024-10-03 22:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Martin K. Petersen, Jens Axboe, Kanchan Joshi,
	hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g, javier.gonz

On Thu, Oct 03, 2024 at 03:00:21PM -0700, Bart Van Assche wrote:
> On 10/3/24 2:48 PM, Keith Busch wrote:
> > The only "bonus" I have is not having to repeatedly explain why people
> > can't use h/w features the way they want.
> 
> Hi Keith,
> 
> Although that's a fair argument, what are the use cases for this patch
> series? Filesystems in the kernel? Filesystems implemented in user
> space? Perhaps something else?
> 
> This patch series adds a new user space interface for passing hints
> to storage devices (in io_uring). As we all know, such interfaces are
> hard to remove once they have been added.
> 
> We don't need new user space interfaces to support FDP for filesystems
> in the kernel.
> 
> For filesystems implemented in user space, would using NVMe pass-through
> be a viable approach? With this approach, no new user space interfaces
> have to be added.
> 
> I'm wondering how to unblock FDP users without adding a new
> controversial mechanism in the kernel.

I really like the passthrough interface. This is flexible and enables
experimenting with all sorts of features.

But it exposes exploits and pitfalls. Just check the CVEs against it
(e.g. CVE-2023-6238; note: the same problem exists without metadata). The
consequences of misuse include memory corruption and private memory
leakage. The barriers for exploits are too low, and the potential for
mistakes is too high. A possible h/w solution requires an IOMMU, but
that's a non-starter for many users and only solves the private memory
vulnerabilities.

Passthrough also skips all the block layer features. I just today added
statistics to it staged for 6.13, but even that requires yet another
config step to enable it.

tl;dr: no one wants to use nvme passthrough in production long term.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 22:14                     ` Jens Axboe
@ 2024-10-04  5:31                       ` Christoph Hellwig
  2024-10-04  6:18                         ` Javier González
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-04  5:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Martin K. Petersen, Keith Busch, Kanchan Joshi,
	hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Thu, Oct 03, 2024 at 04:14:57PM -0600, Jens Axboe wrote:
> On 10/3/24 6:54 AM, Christoph Hellwig wrote:
> > For a file: yes.  The problem is when you have more files than buckets on
> > the device or file system.  Typical enterprise SSDs support somewhere
> > between 8 and 16 write streams, and there typically is more data than
> > that.  So trying to group it somehow is a good idea, as not all files can
> > have their own bucket.
> > 
> > Allowing this inside a file, as done in this patch set, on the other
> > hand is pretty crazy.
> 
> I do agree that per-file hints are not ideal. In the spirit of making
> some progress, how about we just retain per-io hints initially? We can
> certainly make that work over dio. Yes buffered IO won't work initially,
> but at least we're getting somewhere.

Huh?  Per I/O hints at the syscall level are the problem (see also the
reply from Martin).  Per file makes total sense, but we need the file
system in control.

The real problem is further down the stack.  For the SCSI temperature
hints just passing them on makes sense.  But when you map to some kind
of stream separation in the device, no matter if that is streams, FDP,
or various kinds of streams we don't even support in things like CF
and SD cards, the driver is not the right place to map temperature hints
to streams.  That requires some kind of intelligence.  It could be
dirt simple and just do a best-effort mapping of the temperature
hints 1:1 to separate write streams, or do a little mapping if there
are not enough of them, which should work fine for a raw block device.
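
(For illustration only, the "dirt simple" variant could look roughly like the
helper below; the name, the numbering and where it would live are all invented
here, not taken from any driver:

	/*
	 * Illustrative sketch: collapse the lifetime hint values (0 = not set,
	 * 1..5 = RWH_WRITE_LIFE_NONE through _EXTREME) onto however many write
	 * streams the device exposes, with stream 0 as the default.
	 */
	static unsigned int hint_to_stream(unsigned int hint, unsigned int nr_streams)
	{
		if (!hint || !nr_streams)
			return 0;			/* unhinted I/O: default stream */
		if (hint <= nr_streams)
			return hint;			/* enough streams: map 1:1 */
		return 1 + (hint - 1) % nr_streams;	/* otherwise fold the range */
	}

Anything smarter than this belongs above the driver, which is the point of the
rest of this mail.)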

But once we have a file system things get more complicated:

 - the file system will want its own streams for metadata and GC
 - even with that, on beefy enough hardware you can have more streams
   than temperature levels, and the file system can and should
   do intelligent placement (usually based on files)

Or to summarize:  the per-file temperature hints make sense as a user
interface.  Per-I/O hints tend to be really messy at least if a file
system is involved.  Mapping the temperatures to separate write streams
in the driver does not scale even to the most trivial write stream
aware file system implementations.

And for anyone who followed the previous discussions of the patches,
none of this should be new; each point has been made at least three
times before.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  5:31                       ` Christoph Hellwig
@ 2024-10-04  6:18                         ` Javier González
  2024-10-04  6:27                           ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Javier González @ 2024-10-04  6:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, hare,
	sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g

On 04.10.2024 07:31, Christoph Hellwig wrote:
>On Thu, Oct 03, 2024 at 04:14:57PM -0600, Jens Axboe wrote:
>> On 10/3/24 6:54 AM, Christoph Hellwig wrote:
>> > For a file: yes.  The problem is when you have more files than buckets on
>> > the device or file system.  Typical enterprise SSDs support somewhere
>> > between 8 and 16 write streams, and there typically is more data than
>> > that.  So trying to group it somehow is a good idea, as not all files can
>> > have their own bucket.
>> >
>> > Allowing this inside a file, as done in this patch set, on the other
>> > hand is pretty crazy.
>>
>> I do agree that per-file hints are not ideal. In the spirit of making
>> some progress, how about we just retain per-io hints initially? We can
>> certainly make that work over dio. Yes buffered IO won't work initially,
>> but at least we're getting somewhere.
>
>Huh?  Per I/O hints at the syscall level are the problem (see also the
>reply from Martin).  Per file makes total sense, but we need the file
>system in control.
>
>The real problem is further down the stack.  For the SCSI temperature
>hints just passing them on makes sense.  But when you map to some kind
>of stream separation in the device, no matter if that is streams, FDP,
>or various kinds of streams we don't even support in things like CF
>and SD cards, the driver is not the right place to map temperature hints
>to streams.  That requires some kind of intelligence.  It could be
>dirt simple and just do a best-effort mapping of the temperature
>hints 1:1 to separate write streams, or do a little mapping if there
>are not enough of them, which should work fine for a raw block device.
>
>But once we have a file system things get more complicated:
>
> - the file system will want its own streams for metadata and GC
> - even with that, on beefy enough hardware you can have more streams
>   than temperature levels, and the file system can and should
>   do intelligent placement (usually based on files)
>
>Or to summarize:  the per-file temperature hints make sense as a user
>interface.  Per-I/O hints tend to be really messy at least if a file
>system is involved.  Mapping the temperatures to separate write streams
>in the driver does not scale even to the most trivial write stream
>aware file system implementations.
>
>And for anyone who followed the previous discussions of the patches,
>none of this should be new; each point has been made at least three
>times before.

Looking at the work you and Hans have been doing on XFS, it seems you
have been successful at mapping the semantics of the temperatures to
zones (which have no semantics, just like FDP).

    What is the difference between the mapping for zones and for FDP?

The whole point is using an existing interface to cover the use-case of
people wanting hints in block. If you have a use-case for enabling hints
on files, let's work on this afterwards.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 12:55                     ` Christoph Hellwig
  2024-10-03 21:48                       ` Keith Busch
@ 2024-10-04  6:21                       ` Javier González
  2024-10-04  6:24                         ` Christoph Hellwig
  1 sibling, 1 reply; 44+ messages in thread
From: Javier González @ 2024-10-04  6:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Martin K. Petersen, Keith Busch, Jens Axboe,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g

On 03.10.2024 14:55, Christoph Hellwig wrote:
>On Wed, Oct 02, 2024 at 11:34:47AM -0700, Bart Van Assche wrote:
>> Isn't FDP about communicating much more than only this information to
>> the block device, e.g. information about reclaim units? Although I'm
>> personally not interested in FDP, my colleagues were involved in the
>> standardization of FDP.
>
>Yes, it is.  And when I explained how properly exporting this kind of
>information can be implemented on top of the version Kanchan sent, everyone
>suddenly stopped discussing technical points and went either silent or
>all political.
>
>So I think some people's bonuses depend on not understanding the problem,
>I fear :(
>

Please, don't.

Childish comments like this delegitimize the work that a lot of people
are doing in Linux.

We all operate under the assumption that folks here know how to wear two
hats, and that there are no incentives so specific that they would push
_or_ block upstream contributions without a use-case and technical
background.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  6:21                       ` Javier González
@ 2024-10-04  6:24                         ` Christoph Hellwig
  2024-10-04  6:59                           ` Javier González
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-04  6:24 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Bart Van Assche, Martin K. Petersen,
	Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g

On Fri, Oct 04, 2024 at 08:21:29AM +0200, Javier González wrote:
>> So I think some people's bonuses depend on not understanding the problem,
>> I fear :(
>>
>
> Please, don't.
>
> Childish comments like this delegitimize the work that a lot of people
> are doing in Linux.

It's the only very sarcastic explanation I can come up with to explain
this discussion, where people from exactly the two companies where this
might be bonus material love political discussion and drop dead when it
turns technical.

I'm probably wrong, but I've run out of other explanations.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  6:18                         ` Javier González
@ 2024-10-04  6:27                           ` Christoph Hellwig
  2024-10-04  6:52                             ` Javier González
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-04  6:27 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel,
	io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Fri, Oct 04, 2024 at 08:18:11AM +0200, Javier González wrote:
>> And for anyone who followed the previous discussions of the patches,
>> none of this should be new; each point has been made at least three
>> times before.
>
> Looking at the work you and Hans have been doing on XFS, it seems you
> have been successful at mapping the semantics of the temperatures to
> zones (which have no semantics, just like FDP).
>
>    What is the difference between the mapping for zones and for FDP?

Probably not much, except for all the pitfalls in the FDP not-quite-hint,
not-quite-mandatory design.

> The whole point is using an existing interface to cover the use-case of
> people wanting hints in block.

And that's fine.  And the point, all the way back for months, is that
doing a completely dumb mapping in the driver for that is fundamentally
wrong.  Kanchan's previous series did about 50% of the work to get
rid of it, but then everyone dropped dead and played politics.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  6:27                           ` Christoph Hellwig
@ 2024-10-04  6:52                             ` Javier González
  2024-10-04 12:30                               ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Javier González @ 2024-10-04  6:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, hare,
	sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g

On 04.10.2024 08:27, Christoph Hellwig wrote:
>On Fri, Oct 04, 2024 at 08:18:11AM +0200, Javier González wrote:
>>> And for anyone who followed the previous discussions of the patches,
>>> none of this should be new; each point has been made at least three
>>> times before.
>>
>> Looking at the work you and Hans have been doing on XFS, it seems you
>> have been successful at mapping the semantics of the temperatures to
>> zones (which have no semantics, just like FDP).
>>
>>    What is the difference between the mapping for zones and for FDP?
>
>Probably not much, except for all the pitfalls in the FDP not-quite-hint,
>not-quite-mandatory design.

It is too late to change the first version of FDP products, and the
solution of "go back to NVMe and make it match SCSI" is not realistic.

We have drives from several companies out there supporting this version
of FDP. And we have customers that are actively using them. It is our
collective responsibility to support them.

So, considering that file systems _are_ able to use temperature hints and
actually make them work, why don't we support FDP the same way we are
supporting zones so that people can use it in production?

I agree that down the road, an interface that allows hints (many more
than 5!) is needed. And in my opinion, this interface should not have
semantics attached to it, just a hint ID, #hints, and enough space to
put 100s of them to support storage node deployments. But this needs to
come from the users of the hints / zones / streams / etc., not from
us vendors. We have neither the details on how they deploy these
features at scale nor the workloads to validate the results. Anything
else will probably just continue polluting the storage stack with more
interfaces that are not used and add to the problem of data placement
fragmentation.

>
>> The whole point is using an existing interface to cover the use-case of
>> people wanting hints in block.
>
>And that's fine.  And the point, all the way back for months, is that
>doing a completely dumb mapping in the driver for that is fundamentally
>wrong.  Kanchan's previous series did about 50% of the work to get
>rid of it, but then everyone dropped dead and played politics.

The issue is that the first series of this patch, which is as simple as
it gets, hit the list in May. Since then we have gone down paths that lead
nowhere. So the line between real technical feedback that leads to
a feature being merged, and technical misdirection that just keeps people
busy, becomes very thin. In the whole data placement effort, we have
been down this path many times, unfortunately...

At LPC, we discussed the work you did in XFS and it became clear
that going back to the first version of the patches was the easiest way
to support the use-case that current customers have. It is a pity you
did not attend the conference and could not argue against this line
of thought there.

So, summarizing, all that folks in this thread are asking for is a path
forward for solving the block use-case, which is the only one that
I am aware of requiring kernel support. This has been validated with
real workloads, so it is very specific and tangible.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  6:24                         ` Christoph Hellwig
@ 2024-10-04  6:59                           ` Javier González
  2024-10-04 12:32                             ` Christoph Hellwig
  0 siblings, 1 reply; 44+ messages in thread
From: Javier González @ 2024-10-04  6:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Martin K. Petersen, Keith Busch, Jens Axboe,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g

On 04.10.2024 08:24, Christoph Hellwig wrote:
>On Fri, Oct 04, 2024 at 08:21:29AM +0200, Javier González wrote:
>>> So I think some people's bonuses depend on not understanding the problem,
>>> I fear :(
>>>
>>
>> Please, don't.
>>
>> Childish comments like this delegitimize the work that a lot of people
>> are doing in Linux.
>
>It's the only very sarcastic explanation I can come up with to explain
>this discussion, where people from exactly the two companies where this
>might be bonus material love political discussion and drop dead when it
>turns technical.

FDP has authors from Meta, Google, Kioxia, Micron, Hynix, Solidigm,
Microchip, Marvell, FADU, WDC, and Samsung.

The fact that 2 of these companies are the ones starting to build the
Linux ecosystem should not surprise you, as that is the way things
normally work.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  6:52                             ` Javier González
@ 2024-10-04 12:30                               ` Christoph Hellwig
  2024-10-07 10:10                                 ` Javier González
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-04 12:30 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel,
	io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
> So, considering that file systems _are_ able to use temperature hints and
> actually make them work, why don't we support FDP the same way we are
> supporting zones so that people can use it in production?

Because apparently no one has tried it.  It should be possible in theory,
but for example unless you have a power of 2 reclaim unit size it
won't work very well with XFS, where the AGs/RTGs must be power of two
aligned in the LBA space, except by overprovisioning the LBA space vs
the capacity actually used.
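
(To put made-up numbers on that last point: with, say, a 768 MiB reclaim
unit, keeping the allocation groups power-of-two aligned means rounding each
group up to 1 GiB, so roughly a quarter of every group's LBA range would have
to go unused, which only works if the LBA space is overprovisioned vs the
capacity actually used.)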

> I agree that down the road, an interface that allows hints (many more
> than 5!) is needed. And in my opinion, this interface should not have
> semantics attached to it, just a hint ID, #hints, and enough space to
> put 100s of them to support storage node deployments. But this needs to
> come from the users of the hints / zones / streams / etc., not from
> us vendors. We have neither the details on how they deploy these
> features at scale nor the workloads to validate the results. Anything
> else will probably just continue polluting the storage stack with more
> interfaces that are not used and add to the problem of data placement
> fragmentation.

Please always mention what layer you are talking about.  At the syscall
level the temperature hints are doing quite ok.  A full stream separation
would obviously be a lot better, as would be communicating the zone /
reclaim unit / etc. size.

As an interface to a driver that doesn't natively speak temperature
hints, on the other hand, it doesn't work at all.

> The issue is that the first series of this patch, which is as simple as
> it gets, hit the list in May. Since then we have gone down paths that lead
> nowhere. So the line between real technical feedback that leads to
> a feature being merged, and technical misdirection that just keeps people
> busy, becomes very thin. In the whole data placement effort, we have
> been down this path many times, unfortunately...

Well, the previous round was the first one actually trying to address the
fundamental issue after 4 months.  And then after a first round of feedback
it gets shut down somehow out of nowhere.  As a maintainer and reviewer,
those are the kind of contributors I have a hard time taking seriously.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  6:59                           ` Javier González
@ 2024-10-04 12:32                             ` Christoph Hellwig
  2024-10-07 11:29                               ` Javier González
  0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-04 12:32 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Bart Van Assche, Martin K. Petersen,
	Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g

On Fri, Oct 04, 2024 at 08:59:23AM +0200, Javier González wrote:
> FDP has authors from Meta, Google, Kioxia, Micron, Hynix, Solidigm,
> Microchip, Marvell, FADU, WDC, and Samsung.
>
> The fact that 2 of these companies are the ones starting to build the
> Linux ecosystem should not surprise you, as that is the way things
> normally work.

That's not the point.  There is one company that drives entirely pointless
marketing BS, and that one is pretty central here.  The same company
that said FDP has absolutely no intent to work on Linux and fought my
initial attempt to make the protocol not totally unusable on a layered system.
And no, that's not Samsung.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04 12:30                               ` Christoph Hellwig
@ 2024-10-07 10:10                                 ` Javier González
  0 siblings, 0 replies; 44+ messages in thread
From: Javier González @ 2024-10-07 10:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, hare,
	sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g

On 04.10.2024 14:30, Christoph Hellwig wrote:
>On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
>> So, considering that file systems _are_ able to use temperature hints and
>> actually make them work, why don't we support FDP the same way we are
>> supporting zones so that people can use it in production?
>
>Because apparently no one has tried it.  It should be possible in theory,
>but for example unless you have a power of 2 reclaim unit size it
>won't work very well with XFS, where the AGs/RTGs must be power of two
>aligned in the LBA space, except by overprovisioning the LBA space vs
>the capacity actually used.

This is good. I think we should have at least an FS POC with data
placement support to be able to draw conclusions on what the interface
and requirements should be. Until we have that, we can support the
use-cases that we know customers are asking for, i.e., block-level hints
through the existing temperature API.

>
>> I agree that down the road, an interface that allows hints (many more
>> than 5!) is needed. And in my opinion, this interface should not have
>> semantics attached to it, just a hint ID, #hints, and enough space to
>> put 100s of them to support storage node deployments. But this needs to
>> come from the users of the hints / zones / streams / etc., not from
>> us vendors. We have neither the details on how they deploy these
>> features at scale nor the workloads to validate the results. Anything
>> else will probably just continue polluting the storage stack with more
>> interfaces that are not used and add to the problem of data placement
>> fragmentation.
>
>Please always mention what layer you are talking about.  At the syscall
>level the temperature hints are doing quite ok.  A full stream separation
>would obviously be a lot better, as would be communicating the zone /
>reclaim unit / etc. size.

I mean at the syscall level. But as mentioned above, we need to be very
sure that we have a clear use-case for that. If we continue seeing hints
being used in NVMe or other protocols, and the number increases
significantly, we can deal with it later on.

>
>As an interface to a driver that doesn't natively speak temperature
>hints, on the other hand, it doesn't work at all.
>
>> The issue is that the first series of this patch, which is as simple as
>> it gets, hit the list in May. Since then we have gone down paths that lead
>> nowhere. So the line between real technical feedback that leads to
>> a feature being merged, and technical misdirection that just keeps people
>> busy, becomes very thin. In the whole data placement effort, we have
>> been down this path many times, unfortunately...
>
>Well, the previous round was the first one actually trying to address the
>fundamental issue after 4 months.  And then after a first round of feedback
>it gets shut down somehow out of nowhere.  As a maintainer and reviewer,
>those are the kind of contributors I have a hard time taking seriously.

I am not sure I understand what you mean in the last sentence, so I will
not respond by filling in the blanks with a bad interpretation.

In summary, what we are asking for is to take the patches that cover the
current use-case, and work together on what might be needed for better
FS support. For this, it seems you and Hans have a good idea of what you
want to have based on XFS. We can help review or do part of the work,
but trying to guess our way forward will only delay existing customers using
existing HW.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04 12:32                             ` Christoph Hellwig
@ 2024-10-07 11:29                               ` Javier González
  0 siblings, 0 replies; 44+ messages in thread
From: Javier González @ 2024-10-07 11:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Martin K. Petersen, Keith Busch, Jens Axboe,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g

On 04.10.2024 14:32, Christoph Hellwig wrote:
>On Fri, Oct 04, 2024 at 08:59:23AM +0200, Javier González wrote:
>> FDP has authors from Meta, Google, Kioxia, Micron, Hynix, Solidigm,
>> Microchip, Marvell, FADU, WDC, and Samsung.
>>
>> The fact that 2 of these companies are the ones starting to build the
>> Linux ecosystem should not surprise you, as that is the way things
>> normally work.
>
>That's not the point.  There is one company that drives entirely pointless
>marketing BS, and that one is pretty central here.  The same company
>that said FDP has absolutely no intent to work on Linux and fought my
>initial attempt to make the protocol not totally unusable on a layered system.
>And no, that's not Samsung.

So you had an interaction in the working group, your feedback was not
taken into consideration by the authors, and the result is that FDP
cannot be supported in Linux as a consequence of that? Come on...

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2024-10-07 11:29 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <CGME20240930182052epcas5p37edefa7556b87c3fbb543275756ac736@epcas5p3.samsung.com>
2024-09-30 18:13 ` [PATCH v7 0/3] FDP and per-io hints Kanchan Joshi
     [not found]   ` <CGME20240930182056epcas5p33f823c00caadf9388b509bafcad86f3d@epcas5p3.samsung.com>
2024-09-30 18:13     ` [PATCH v7 1/3] nvme: enable FDP support Kanchan Joshi
2024-10-02 18:37       ` Bart Van Assche
2024-10-03 12:55         ` Christoph Hellwig
     [not found]   ` <CGME20240930182100epcas5p31a010c225f3c76aa4dc54fced32abd2a@epcas5p3.samsung.com>
2024-09-30 18:13     ` [PATCH v7 2/3] block, fs: restore kiocb based write hint processing Kanchan Joshi
     [not found]   ` <CGME20240930182103epcas5p4c9e91ca3cdf20e900b1425ae45fef81d@epcas5p4.samsung.com>
2024-09-30 18:13     ` [PATCH v7 3/3] io_uring: enable per-io hinting capability Kanchan Joshi
2024-10-02 14:26       ` Pavel Begunkov
2024-10-02 18:29       ` Bart Van Assche
2024-10-01  9:20   ` [PATCH v7 0/3] FDP and per-io hints Christoph Hellwig
2024-10-01 15:58     ` James R. Bergsten
2024-10-01 16:18     ` Jens Axboe
2024-10-02  7:51       ` Christoph Hellwig
2024-10-02 15:03         ` Jens Axboe
2024-10-02 15:13           ` Christoph Hellwig
2024-10-02 15:17             ` Keith Busch
2024-10-02 15:19               ` Christoph Hellwig
2024-10-02 15:33                 ` Keith Busch
2024-10-03 12:51                   ` Christoph Hellwig
2024-10-02 15:47                 ` Martin K. Petersen
2024-10-02 18:34                   ` Bart Van Assche
2024-10-03 12:55                     ` Christoph Hellwig
2024-10-03 21:48                       ` Keith Busch
2024-10-03 22:00                         ` Bart Van Assche
2024-10-03 22:12                           ` Jens Axboe
2024-10-03 22:17                           ` Keith Busch
2024-10-04  6:21                       ` Javier González
2024-10-04  6:24                         ` Christoph Hellwig
2024-10-04  6:59                           ` Javier González
2024-10-04 12:32                             ` Christoph Hellwig
2024-10-07 11:29                               ` Javier González
2024-10-03 12:54                   ` Christoph Hellwig
2024-10-03 22:14                     ` Jens Axboe
2024-10-04  5:31                       ` Christoph Hellwig
2024-10-04  6:18                         ` Javier González
2024-10-04  6:27                           ` Christoph Hellwig
2024-10-04  6:52                             ` Javier González
2024-10-04 12:30                               ` Christoph Hellwig
2024-10-07 10:10                                 ` Javier González
2024-10-02 15:22             ` Jens Axboe
2024-10-01 16:23     ` Keith Busch
2024-10-02  7:49       ` Christoph Hellwig
2024-10-02 14:56         ` Keith Busch
2024-10-02 15:00           ` Jens Axboe
2024-10-03  0:20   ` Bart Van Assche

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox