public inbox for [email protected]
* [PATCH v7 0/3] FDP and per-io hints
       [not found] <CGME20240930182052epcas5p37edefa7556b87c3fbb543275756ac736@epcas5p3.samsung.com>
@ 2024-09-30 18:13 ` Kanchan Joshi
       [not found]   ` <CGME20240930182056epcas5p33f823c00caadf9388b509bafcad86f3d@epcas5p3.samsung.com>
                     ` (5 more replies)
  0 siblings, 6 replies; 86+ messages in thread
From: Kanchan Joshi @ 2024-09-30 18:13 UTC (permalink / raw)
  To: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi

Another spin to incorporate the feedback from LPC and previous
iterations. The series adds two capabilities:
- FDP support at NVMe level (patch #1)
- Per-io hinting via io_uring (patch #3)
Patch #2 is a prerequisite for the per-io hints.

The motivation and interface details are present in the commit
descriptions.

Testing:
Done with fcntl- and liburing-based custom applications,
on a raw block device and on ext4, XFS, btrfs, and F2FS.
Checked that no regression occurs for applications that use per-inode
hints.
Checked that per-io hints, when passed, take precedence over per-inode
hints.

Changes since v6:
- Change io_uring interface to pass hints as SQE metadata (Pavel, Hannes)

Changes since v5:
- Drop placement hints
- Add per-io hint interface

Changes since v4:
- Retain the size/type checking on the enum (Bart)
- Use the name "*_lifetime_hint" rather than "*_life_hint" (Bart)

Changes since v3:
- 4 new patches to introduce placement hints
- Make nvme patch use the placement hints rather than lifetime hints

Changes since v2:
- Base it on nvme-6.11 and resolve a merge conflict

Changes since v1:
- Reduce the fetched plids from 128 to 6 (Keith)
- Use struct_size for a calculation (Keith)
- Handle robot/sparse warning

Kanchan Joshi (3):
  nvme: enable FDP support
  block, fs: restore kiocb based write hint processing
  io_uring: enable per-io hinting capability

 block/fops.c                  |  6 +--
 drivers/nvme/host/core.c      | 70 +++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h      |  4 ++
 fs/aio.c                      |  1 +
 fs/cachefiles/io.c            |  1 +
 fs/direct-io.c                |  2 +-
 fs/fcntl.c                    | 22 -----------
 fs/iomap/direct-io.c          |  2 +-
 include/linux/fs.h            |  8 ++++
 include/linux/nvme.h          | 19 ++++++++++
 include/linux/rw_hint.h       | 24 ++++++++++++
 include/uapi/linux/io_uring.h | 19 ++++++++++
 io_uring/rw.c                 | 24 ++++++++++++
 13 files changed, 175 insertions(+), 27 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH v7 1/3] nvme: enable FDP support
       [not found]   ` <CGME20240930182056epcas5p33f823c00caadf9388b509bafcad86f3d@epcas5p3.samsung.com>
@ 2024-09-30 18:13     ` Kanchan Joshi
  2024-10-02 18:37       ` Bart Van Assche
  0 siblings, 1 reply; 86+ messages in thread
From: Kanchan Joshi @ 2024-09-30 18:13 UTC (permalink / raw)
  To: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi, Hui Qi,
	Nitesh Shetty

Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
to control the placement of logical blocks in order to reduce the SSD's
write amplification factor (WAF).

Userspace can send data lifetime information using write hints.
The SCSI driver (sd) already passes this information to SCSI devices.
This patch does the same for NVMe.

Fetch the placement-identifiers if the device supports FDP.
The incoming write-hint is mapped to a placement-identifier, which in
turn is set in the DSPEC field of the write command.

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Hui Qi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Nacked-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
---
 drivers/nvme/host/core.c | 70 ++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  4 +++
 include/linux/nvme.h     | 19 +++++++++++
 3 files changed, 93 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ba6508455e18..ad5cc1ec8c4f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -44,6 +44,20 @@ struct nvme_ns_info {
 	bool is_removed;
 };
 
+struct nvme_fdp_ruh_status_desc {
+	u16 pid;
+	u16 ruhid;
+	u32 earutr;
+	u64 ruamw;
+	u8  rsvd16[16];
+};
+
+struct nvme_fdp_ruh_status {
+	u8  rsvd0[14];
+	__le16 nruhsd;
+	struct nvme_fdp_ruh_status_desc ruhsd[];
+};
+
 unsigned int admin_timeout = 60;
 module_param(admin_timeout, uint, 0644);
 MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
@@ -959,6 +973,19 @@ static bool nvme_valid_atomic_write(struct request *req)
 	return true;
 }
 
+static inline void nvme_assign_placement_id(struct nvme_ns *ns,
+					struct request *req,
+					struct nvme_command *cmd)
+{
+	enum rw_hint h = req->write_hint;
+
+	if (h >= ns->head->nr_plids)
+		return;
+
+	cmd->rw.control |= cpu_to_le16(NVME_RW_DTYPE_DPLCMT);
+	cmd->rw.dsmgmt |= cpu_to_le32(ns->head->plids[h] << 16);
+}
+
 static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 		struct request *req, struct nvme_command *cmnd,
 		enum nvme_opcode op)
@@ -1078,6 +1105,8 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
 		break;
 	case REQ_OP_WRITE:
 		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
+		if (!ret && ns->head->nr_plids)
+			nvme_assign_placement_id(ns, req, cmd);
 		break;
 	case REQ_OP_ZONE_APPEND:
 		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
@@ -2114,6 +2143,40 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
 	return ret;
 }
 
+static int nvme_fetch_fdp_plids(struct nvme_ns *ns, u32 nsid)
+{
+	struct nvme_command c = {};
+	struct nvme_fdp_ruh_status *ruhs;
+	struct nvme_fdp_ruh_status_desc *ruhsd;
+	int size, ret, i;
+
+	size = struct_size(ruhs, ruhsd, NVME_MAX_PLIDS);
+	ruhs = kzalloc(size, GFP_KERNEL);
+	if (!ruhs)
+		return -ENOMEM;
+
+	c.imr.opcode = nvme_cmd_io_mgmt_recv;
+	c.imr.nsid = cpu_to_le32(nsid);
+	c.imr.mo = 0x1;
+	c.imr.numd =  cpu_to_le32((size >> 2) - 1);
+
+	ret = nvme_submit_sync_cmd(ns->queue, &c, ruhs, size);
+	if (ret)
+		goto out;
+
+	ns->head->nr_plids = le16_to_cpu(ruhs->nruhsd);
+	ns->head->nr_plids =
+		min_t(u16, ns->head->nr_plids, NVME_MAX_PLIDS);
+
+	for (i = 0; i < ns->head->nr_plids; i++) {
+		ruhsd = &ruhs->ruhsd[i];
+		ns->head->plids[i] = le16_to_cpu(ruhsd->pid);
+	}
+out:
+	kfree(ruhs);
+	return ret;
+}
+
 static int nvme_update_ns_info_block(struct nvme_ns *ns,
 		struct nvme_ns_info *info)
 {
@@ -2205,6 +2268,13 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 		if (ret && !nvme_first_scan(ns->disk))
 			goto out;
 	}
+	if (ns->ctrl->ctratt & NVME_CTRL_ATTR_FDPS) {
+		ret = nvme_fetch_fdp_plids(ns, info->nsid);
+		if (ret)
+			dev_warn(ns->ctrl->device,
+				"FDP failure status:0x%x\n", ret);
+	}
+
 
 	ret = 0;
 out:
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 313a4f978a2c..a959a9859e8b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -454,6 +454,8 @@ struct nvme_ns_ids {
 	u8	csi;
 };
 
+#define NVME_MAX_PLIDS   (WRITE_LIFE_EXTREME + 1)
+
 /*
  * Anchor structure for namespaces.  There is one for each namespace in a
  * NVMe subsystem that any of our controllers can see, and the namespace
@@ -490,6 +492,8 @@ struct nvme_ns_head {
 	struct device		cdev_device;
 
 	struct gendisk		*disk;
+	u16			nr_plids;
+	u16			plids[NVME_MAX_PLIDS];
 #ifdef CONFIG_NVME_MULTIPATH
 	struct bio_list		requeue_list;
 	spinlock_t		requeue_lock;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b58d9405d65e..a954eaee5b0f 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -275,6 +275,7 @@ enum nvme_ctrl_attr {
 	NVME_CTRL_ATTR_HID_128_BIT	= (1 << 0),
 	NVME_CTRL_ATTR_TBKAS		= (1 << 6),
 	NVME_CTRL_ATTR_ELBAS		= (1 << 15),
+	NVME_CTRL_ATTR_FDPS		= (1 << 19),
 };
 
 struct nvme_id_ctrl {
@@ -843,6 +844,7 @@ enum nvme_opcode {
 	nvme_cmd_resv_register	= 0x0d,
 	nvme_cmd_resv_report	= 0x0e,
 	nvme_cmd_resv_acquire	= 0x11,
+	nvme_cmd_io_mgmt_recv	= 0x12,
 	nvme_cmd_resv_release	= 0x15,
 	nvme_cmd_zone_mgmt_send	= 0x79,
 	nvme_cmd_zone_mgmt_recv	= 0x7a,
@@ -864,6 +866,7 @@ enum nvme_opcode {
 		nvme_opcode_name(nvme_cmd_resv_register),	\
 		nvme_opcode_name(nvme_cmd_resv_report),		\
 		nvme_opcode_name(nvme_cmd_resv_acquire),	\
+		nvme_opcode_name(nvme_cmd_io_mgmt_recv),	\
 		nvme_opcode_name(nvme_cmd_resv_release),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_send),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_recv),	\
@@ -1015,6 +1018,7 @@ enum {
 	NVME_RW_PRINFO_PRCHK_GUARD	= 1 << 12,
 	NVME_RW_PRINFO_PRACT		= 1 << 13,
 	NVME_RW_DTYPE_STREAMS		= 1 << 4,
+	NVME_RW_DTYPE_DPLCMT		= 2 << 4,
 	NVME_WZ_DEAC			= 1 << 9,
 };
 
@@ -1102,6 +1106,20 @@ struct nvme_zone_mgmt_recv_cmd {
 	__le32			cdw14[2];
 };
 
+struct nvme_io_mgmt_recv_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__le64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__u8			mo;
+	__u8			rsvd11;
+	__u16			mos;
+	__le32			numd;
+	__le32			cdw12[4];
+};
+
 enum {
 	NVME_ZRA_ZONE_REPORT		= 0,
 	NVME_ZRASF_ZONE_REPORT_ALL	= 0,
@@ -1822,6 +1840,7 @@ struct nvme_command {
 		struct nvmf_auth_receive_command auth_receive;
 		struct nvme_dbbuf dbbuf;
 		struct nvme_directive_cmd directive;
+		struct nvme_io_mgmt_recv_cmd imr;
 	};
 };
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 2/3] block, fs: restore kiocb based write hint processing
       [not found]   ` <CGME20240930182100epcas5p31a010c225f3c76aa4dc54fced32abd2a@epcas5p3.samsung.com>
@ 2024-09-30 18:13     ` Kanchan Joshi
  0 siblings, 0 replies; 86+ messages in thread
From: Kanchan Joshi @ 2024-09-30 18:13 UTC (permalink / raw)
  To: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi, Nitesh Shetty

struct kiocb has a 2-byte hole that appeared after commit 41d36a9f3e53
("fs: remove kiocb.ki_hint").
But the write hint has made a comeback with commit 449813515d3e ("block,
fs: Restore the per-bio/request data lifetime fields").

This patch uses the leftover space in kiocb to carve out a 1-byte field,
ki_write_hint, and restores the code that operates on the kiocb to use
ki_write_hint instead of the inode hint value.

This does not bring any behavior change, but it is needed to enable
per-io hints (added by another patch).

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
---
 block/fops.c         | 6 +++---
 fs/aio.c             | 1 +
 fs/cachefiles/io.c   | 1 +
 fs/direct-io.c       | 2 +-
 fs/iomap/direct-io.c | 2 +-
 include/linux/fs.h   | 8 ++++++++
 io_uring/rw.c        | 1 +
 7 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index e696ae53bf1e..85b9b97d372c 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -74,7 +74,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 		bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
 	}
 	bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio.bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio.bi_write_hint = iocb->ki_write_hint;
 	bio.bi_ioprio = iocb->ki_ioprio;
 	if (iocb->ki_flags & IOCB_ATOMIC)
 		bio.bi_opf |= REQ_ATOMIC;
@@ -203,7 +203,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 
 	for (;;) {
 		bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-		bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+		bio->bi_write_hint = iocb->ki_write_hint;
 		bio->bi_private = dio;
 		bio->bi_end_io = blkdev_bio_end_io;
 		bio->bi_ioprio = iocb->ki_ioprio;
@@ -319,7 +319,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	dio->flags = 0;
 	dio->iocb = iocb;
 	bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio->bi_write_hint = iocb->ki_write_hint;
 	bio->bi_end_io = blkdev_bio_end_io_async;
 	bio->bi_ioprio = iocb->ki_ioprio;
 
diff --git a/fs/aio.c b/fs/aio.c
index e8920178b50f..db618817e670 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1517,6 +1517,7 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb, int rw_type)
 	req->ki_flags = req->ki_filp->f_iocb_flags | IOCB_AIO_RW;
 	if (iocb->aio_flags & IOCB_FLAG_RESFD)
 		req->ki_flags |= IOCB_EVENTFD;
+	req->ki_write_hint = file_write_hint(req->ki_filp);
 	if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
 		/*
 		 * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c
index 6a821a959b59..c3db102ae64e 100644
--- a/fs/cachefiles/io.c
+++ b/fs/cachefiles/io.c
@@ -309,6 +309,7 @@ int __cachefiles_write(struct cachefiles_object *object,
 	ki->iocb.ki_pos		= start_pos;
 	ki->iocb.ki_flags	= IOCB_DIRECT | IOCB_WRITE;
 	ki->iocb.ki_ioprio	= get_current_ioprio();
+	ki->iocb.ki_write_hint  = file_write_hint(file);
 	ki->object		= object;
 	ki->start		= start_pos;
 	ki->len			= len;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index bbd05f1a2145..73629e26becb 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -409,7 +409,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 		bio->bi_end_io = dio_bio_end_io;
 	if (dio->is_pinned)
 		bio_set_flag(bio, BIO_PAGE_PINNED);
-	bio->bi_write_hint = file_inode(dio->iocb->ki_filp)->i_write_hint;
+	bio->bi_write_hint = dio->iocb->ki_write_hint;
 
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f637aa0706a3..fff43f121ee6 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -397,7 +397,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 					  GFP_KERNEL);
 		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
-		bio->bi_write_hint = inode->i_write_hint;
+		bio->bi_write_hint = dio->iocb->ki_write_hint;
 		bio->bi_ioprio = dio->iocb->ki_ioprio;
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3c603d01337..3dfe6de7b611 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -370,6 +370,7 @@ struct kiocb {
 	void			*private;
 	int			ki_flags;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	enum rw_hint		ki_write_hint;
 	union {
 		/*
 		 * Only used for async buffered reads, where it denotes the
@@ -2337,12 +2338,18 @@ static inline bool HAS_UNMAPPED_ID(struct mnt_idmap *idmap,
 	       !vfsgid_valid(i_gid_into_vfsgid(idmap, inode));
 }
 
+static inline enum rw_hint file_write_hint(struct file *filp)
+{
+	return file_inode(filp)->i_write_hint;
+}
+
 static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 {
 	*kiocb = (struct kiocb) {
 		.ki_filp = filp,
 		.ki_flags = filp->f_iocb_flags,
 		.ki_ioprio = get_current_ioprio(),
+		.ki_write_hint = file_write_hint(filp),
 	};
 }
 
@@ -2353,6 +2360,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 		.ki_filp = filp,
 		.ki_flags = kiocb_src->ki_flags,
 		.ki_ioprio = kiocb_src->ki_ioprio,
+		.ki_write_hint = kiocb_src->ki_write_hint,
 		.ki_pos = kiocb_src->ki_pos,
 	};
 }
diff --git a/io_uring/rw.c b/io_uring/rw.c
index f023ff49c688..510123d3d837 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1023,6 +1023,7 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	if (unlikely(ret))
 		return ret;
 	req->cqe.res = iov_iter_count(&io->iter);
+	rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
 
 	if (force_nonblock) {
 		/* If the file doesn't support async, just async punt */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v7 3/3] io_uring: enable per-io hinting capability
       [not found]   ` <CGME20240930182103epcas5p4c9e91ca3cdf20e900b1425ae45fef81d@epcas5p4.samsung.com>
@ 2024-09-30 18:13     ` Kanchan Joshi
  2024-10-02 14:26       ` Pavel Begunkov
  2024-10-02 18:29       ` Bart Van Assche
  0 siblings, 2 replies; 86+ messages in thread
From: Kanchan Joshi @ 2024-09-30 18:13 UTC (permalink / raw)
  To: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi, Nitesh Shetty

With the F_SET_RW_HINT fcntl, the user can set a hint on the file inode,
and all subsequent writes to the file pass that hint value down.
This can be limiting for large files (and for block devices), as all
writes can then be tagged with only one lifetime hint value, and
concurrent writes with different hint values are hard to manage.
Per-IO hinting solves that problem.

Allow userspace to pass additional metadata in the SQE.
The type of passed metadata is expressed by a new field

	__u16 meta_type;

At this point one type META_TYPE_LIFETIME_HINT is supported.
With this type, user can pass lifetime hint values in the new field

	__u64 lifetime_val;

This accepts all lifetime hint values that are possible with the
F_SET_RW_HINT fcntl.

The write handlers (io_prep_rw, io_write) send the hint value to the
lower layer using the kiocb. This is good for supporting direct IO,
but not when the kiocb is not available (e.g., buffered IO).

When per-io hints are not passed, the per-inode hint values are set in
the kiocb (as before). Otherwise, the per-io hints take precedence over
per-inode hints.

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
---
 fs/fcntl.c                    | 22 ----------------------
 include/linux/rw_hint.h       | 24 ++++++++++++++++++++++++
 include/uapi/linux/io_uring.h | 19 +++++++++++++++++++
 io_uring/rw.c                 | 25 ++++++++++++++++++++++++-
 4 files changed, 67 insertions(+), 23 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 22dd9dcce7ec..a390a05f4ef8 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -334,28 +334,6 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
 }
 #endif
 
-static bool rw_hint_valid(u64 hint)
-{
-	BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET);
-	BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE);
-	BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT);
-	BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM);
-	BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG);
-	BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME);
-
-	switch (hint) {
-	case RWH_WRITE_LIFE_NOT_SET:
-	case RWH_WRITE_LIFE_NONE:
-	case RWH_WRITE_LIFE_SHORT:
-	case RWH_WRITE_LIFE_MEDIUM:
-	case RWH_WRITE_LIFE_LONG:
-	case RWH_WRITE_LIFE_EXTREME:
-		return true;
-	default:
-		return false;
-	}
-}
-
 static long fcntl_get_rw_hint(struct file *file, unsigned int cmd,
 			      unsigned long arg)
 {
diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h
index 309ca72f2dfb..f4373a71ffed 100644
--- a/include/linux/rw_hint.h
+++ b/include/linux/rw_hint.h
@@ -21,4 +21,28 @@ enum rw_hint {
 static_assert(sizeof(enum rw_hint) == 1);
 #endif
 
+#define	WRITE_LIFE_INVALID	(RWH_WRITE_LIFE_EXTREME + 1)
+
+static inline bool rw_hint_valid(u64 hint)
+{
+	BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET);
+	BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE);
+	BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT);
+	BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM);
+	BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG);
+	BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME);
+
+	switch (hint) {
+	case RWH_WRITE_LIFE_NOT_SET:
+	case RWH_WRITE_LIFE_NONE:
+	case RWH_WRITE_LIFE_SHORT:
+	case RWH_WRITE_LIFE_MEDIUM:
+	case RWH_WRITE_LIFE_LONG:
+	case RWH_WRITE_LIFE_EXTREME:
+		return true;
+	default:
+		return false;
+	}
+}
+
 #endif /* _LINUX_RW_HINT_H */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 86cb385fe0b5..951e35226229 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -92,12 +92,23 @@ struct io_uring_sqe {
 			__u16	addr_len;
 			__u16	__pad3[1];
 		};
+		struct {
+			/* Bit field to express 16 meta types */
+			__u16	meta_type;
+			__u16	__pad4[1];
+		};
 	};
 	union {
 		struct {
 			__u64	addr3;
 			__u64	__pad2[1];
 		};
+		struct {
+			/* First meta type specific fields */
+			__u64	lifetime_val;
+			/* For future use */
+			__u64	__pad5[1];
+		};
 		__u64	optval;
 		/*
 		 * If the ring is initialized with IORING_SETUP_SQE128, then
@@ -107,6 +118,14 @@ struct io_uring_sqe {
 	};
 };
 
+enum io_uring_sqe_meta_type_bits {
+	META_TYPE_LIFETIME_HINT_BIT
+};
+
+/* this meta type covers write hint values supported by F_SET_RW_HINT fcntl */
+#define META_TYPE_LIFETIME_HINT	(1U << META_TYPE_LIFETIME_HINT_BIT)
+
+
 /*
  * If sqe->file_index is set to this for opcodes that instantiate a new
  * direct descriptor (like openat/openat2/accept), then io_uring will allocate
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 510123d3d837..bf45ee8904a4 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -269,6 +269,24 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		rw->kiocb.ki_ioprio = get_current_ioprio();
 	}
 	rw->kiocb.dio_complete = NULL;
+	if (ddir == ITER_SOURCE) {
+		u16 mtype = READ_ONCE(sqe->meta_type);
+
+		rw->kiocb.ki_write_hint = WRITE_LIFE_INVALID;
+		if (mtype) {
+			u64 lhint = READ_ONCE(sqe->lifetime_val);
+
+			if (READ_ONCE(sqe->__pad4[0]) ||
+			    READ_ONCE(sqe->__pad5[0]))
+				return -EINVAL;
+
+			if (mtype != META_TYPE_LIFETIME_HINT ||
+			    !rw_hint_valid(lhint))
+				return -EINVAL;
+
+			rw->kiocb.ki_write_hint = lhint;
+		}
+	}
 
 	rw->addr = READ_ONCE(sqe->addr);
 	rw->len = READ_ONCE(sqe->len);
@@ -1023,7 +1041,12 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	if (unlikely(ret))
 		return ret;
 	req->cqe.res = iov_iter_count(&io->iter);
-	rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
+	/*
+	 * Use per-file hint only if per-io hint is not set.
+	 * We need per-io hint to get precedence.
+	 */
+	if (rw->kiocb.ki_write_hint == WRITE_LIFE_INVALID)
+		rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
 
 	if (force_nonblock) {
 		/* If the file doesn't support async, just async punt */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-09-30 18:13 ` [PATCH v7 0/3] FDP and per-io hints Kanchan Joshi
                     ` (2 preceding siblings ...)
       [not found]   ` <CGME20240930182103epcas5p4c9e91ca3cdf20e900b1425ae45fef81d@epcas5p4.samsung.com>
@ 2024-10-01  9:20   ` Christoph Hellwig
  2024-10-01 15:58     ` James R. Bergsten
                       ` (2 more replies)
  2024-10-03  0:20   ` Bart Van Assche
  2024-10-15  5:50   ` Christoph Hellwig
  5 siblings, 3 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-01  9:20 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

Any reason you completely ignored my feedback on the last version
and did not even answer?

That's not a very productive way to work.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: [PATCH v7 0/3] FDP and per-io hints
  2024-10-01  9:20   ` [PATCH v7 0/3] FDP and per-io hints Christoph Hellwig
@ 2024-10-01 15:58     ` James R. Bergsten
  2024-10-01 16:18     ` Jens Axboe
  2024-10-01 16:23     ` Keith Busch
  2 siblings, 0 replies; 86+ messages in thread
From: James R. Bergsten @ 2024-10-01 15:58 UTC (permalink / raw)
  To: 'Christoph Hellwig', 'Kanchan Joshi'
  Cc: axboe, kbusch, hare, sagi, martin.petersen, brauner, viro, jack,
	jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz

Now THIS is the NVMe I came to know and love.  😊


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-01  9:20   ` [PATCH v7 0/3] FDP and per-io hints Christoph Hellwig
  2024-10-01 15:58     ` James R. Bergsten
@ 2024-10-01 16:18     ` Jens Axboe
  2024-10-02  7:51       ` Christoph Hellwig
  2024-10-01 16:23     ` Keith Busch
  2 siblings, 1 reply; 86+ messages in thread
From: Jens Axboe @ 2024-10-01 16:18 UTC (permalink / raw)
  To: Christoph Hellwig, Kanchan Joshi
  Cc: kbusch, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk,
	bcrl, dhowells, bvanassche, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz

On 10/1/24 3:20 AM, Christoph Hellwig wrote:
> Any reason you completely ignored my feedback on the last version
> and did not even answer?

Probably because he got a little tired of dealing with the bullshit
related to this topic.

> That's not a very productive way to work.

Have to say, that neither have your responses been. Can't really fault
people for just giving up at some point, when no productive end seems to
be in sight.

As far as I'm concerned, this looks fine to me. There are customers
wanting to use us, and folks making drives that support it. It's not
right to continually gatekeep this feature just because you don't like
it.

I need to review the io_uring bits, but that should be the least of it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-01  9:20   ` [PATCH v7 0/3] FDP and per-io hints Christoph Hellwig
  2024-10-01 15:58     ` James R. Bergsten
  2024-10-01 16:18     ` Jens Axboe
@ 2024-10-01 16:23     ` Keith Busch
  2024-10-02  7:49       ` Christoph Hellwig
  2 siblings, 1 reply; 86+ messages in thread
From: Keith Busch @ 2024-10-01 16:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On Tue, Oct 01, 2024 at 11:20:48AM +0200, Christoph Hellwig wrote:
> Any reason you completely ignored my feedback on the last version
> and did not even answer?
> 
> That's not a very productive way to work.

I think it's because he's getting conflicting feedback. The argument
against it is that it's not a good match; however, it's the same match
created for streams, no one complained then, and it's an existing user
ABI. FDP is the new "streams", so I really don't see a big deal here. I
understand you have use cases that have filesystems intercept the hints
for different types of devices, but I don't see these use cases as
mutually exclusive. Supporting this one doesn't prevent other uses.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-01 16:23     ` Keith Busch
@ 2024-10-02  7:49       ` Christoph Hellwig
  2024-10-02 14:56         ` Keith Busch
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-02  7:49 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Kanchan Joshi, axboe, hare, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Tue, Oct 01, 2024 at 10:23:43AM -0600, Keith Busch wrote:
> I think because he's getting conflicting feedback. The arguments against
> it being that it's not a good match, however it's the same match created
> for streams, and no one complained then, and it's an existing user ABI.

People complained, and the streams code got removed pretty quickly
because it did not work as expected.  I don't think that counts as
a big success.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-01 16:18     ` Jens Axboe
@ 2024-10-02  7:51       ` Christoph Hellwig
  2024-10-02 15:03         ` Jens Axboe
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-02  7:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Kanchan Joshi, kbusch, hare, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Tue, Oct 01, 2024 at 10:18:41AM -0600, Jens Axboe wrote:
> Have to say, that neither have your responses been. Can't really fault
> people for just giving up at some point, when no productive end seems to
> be in sight.

I heavily disagree and take offence on that.

The previous stream separation approach made total sense, but just
needed a fair amount of work.  But it closely matches how things work
at the hardware and file system level, so it was the right approach.

Suddenly dropping that and ignoring all comments really feels like
someone urgently needs to pull a marketing stunt here.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 3/3] io_uring: enable per-io hinting capability
  2024-09-30 18:13     ` [PATCH v7 3/3] io_uring: enable per-io hinting capability Kanchan Joshi
@ 2024-10-02 14:26       ` Pavel Begunkov
  2024-10-17 14:58         ` Kanchan Joshi
  2024-10-02 18:29       ` Bart Van Assche
  1 sibling, 1 reply; 86+ messages in thread
From: Pavel Begunkov @ 2024-10-02 14:26 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Nitesh Shetty

On 9/30/24 19:13, Kanchan Joshi wrote:
> With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and
> all the subsequent writes on the file pass that hint value down.
> This can be limiting for large files (and for block device) as all the
> writes can be tagged with only one lifetime hint value.
> Concurrent writes (with different hint values) are hard to manage.
> Per-IO hinting solves that problem.
> 
> Allow userspace to pass additional metadata in the SQE.
> The type of passed metadata is expressed by a new field
> 
> 	__u16 meta_type;

The new layout looks nicer, but let me elaborate on the previous
comment. I don't believe we should be restricting to only one
attribute per IO. What if someone wants to pass a lifetime hint
together with integrity information?

Instead, we might need something more extensible like an ability
to pass a list / array of typed attributes / meta information / hints
etc. An example from networking I gave last time was control messages,
i.e. cmsg. In a basic oversimplified form the API from the user
perspective could look like:

struct meta_attr {
	u16 type;
	u64 data;
};

struct meta_attr attr[] = {{HINT, hint_value}, {INTEGRITY, ptr}};
sqe->meta_attrs = attr;
sqe->meta_nr = 2;

I'm pretty sure people will name a bunch of considerations the API
should account for, like sizing the struct so it can fit integrity
bits, or even making it variable sized.


> At this point one type META_TYPE_LIFETIME_HINT is supported.
> With this type, user can pass lifetime hint values in the new field
> 
> 	__u64 lifetime_val;
> 
> This accepts all lifetime hint values that are possible with
> F_SET_RW_HINT fcntl.
> 
> The write handlers (io_prep_rw, io_write) send the hint value to the
> lower layer using the kiocb. This is good for supporting direct IO,
> but not when a kiocb is not available (e.g., buffered IO).
> 
> When per-io hints are not passed, the per-inode hint values are set in
> the kiocb (as before). Otherwise, per-io hints take precedence over
> per-inode hints.
-- 
Pavel Begunkov


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02  7:49       ` Christoph Hellwig
@ 2024-10-02 14:56         ` Keith Busch
  2024-10-02 15:00           ` Jens Axboe
  0 siblings, 1 reply; 86+ messages in thread
From: Keith Busch @ 2024-10-02 14:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 09:49:26AM +0200, Christoph Hellwig wrote:
> On Tue, Oct 01, 2024 at 10:23:43AM -0600, Keith Busch wrote:
> > I think because he's getting conflicting feedback. The arguments against
> > it being that it's not a good match, however it's the same match created
> > for streams, and no one complained then, and it's an existing user ABI.
> 
> People complained, and the streams code got removed pretty quickly
> because it did not work as expected.  I don't think that counts as
> a big success.

I don't think the kernel API was the problem. Capable devices never
materialized, so the code wasn't doing anything useful.


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 14:56         ` Keith Busch
@ 2024-10-02 15:00           ` Jens Axboe
  0 siblings, 0 replies; 86+ messages in thread
From: Jens Axboe @ 2024-10-02 15:00 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: Kanchan Joshi, hare, sagi, martin.petersen, brauner, viro, jack,
	jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz

On 10/2/24 8:56 AM, Keith Busch wrote:
> On Wed, Oct 02, 2024 at 09:49:26AM +0200, Christoph Hellwig wrote:
>> On Tue, Oct 01, 2024 at 10:23:43AM -0600, Keith Busch wrote:
>>> I think because he's getting conflicting feedback. The arguments against
>>> it being that it's not a good match, however it's the same match created
>>> for streams, and no one complained then, and it's an existing user ABI.
>>
>> People complained, and the streams code got removed pretty quickly
>> because it did not work as expected.  I don't think that counts as
>> a big success.
> 
> I don't think the kernel API was the problem. Capable devices never
> materialized, so the code wasn't doing anything useful.

Exactly. I never saw ones that kept the stream persistent across GC,
and hence they ended up being pretty useless in practice.

-- 
Jens Axboe



* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02  7:51       ` Christoph Hellwig
@ 2024-10-02 15:03         ` Jens Axboe
  2024-10-02 15:13           ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Jens Axboe @ 2024-10-02 15:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, kbusch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On 10/2/24 1:51 AM, Christoph Hellwig wrote:
> On Tue, Oct 01, 2024 at 10:18:41AM -0600, Jens Axboe wrote:
>> Have to say, that neither have your responses been. Can't really fault
>> people for just giving up at some point, when no productive end seems to
>> be in sight.
> 
> I heavily disagree and take offence at that.
> 
> The previous stream separation approach made total sense, but just
> needed a fair amount of work.  But it closely matches how things work
> at the hardware and file system level, so it was the right approach.

What am I missing that makes this effort that different from streams?
Both are a lifetime hint.

> Suddenly dropping that and ignoring all comments really feels like
> someone urgently needs to pull a marketing stunt here.

I think someone is just trying to finally get this feature in, so it
can get used by customers, after many months of fairly unproductive
discussion.

-- 
Jens Axboe



* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:03         ` Jens Axboe
@ 2024-10-02 15:13           ` Christoph Hellwig
  2024-10-02 15:17             ` Keith Busch
  2024-10-02 15:22             ` Jens Axboe
  0 siblings, 2 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-02 15:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Kanchan Joshi, kbusch, hare, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 09:03:22AM -0600, Jens Axboe wrote:
> > The previous stream separation approach made total sense, but just
> > needed a fair amount of work.  But it closely matches how things work
> > at the hardware and file system level, so it was the right approach.
> 
> What am I missing that makes this effort that different from streams?
> Both are a lifetime hint.

A stream has a lot less strings attached.  But hey, don't make me
argue for streams - you pushed the API in despite reservations back
then, and we've learned a lot since.

> > Suddenly dropping that and ignoring all comments really feels like
> > someone urgently needs to pull a marketing stunt here.
> 
> I think someone is just trying to finally get this feature in, so it
> can get used by customers, after many months of fairly unproductive
> discussion.

Well, he finally started on the right approach and gave it up after the
first round of feedback.  That's just a bit weird.



* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:13           ` Christoph Hellwig
@ 2024-10-02 15:17             ` Keith Busch
  2024-10-02 15:19               ` Christoph Hellwig
  2024-10-02 15:22             ` Jens Axboe
  1 sibling, 1 reply; 86+ messages in thread
From: Keith Busch @ 2024-10-02 15:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Kanchan Joshi, hare, sagi, martin.petersen, brauner,
	viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 05:13:44PM +0200, Christoph Hellwig wrote:
> 
> Well, he finally started on the right approach and gave it up after the
> first round of feedback.  That's just a bit weird.

Nothing prevents future improvements in that direction. It just seems
out of scope for what Kanchan is trying to enable for his customer use
cases. This patch looks harmless.


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:17             ` Keith Busch
@ 2024-10-02 15:19               ` Christoph Hellwig
  2024-10-02 15:33                 ` Keith Busch
  2024-10-02 15:47                 ` Martin K. Petersen
  0 siblings, 2 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-02 15:19 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Kanchan Joshi, hare, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 09:17:35AM -0600, Keith Busch wrote:
> On Wed, Oct 02, 2024 at 05:13:44PM +0200, Christoph Hellwig wrote:
> > 
> > Well, he finally started on the right approach and gave it up after the
> > first round of feedback.  That's just a bit weird.
> 
> Nothing prevents future improvements in that direction. It just seems
> out of scope for what Kanchan is trying to enable for his customer use
> cases. This patch looks harmless.

It's not, really.  Once we wire it up like this we mess up the ability
to use the feature in other ways.  Additionally, the per-I/O hints are
simply broken if you want a file system in the way, as discussed last
time, and, if I don't misremember, you acknowledged that in person last week.


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:13           ` Christoph Hellwig
  2024-10-02 15:17             ` Keith Busch
@ 2024-10-02 15:22             ` Jens Axboe
  1 sibling, 0 replies; 86+ messages in thread
From: Jens Axboe @ 2024-10-02 15:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, kbusch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On 10/2/24 9:13 AM, Christoph Hellwig wrote:
> On Wed, Oct 02, 2024 at 09:03:22AM -0600, Jens Axboe wrote:
>>> The previous stream separation approach made total sense, but just
>>> needed a fair amount of work.  But it closely matches how things work
>>> at the hardware and file system level, so it was the right approach.
>>
>> What am I missing that makes this effort that different from streams?
>> Both are a lifetime hint.
> 
> A stream has a lot less strings attached.  But hey, don't make me
> argue for streams - you pushed the API in despite reservations back
> then, and we've learned a lot since.

Hah, I remember arguing for it back then! It was just a pain to use, but
thankfully we could kill it when devices didn't materialize with a
quality of implementation that made them useful. And then we just yanked
it. Nothing is forever, and hints in particular are something that can
always be ignored. Same for this.

>>> Suddenly dropping that and ignoring all comments really feels like
>>> someone urgently needs to pull a marketing stunt here.
>>
>> I think someone is just trying to finally get this feature in, so it
>> can get used by customers, after many months of fairly unproductive
>> discussion.
> 
> Well, he finally started on the right approach and gave it up after the
> first round of feedback.  That's just a bit weird.

I'm a big fan of keep-it-simple-stupid, and then you can improve it down
the line. I think the simple approach stands just fine alone.

-- 
Jens Axboe


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:19               ` Christoph Hellwig
@ 2024-10-02 15:33                 ` Keith Busch
  2024-10-03 12:51                   ` Christoph Hellwig
  2024-10-02 15:47                 ` Martin K. Petersen
  1 sibling, 1 reply; 86+ messages in thread
From: Keith Busch @ 2024-10-02 15:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Kanchan Joshi, hare, sagi, martin.petersen, brauner,
	viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 05:19:49PM +0200, Christoph Hellwig wrote:
> > Nothing prevents future improvements in that direction. It just seems
> > out of scope for what Kanchan is trying to enable for his customer use
> > cases. This patch looks harmless.
> 
> It's not, really.  Once we wire it up like this we mess up the ability
> to use the feature in other ways.  Additionally, the per-I/O hints are
> simply broken if you want a file system in the way, as discussed last
> time, and, if I don't misremember, you acknowledged that in person last week.

You remember right. I also explained the use cases for per-io hints are
to replace current passthrough users with generic read/write paths that
have all the memory guard rails, read/write stats, and other features
that you don't get with passthrough. This just lets you accomplish the
same placement hints with the nvme raw block device that you get with
the nvme char device.


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:19               ` Christoph Hellwig
  2024-10-02 15:33                 ` Keith Busch
@ 2024-10-02 15:47                 ` Martin K. Petersen
  2024-10-02 18:34                   ` Bart Van Assche
  2024-10-03 12:54                   ` Christoph Hellwig
  1 sibling, 2 replies; 86+ messages in thread
From: Martin K. Petersen @ 2024-10-02 15:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz


Christoph,

>> Nothing prevents future improvements in that direction. It just seems
>> out of scope for what Kanchan is trying to enable for his customer use
>> cases. This patch looks harmless.
>
> It's not, really.  Once we wire it up like this we mess up the ability
> to use the feature in other ways.  Additionally, the per-I/O hints are
> simply broken if you want a file system

Here is my take:

It is the kernel's job to manage the system's hardware resources and
arbitrate and share these resources optimally and fairly between all
running applications.

What irks me is defining application interfaces which fundamentally tell
the kernel that "these blocks are part of the same file".

The kernel already knows this. It is the very entity which provides that
abstraction. Why do we need an explicit interface to inform the kernel
that concurrent writes to the same file should have the same
"temperature" or need to go to the same "bin" on the storage device?
Shouldn't that just happen automatically?

Whether it's SCSI groups, streams, UFS hints, or NVMe FDP, it seems like
we are consistently failing to deliver something that actually works for
anything but a few specialized corner cases. I think that is a shame.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH v7 3/3] io_uring: enable per-io hinting capability
  2024-09-30 18:13     ` [PATCH v7 3/3] io_uring: enable per-io hinting capability Kanchan Joshi
  2024-10-02 14:26       ` Pavel Begunkov
@ 2024-10-02 18:29       ` Bart Van Assche
  1 sibling, 0 replies; 86+ messages in thread
From: Bart Van Assche @ 2024-10-02 18:29 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Nitesh Shetty

On 9/30/24 11:13 AM, Kanchan Joshi wrote:
> diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h
> index 309ca72f2dfb..f4373a71ffed 100644
> --- a/include/linux/rw_hint.h
> +++ b/include/linux/rw_hint.h
> @@ -21,4 +21,28 @@ enum rw_hint {
>   static_assert(sizeof(enum rw_hint) == 1);
>   #endif
>   
> +#define	WRITE_LIFE_INVALID	(RWH_WRITE_LIFE_EXTREME + 1)
> +
> +static inline bool rw_hint_valid(u64 hint)
> +{
> +	BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET);
> +	BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE);
> +	BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT);
> +	BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM);
> +	BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG);
> +	BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME);
> +
> +	switch (hint) {
> +	case RWH_WRITE_LIFE_NOT_SET:
> +	case RWH_WRITE_LIFE_NONE:
> +	case RWH_WRITE_LIFE_SHORT:
> +	case RWH_WRITE_LIFE_MEDIUM:
> +	case RWH_WRITE_LIFE_LONG:
> +	case RWH_WRITE_LIFE_EXTREME:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}

Moving a function that is not in the hot path from a .c file to a .h
file is wrong. This increases the kernel size, slows down compilation
and makes implementation details visible to callers that should not have
access to these implementation details.

Bart.


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:47                 ` Martin K. Petersen
@ 2024-10-02 18:34                   ` Bart Van Assche
  2024-10-03 12:55                     ` Christoph Hellwig
  2024-10-03 12:54                   ` Christoph Hellwig
  1 sibling, 1 reply; 86+ messages in thread
From: Bart Van Assche @ 2024-10-02 18:34 UTC (permalink / raw)
  To: Martin K. Petersen, Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz

On 10/2/24 8:47 AM, Martin K. Petersen wrote:
> What irks me is defining application interfaces which fundamentally tell
> the kernel that "these blocks are part of the same file".

Isn't FDP about communicating much more than only this information to
the block device, e.g. information about reclaim units? Although I'm
personally not interested in FDP, my colleagues were involved in the
standardization of FDP.

Thanks,

Bart.


* Re: [PATCH v7 1/3] nvme: enable FDP support
  2024-09-30 18:13     ` [PATCH v7 1/3] nvme: enable FDP support Kanchan Joshi
@ 2024-10-02 18:37       ` Bart Van Assche
  2024-10-03 12:55         ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Bart Van Assche @ 2024-10-02 18:37 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Hui Qi, Nitesh Shetty

On 9/30/24 11:13 AM, Kanchan Joshi wrote:
> Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
> to control the placement of logical blocks so as to reduce the SSD WAF.
> 
> Userspace can send the data lifetime information using the write hints.
> The SCSI driver (sd) can already pass this information to the SCSI
> devices. This patch does the same for NVMe.
> 
> Fetch the placement-identifiers if the device supports FDP.
> The incoming write-hint is mapped to a placement-identifier, which in
> turn is set in the DSPEC field of the write command.

Is the description of this patch correct? The above description suggests
that FDP information is similar in nature to SCSI data lifetime
information. I don't think that's correct.

Thanks,

Bart.


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-09-30 18:13 ` [PATCH v7 0/3] FDP and per-io hints Kanchan Joshi
                     ` (3 preceding siblings ...)
  2024-10-01  9:20   ` [PATCH v7 0/3] FDP and per-io hints Christoph Hellwig
@ 2024-10-03  0:20   ` Bart Van Assche
  2024-10-15  5:50   ` Christoph Hellwig
  5 siblings, 0 replies; 86+ messages in thread
From: Bart Van Assche @ 2024-10-03  0:20 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On 9/30/24 11:13 AM, Kanchan Joshi wrote:
> Another spin to incorporate the feedback from LPC and previous
> iterations. The series adds two capabilities:
> - FDP support at NVMe level (patch #1)
> - Per-io hinting via io_uring (patch #3)
> Patch #2 is needed to do per-io hints.
> 
> The motivation and interface details are present in the commit
> descriptions.

Something is missing from the patch descriptions. What is the goal
of this patch series? Is the goal to support FDP in filesystems in
the kernel, is the goal to support filesystems implemented in user
space or perhaps something else? I think this should be mentioned in
the cover letter.

Thanks,

Bart.



* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:33                 ` Keith Busch
@ 2024-10-03 12:51                   ` Christoph Hellwig
  0 siblings, 0 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:51 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Kanchan Joshi, hare, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 09:33:16AM -0600, Keith Busch wrote:
> You remember right. I also explained the use cases for per-io hints are
> to replace current passthrough users with generic read/write paths that
> have all the memory guard rails, read/write stats, and other features
> that you don't get with passthrough. This just lets you accomplish the
> same placement hints with the nvme raw block device that you get with
> the nvme char device.

Yes, and as I said before, we need an opt-in from the file_ops, because
file systems can't support this.  And I'm speaking from a position of
having worked on a file system that made the existing non-ideal hints
work (in all fairness, Hans did the hard work, but I was part of it).


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:47                 ` Martin K. Petersen
  2024-10-02 18:34                   ` Bart Van Assche
@ 2024-10-03 12:54                   ` Christoph Hellwig
  2024-10-03 22:14                     ` Jens Axboe
  1 sibling, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:54 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Kanchan Joshi, hare,
	sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 11:47:32AM -0400, Martin K. Petersen wrote:
> It is the kernel's job to manage the system's hardware resources and
> arbitrate and share these resources optimally and fairly between all
> running applications.

Exactly.

> What irks me is defining application interfaces which fundamentally tell
> the kernel that "these blocks are part of the same file".
> 
> The kernel already knows this. It is the very entity which provides that
> abstraction. Why do we need an explicit interface to inform the kernel
> that concurrent writes to the same file should have the same
> "temperature" or need to go to the same "bin" on the storage device?
> Shouldn't that just happen automatically?

For files: yes.  The problem is when you have more files than buckets on
the device or file system.  Typical enterprise SSDs support somewhere
between 8 and 16 write streams, and there typically is more data than
that.  So trying to group files somehow is a good idea, as not all files
can have their own bucket.

Allowing this inside a file, as done in this patch set, on the other
hand is pretty crazy.

> Whether it's SCSI groups, streams, UFS hints, or NVMe FDP, it seems like
> we are consistently failing to deliver something that actually works for
> anything but a few specialized corner cases. I think that is a shame.

Yes, it is.  And as someone who has been sitting in this group, that is
because it's always someone in a place of power forcing their version
down because they got a promotion, a best-of-show award, or whatever.



* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 18:34                   ` Bart Van Assche
@ 2024-10-03 12:55                     ` Christoph Hellwig
  2024-10-03 21:48                       ` Keith Busch
  2024-10-04  6:21                       ` Javier González
  0 siblings, 2 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:55 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Martin K. Petersen, Christoph Hellwig, Keith Busch, Jens Axboe,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 11:34:47AM -0700, Bart Van Assche wrote:
> Isn't FDP about communicating much more than only this information to
> the block device, e.g. information about reclaim units? Although I'm
> personally not interested in FDP, my colleagues were involved in the
> standardization of FDP.

Yes, it is.  And when I explained how properly exporting this kind of
information could be implemented on top of the version Kanchan sent,
everyone suddenly stopped discussing technical points and went either
silent or all political.

So I think some people's bonuses depend on not understanding the problem,
I fear :(



* Re: [PATCH v7 1/3] nvme: enable FDP support
  2024-10-02 18:37       ` Bart Van Assche
@ 2024-10-03 12:55         ` Christoph Hellwig
  0 siblings, 0 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:55 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Kanchan Joshi, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Hui Qi, Nitesh Shetty

On Wed, Oct 02, 2024 at 11:37:11AM -0700, Bart Van Assche wrote:
> Is the description of this patch correct? The above description suggests
> that FDP information is similar in nature to SCSI data lifetime
> information. I don't think that's correct.

As you correctly point out, it is entirely incorrect.



* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 12:55                     ` Christoph Hellwig
@ 2024-10-03 21:48                       ` Keith Busch
  2024-10-03 22:00                         ` Bart Van Assche
  2024-10-04  6:21                       ` Javier González
  1 sibling, 1 reply; 86+ messages in thread
From: Keith Busch @ 2024-10-03 21:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Martin K. Petersen, Jens Axboe, Kanchan Joshi,
	hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g, javier.gonz

On Thu, Oct 03, 2024 at 02:55:16PM +0200, Christoph Hellwig wrote:
> On Wed, Oct 02, 2024 at 11:34:47AM -0700, Bart Van Assche wrote:
> > Isn't FDP about communicating much more than only this information to
> > the block device, e.g. information about reclaim units? Although I'm
> > personally not interested in FDP, my colleagues were involved in the
> > standardization of FDP.
> 
> Yes, it is.  And when I explained how properly exporting this kind of
> information could be implemented on top of the version Kanchan sent,
> everyone suddenly stopped discussing technical points and went either
> silent or all political.

The nominals can mean whatever you want. If you want it to mean
"temperature", then that's what it means. If you want it to mean
something else, then don't use this.

These are hints at the end of the day. Nothing locks the kernel into
this if a better solution develops. As you know, it was ripped out
before.

> So I think some people's bonuses depend on not understanding the
> problem, I fear :(

The only "bonus" I have is not repeatedly explaining why people can't
use h/w features the way they want.


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 21:48                       ` Keith Busch
@ 2024-10-03 22:00                         ` Bart Van Assche
  2024-10-03 22:12                           ` Jens Axboe
  2024-10-03 22:17                           ` Keith Busch
  0 siblings, 2 replies; 86+ messages in thread
From: Bart Van Assche @ 2024-10-03 22:00 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: Martin K. Petersen, Jens Axboe, Kanchan Joshi, hare, sagi,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On 10/3/24 2:48 PM, Keith Busch wrote:
> The only "bonus" I have is not repeatedly explaining why people can't
> use h/w features the way they want.

Hi Keith,

Although that's a fair argument, what are the use cases for this patch
series? Filesystems in the kernel? Filesystems implemented in user
space? Perhaps something else?

This patch series adds a new user space interface for passing hints
to storage devices (in io_uring). As we all know such interfaces are
hard to remove once these have been added.

We don't need new user space interfaces to support FDP for filesystems
in the kernel.

For filesystems implemented in user space, would using NVMe pass-through
be a viable approach? With this approach, no new user space interfaces
have to be added.

I'm wondering how to unblock FDP users without adding a new
controversial mechanism in the kernel.

Thanks,

Bart.



* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 22:00                         ` Bart Van Assche
@ 2024-10-03 22:12                           ` Jens Axboe
  2024-10-03 22:17                           ` Keith Busch
  1 sibling, 0 replies; 86+ messages in thread
From: Jens Axboe @ 2024-10-03 22:12 UTC (permalink / raw)
  To: Bart Van Assche, Keith Busch, Christoph Hellwig
  Cc: Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz

On 10/3/24 4:00 PM, Bart Van Assche wrote:
> On 10/3/24 2:48 PM, Keith Busch wrote:
>> The only "bonus" I have is not repeatedly explaining why people can't
>> use h/w features the way they want.
> 
> Hi Keith,
> 
> Although that's a fair argument, what are the use cases for this patch
> series? Filesystems in the kernel? Filesystems implemented in user
> space? Perhaps something else?
> 
> This patch series adds a new user space interface for passing hints
> to storage devices (in io_uring). As we all know such interfaces are
> hard to remove once these have been added.

It's a _hint_; I'm not sure how many times that has to be stated. The
kernel is free to ignore it, and in the future, it may very well do
that. We already had fcntl interfaces for streams, in the same vein, and
we yanked those in the sense that they ended up doing _nothing_. Did
things break because of that? Of course not. This is no different.

> We don't need new user space interfaces to support FDP for filesystems
> in the kernel.
> 
> For filesystems implemented in user space, would using NVMe pass-through
> be a viable approach? With this approach, no new user space interfaces
> have to be added.
> 
> I'm wondering how to unblock FDP users without adding a new
> controversial mechanism in the kernel.

pass-through already works, obviously - this is more about adding a
viable real interface for it. If it's a feature that is in devices AND
customers want to use, then pointing them at pass-through is a cop-out.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 12:54                   ` Christoph Hellwig
@ 2024-10-03 22:14                     ` Jens Axboe
  2024-10-04  5:31                       ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Jens Axboe @ 2024-10-03 22:14 UTC (permalink / raw)
  To: Christoph Hellwig, Martin K. Petersen
  Cc: Keith Busch, Kanchan Joshi, hare, sagi, brauner, viro, jack,
	jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz

On 10/3/24 6:54 AM, Christoph Hellwig wrote:
> For file: yes.  The problem is when you have more files than buckets on
> the device or file systems.  Typical enterprise SSDs support somewhere
> between 8 and 16 write streams, and there typically is more data than
> that.  So trying to group it somehow is a good idea, as not all files can
> have their own bucket.
> 
> Allowing this inside a file, as done in this patch set, on the other
> hand is pretty crazy.

I do agree that per-file hints are not ideal. In the spirit of making
some progress, how about we just retain per-io hints initially? We can
certainly make that work over dio. Yes buffered IO won't work initially,
but at least we're getting somewhere.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 22:00                         ` Bart Van Assche
  2024-10-03 22:12                           ` Jens Axboe
@ 2024-10-03 22:17                           ` Keith Busch
  1 sibling, 0 replies; 86+ messages in thread
From: Keith Busch @ 2024-10-03 22:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Martin K. Petersen, Jens Axboe, Kanchan Joshi,
	hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g, javier.gonz

On Thu, Oct 03, 2024 at 03:00:21PM -0700, Bart Van Assche wrote:
> On 10/3/24 2:48 PM, Keith Busch wrote:
> > The only "bonus" I have is not repeatedly explaining why people can't
> > use h/w features the way they want.
> 
> Hi Keith,
> 
> Although that's a fair argument, what are the use cases for this patch
> series? Filesystems in the kernel? Filesystems implemented in user
> space? Perhaps something else?
> 
> This patch series adds a new user space interface for passing hints
> to storage devices (in io_uring). As we all know, such interfaces are
> hard to remove once they have been added.
> 
> We don't need new user space interfaces to support FDP for filesystems
> in the kernel.
> 
> For filesystems implemented in user space, would using NVMe pass-through
> be a viable approach? With this approach, no new user space interfaces
> have to be added.
> 
> I'm wondering how to unblock FDP users without adding a new
> controversial mechanism in the kernel.

I really like the passthrough interface. This is flexible and enables
experimenting with all sorts of features.

But it exposes exploits and pitfalls. Just check the CVE's against it
(ex: CVE-2023-6238, note: same problem exists without metadata). The
consequences of misuse include memory corruption and private memory
leakage. The barriers for exploits are too low, and the potential for
mistakes is too high. A possible h/w solution requires an IOMMU, but
that's a non-starter for many users and only solves the private memory
vulnerabilities.

Passthrough also skips all the block layer features. I just today added
statistics to it, staged for 6.13, but even that requires yet another
config step to enable it.

tl;dr: no one wants to use nvme passthrough in production long term.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 22:14                     ` Jens Axboe
@ 2024-10-04  5:31                       ` Christoph Hellwig
  2024-10-04  6:18                         ` Javier González
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-04  5:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Martin K. Petersen, Keith Busch, Kanchan Joshi,
	hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Thu, Oct 03, 2024 at 04:14:57PM -0600, Jens Axboe wrote:
> On 10/3/24 6:54 AM, Christoph Hellwig wrote:
> > For file: yes.  The problem is when you have more files than buckets on
> > the device or file systems.  Typical enterprise SSDs support somewhere
> > between 8 and 16 write streams, and there typically is more data than
> > that.  So trying to group it somehow is a good idea, as not all files can
> > have their own bucket.
> > 
> > Allowing this inside a file, as done in this patch set, on the other
> > hand is pretty crazy.
> 
> I do agree that per-file hints are not ideal. In the spirit of making
> some progress, how about we just retain per-io hints initially? We can
> certainly make that work over dio. Yes buffered IO won't work initially,
> but at least we're getting somewhere.

Huh?  Per I/O hints at the syscall level are the problem (see also the
reply from Martin).  Per-file makes total sense, but we need the file
system in control.

The real problem is further down the stack.  For the SCSI temperature
hints, just passing them on makes sense.  But when you map to some kind
of stream separation in the device, no matter if that is streams, FDP,
or various kinds of streams we don't even support in things like CF
and SD cards, the driver is not the right place to map temperature hints
to streams.  That requires some kind of intelligence.  It could be
dirt simple and just do a best-effort mapping of the temperature
hints 1:1 to separate write streams, or do a little mapping if there
are not enough of them, which should work fine for a raw block device.

But once we have a file system, things get more complicated:

 - the file system will want its own streams for metadata and GC
 - even with that, on beefy enough hardware you can have more streams
   than temperature levels, and the file system can and should
   do intelligent placement (usually based on files)

Or to summarize:  the per-file temperature hints make sense as a user
interface.  Per-I/O hints tend to be really messy at least if a file
system is involved.  Placing the temperatures to separate write streams
in the driver does not scale even to the most trivial write stream
aware file system implementations.

And for anyone who followed the previous discussions of the patches,
none of this should be new; each point has been made at least three
times before.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  5:31                       ` Christoph Hellwig
@ 2024-10-04  6:18                         ` Javier González
  2024-10-04  6:27                           ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Javier González @ 2024-10-04  6:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, hare,
	sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g

On 04.10.2024 07:31, Christoph Hellwig wrote:
>On Thu, Oct 03, 2024 at 04:14:57PM -0600, Jens Axboe wrote:
>> On 10/3/24 6:54 AM, Christoph Hellwig wrote:
>> > For file: yes.  The problem is when you have more files than buckets on
>> > the device or file systems.  Typical enterprise SSDs support somewhere
>> > between 8 and 16 write streams, and there typically is more data than
>> > that.  So trying to group it somehow is a good idea, as not all files can
>> > have their own bucket.
>> >
>> > Allowing this inside a file, as done in this patch set, on the other
>> > hand is pretty crazy.
>>
>> I do agree that per-file hints are not ideal. In the spirit of making
>> some progress, how about we just retain per-io hints initially? We can
>> certainly make that work over dio. Yes buffered IO won't work initially,
>> but at least we're getting somewhere.
>
>Huh?  Per I/O hints at the syscall level are the problem (see also the
>reply from Martin).  Per-file makes total sense, but we need the file
>system in control.
>
>The real problem is further down the stack.  For the SCSI temperature
>hints, just passing them on makes sense.  But when you map to some kind
>of stream separation in the device, no matter if that is streams, FDP,
>or various kinds of streams we don't even support in things like CF
>and SD cards, the driver is not the right place to map temperature hints
>to streams.  That requires some kind of intelligence.  It could be
>dirt simple and just do a best-effort mapping of the temperature
>hints 1:1 to separate write streams, or do a little mapping if there
>are not enough of them, which should work fine for a raw block device.
>
>But once we have a file system, things get more complicated:
>
> - the file system will want its own streams for metadata and GC
> - even with that, on beefy enough hardware you can have more streams
>   than temperature levels, and the file system can and should
>   do intelligent placement (usually based on files)
>
>Or to summarize:  the per-file temperature hints make sense as a user
>interface.  Per-I/O hints tend to be really messy at least if a file
>system is involved.  Placing the temperatures to separate write streams
>in the driver does not scale even to the most trivial write stream
>aware file system implementations.
>And for anyone who followed the previous discussions of the patches,
>none of this should be new; each point has been made at least three
>times before.

Looking at the work you and Hans have been doing on XFS, it seems you
have been successful at mapping the temperature semantics to zones
(which have no semantics of their own, just like FDP).

    What is the difference between the mapping in zones and for FDP?

The whole point is using an existing interface to cover the use-case of
people wanting hints in block. If you have a use-case for enabling hints
on files, let's work on that afterwards.



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 12:55                     ` Christoph Hellwig
  2024-10-03 21:48                       ` Keith Busch
@ 2024-10-04  6:21                       ` Javier González
  2024-10-04  6:24                         ` Christoph Hellwig
  1 sibling, 1 reply; 86+ messages in thread
From: Javier González @ 2024-10-04  6:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Martin K. Petersen, Keith Busch, Jens Axboe,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g

On 03.10.2024 14:55, Christoph Hellwig wrote:
>On Wed, Oct 02, 2024 at 11:34:47AM -0700, Bart Van Assche wrote:
>> Isn't FDP about communicating much more than only this information to
>> the block device, e.g. information about reclaim units? Although I'm
>> personally not interested in FDP, my colleagues were involved in the
>> standardization of FDP.
>
>Yes, it is.  And when I explained how proper export of this kind of
>information could be implemented on top of the version Kanchan sent,
>everyone suddenly stopped discussing technical points and went either
>silent or all political.
>
>So I think some people's bonuses depend on not understanding the problem,
>I fear :(
>

Please, don't.

Childish comments like this delegitimize the work that a lot of people
are doing in Linux.

We all operate under the assumption that folks here know how to wear two
hats, and that there is no such thing as incentives so specific that they
push _or_ block upstream contributions without a use-case and technical
background.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  6:21                       ` Javier González
@ 2024-10-04  6:24                         ` Christoph Hellwig
  2024-10-04  6:59                           ` Javier González
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-04  6:24 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Bart Van Assche, Martin K. Petersen,
	Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g

On Fri, Oct 04, 2024 at 08:21:29AM +0200, Javier González wrote:
>> So I think some people's bonuses depend on not understanding the problem,
>> I fear :(
>>
>
> Please, don't.
>
> Childish comments like this delegitimize the work that a lot of people
> are doing in Linux.

It's the only very sarcastic explanation I can come up with to explain
this discussion, where people from exactly the two companies where this
might be bonus material love political discussion and drop dead when it
turns technical.

I'm probably wrong, but I've run out of other explanations.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  6:18                         ` Javier González
@ 2024-10-04  6:27                           ` Christoph Hellwig
  2024-10-04  6:52                             ` Javier González
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-04  6:27 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel,
	io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Fri, Oct 04, 2024 at 08:18:11AM +0200, Javier González wrote:
>> And for anyone who followed the previous discussions of the patches,
>> none of this should be new; each point has been made at least three
>> times before.
>
> Looking at the work you and Hans have been doing on XFS, it seems you
> have been successful at mapping the temperature semantics to zones
> (which have no semantics of their own, just like FDP).
>
>    What is the difference between the mapping in zones and for FDP?

Probably not much, except for all the pitfalls in FDP's not-quite-hint,
not-quite-mandatory design.

> The whole point is using an existing interface to cover the use-case of
> people wanting hints in block.

And that's fine.  And the point, all the way back for months, is that
doing a completely dumb mapping in the driver for that is fundamentally
wrong.  Kanchan's previous series did about 50% of the work to get
rid of it, but then everyone dropped dead and played politics.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  6:27                           ` Christoph Hellwig
@ 2024-10-04  6:52                             ` Javier González
  2024-10-04 12:30                               ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Javier González @ 2024-10-04  6:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, hare,
	sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g

On 04.10.2024 08:27, Christoph Hellwig wrote:
>On Fri, Oct 04, 2024 at 08:18:11AM +0200, Javier González wrote:
>>> And for anyone who followed the previous discussions of the patches,
>>> none of this should be new; each point has been made at least three
>>> times before.
>>
>> Looking at the work you and Hans have been doing on XFS, it seems you
>> have been successful at mapping the temperature semantics to zones
>> (which have no semantics of their own, just like FDP).
>>
>>    What is the difference between the mapping in zones and for FDP?
>
>Probably not much, except for all the pitfalls in FDP's not-quite-hint,
>not-quite-mandatory design.

It is too late to change the first version of FDP products, and the
solution of "go back to NVMe and make it match SCSI" is not realistic.

We have drives from several companies out there supporting this version
of FDP. And we have customers that are actively using them. It is our
collective responsibility to support them.

So, considering that file systems _are_ able to use temperature hints and
actually make them work, why don't we support FDP the same way we are
supporting zones, so that people can use it in production?

I agree that down the road, an interface that allows hints (many more
than 5!) is needed. And in my opinion, this interface should not have
semantics attached to it, just a hint ID, #hints, and enough space to
put 100s of them to support storage node deployments. But this needs to
come from the users of the hints / zones / streams / etc., not from
us vendors. We have neither details on how they deploy these
features at scale, nor the workloads to validate the results. Anything
else will probably just continue polluting the storage stack with more
interfaces that are not used and add to the problem of data placement
fragmentation.

>
>> The whole point is using an existing interface to cover the use-case of
>> people wanting hints in block.
>
>And that's fine.  And that point all the way back for month is that
>doing a complete dumb mapping in the driver for that is fundamentally
>wrong.  Kanchan's previous series did about 50% of the work to get
>it rid, but then everyone dropped dead and played politics.

The issue is that the first series of this patch, which is as simple as
it gets, hit the list in May. Since then we have gone down paths that lead
nowhere. So the line between real technical feedback that leads to
a feature being merged, and technical misdirection that keeps people
busy, becomes very thin. In the whole data placement effort, we have
been down this path many times, unfortunately...

At LPC, we discussed the work you did in XFS, and it became clear
that coming back to the first version of the patches was the easiest way
to support the use-case that current customers have. It is a pity you
did not attend the conference and could not make a point against this
line of thought there.

So, summarizing, all that folks in this thread are asking for is a path
forward for solving the block use-case, which is the only one that
I am aware of requiring kernel support. This has been validated with
real workloads, so it is very specific and tangible.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  6:24                         ` Christoph Hellwig
@ 2024-10-04  6:59                           ` Javier González
  2024-10-04 12:32                             ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Javier González @ 2024-10-04  6:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Martin K. Petersen, Keith Busch, Jens Axboe,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g

On 04.10.2024 08:24, Christoph Hellwig wrote:
>On Fri, Oct 04, 2024 at 08:21:29AM +0200, Javier González wrote:
>>> So I think some people's bonuses depend on not understanding the problem,
>>> I fear :(
>>>
>>
>> Please, don't.
>>
>> Childish comments like this delegitimize the work that a lot of people
>> are doing in Linux.
>
>It's the only very sarcastic explanation I can come up with to explain
>this discussion, where people from exactly the two companies where this
>might be bonus material love political discussion and drop dead when it
>turns technical.

FDP has authors from Meta, Google, Kioxia, Micron, Hynix, Solidigm,
Microchip, Marvell, FADU, WDC, and Samsung.

The fact that 2 of these companies are the ones starting to build the
Linux ecosystem should not surprise you, as it is the way things work
normally.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  6:52                             ` Javier González
@ 2024-10-04 12:30                               ` Christoph Hellwig
  2024-10-07 10:10                                 ` Javier González
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-04 12:30 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel,
	io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
> So, considering that file systems _are_ able to use temperature hints and
> actually make them work, why don't we support FDP the same way we are
> supporting zones, so that people can use it in production?

Because apparently no one has tried it.  It should be possible in theory,
but for example unless you have a power-of-two reclaim unit size it
won't work very well with XFS, where the AGs/RTGs must be power-of-two
aligned in the LBA space, except by overprovisioning the LBA space vs
the capacity actually used.

> I agree that down the road, an interface that allows hints (many more
> than 5!) is needed. And in my opinion, this interface should not have
> semantics attached to it, just a hint ID, #hints, and enough space to
> put 100s of them to support storage node deployments. But this needs to
> come from the users of the hints / zones / streams / etc., not from
> us vendors. We have neither details on how they deploy these
> features at scale, nor the workloads to validate the results. Anything
> else will probably just continue polluting the storage stack with more
> interfaces that are not used and add to the problem of data placement
> fragmentation.

Please always mention which layer you are talking about.  At the syscall
level the temperature hints are doing quite ok.  A full stream separation
would obviously be a lot better, as would be communicating the zone /
reclaim unit / etc. size.

As an interface to a driver that doesn't natively speak temperature
hints, on the other hand, it doesn't work at all.

> The issue is that the first series of this patch, which is as simple as
> it gets, hit the list in May. Since then we have gone down paths that lead
> nowhere. So the line between real technical feedback that leads to
> a feature being merged, and technical misdirection that keeps people
> busy, becomes very thin. In the whole data placement effort, we have
> been down this path many times, unfortunately...

Well, the previous round was the first one actually trying to address the
fundamental issue after 4 months.  And then after a first round of feedback
it gets shut down somehow out of nowhere.  As a maintainer and reviewer,
that is the kind of contribution I have a hard time taking seriously.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  6:59                           ` Javier González
@ 2024-10-04 12:32                             ` Christoph Hellwig
  2024-10-07 11:29                               ` Javier González
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-04 12:32 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Bart Van Assche, Martin K. Petersen,
	Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g

On Fri, Oct 04, 2024 at 08:59:23AM +0200, Javier González wrote:
> FDP has authors from Meta, Google, Kioxia, Micron, Hynix, Solidigm,
> Microship, Marvell, FADU, WDC, and Samsung.
>
> The fact that 2 of these companies are the ones starting to build the
> Linux ecosystem should not surprise you, as it is the way things work
> normally.

That's not the point.  There is one company that drives entirely pointless
marketing BS, and that one is pretty central here.  The same company
that said FDP has absolutely no intent to work on Linux and fought my
initial attempt to make the protocol not totally unusable in a layered
system.  And no, that's not Samsung.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04 12:30                               ` Christoph Hellwig
@ 2024-10-07 10:10                                 ` Javier González
  2024-10-08 10:06                                   ` Hans Holmberg
  2024-10-08 12:25                                   ` Christoph Hellwig
  0 siblings, 2 replies; 86+ messages in thread
From: Javier González @ 2024-10-07 10:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, hare,
	sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g

On 04.10.2024 14:30, Christoph Hellwig wrote:
>On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
>> So, considering that file systems _are_ able to use temperature hints and
>> actually make them work, why don't we support FDP the same way we are
>> supporting zones, so that people can use it in production?
>
>Because apparently no one has tried it.  It should be possible in theory,
>but for example unless you have a power-of-two reclaim unit size it
>won't work very well with XFS, where the AGs/RTGs must be power-of-two
>aligned in the LBA space, except by overprovisioning the LBA space vs
>the capacity actually used.

This is good. I think we should have at least a FS POC with data
placement support to be able to draw conclusions on what the interface
and requirements should look like. Until we have that, we can support the
use-cases that we know customers are asking for, i.e., block-level hints
through the existing temperature API.

>
>> I agree that down the road, an interface that allows hints (many more
>> than 5!) is needed. And in my opinion, this interface should not have
>> semantics attached to it, just a hint ID, #hints, and enough space to
>> put 100s of them to support storage node deployments. But this needs to
>> come from the users of the hints / zones / streams / etc., not from
>> us vendors. We have neither details on how they deploy these
>> features at scale, nor the workloads to validate the results. Anything
>> else will probably just continue polluting the storage stack with more
>> interfaces that are not used and add to the problem of data placement
>> fragmentation.
>
>Please always mention which layer you are talking about.  At the syscall
>level the temperature hints are doing quite ok.  A full stream separation
>would obviously be a lot better, as would be communicating the zone /
>reclaim unit / etc. size.

I mean at the syscall level. But as mentioned above, we need to be very
sure that we have a clear use-case for that. If we continue seeing hints
being used in NVMe or other protocols, and the number increases
significantly, we can deal with it later on.

>
>As an interface to a driver that doesn't natively speak temperature
>hints, on the other hand, it doesn't work at all.
>
>> The issue is that the first series of this patch, which is as simple as
>> it gets, hit the list in May. Since then we have gone down paths that lead
>> nowhere. So the line between real technical feedback that leads to
>> a feature being merged, and technical misdirection that keeps people
>> busy, becomes very thin. In the whole data placement effort, we have
>> been down this path many times, unfortunately...
>
>Well, the previous round was the first one actually trying to address the
>fundamental issue after 4 months.  And then after a first round of feedback
>it gets shut down somehow out of nowhere.  As a maintainer and reviewer,
>that is the kind of contribution I have a hard time taking seriously.

I am not sure I understand what you mean in the last sentence, so I will
not respond filling blanks with a bad interpretation.

In summary, what we are asking for is to take the patches that cover the
current use-case, and work together on what might be needed for better
FS support. For this, it seems you and Hans have a good idea of what you
want to have based on XFS. We can help review or do part of the work,
but trying to guess our way will only delay existing customers using
existing HW.



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04 12:32                             ` Christoph Hellwig
@ 2024-10-07 11:29                               ` Javier González
  2024-10-08 12:27                                 ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Javier González @ 2024-10-07 11:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Martin K. Petersen, Keith Busch, Jens Axboe,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g

On 04.10.2024 14:32, Christoph Hellwig wrote:
>On Fri, Oct 04, 2024 at 08:59:23AM +0200, Javier González wrote:
>> FDP has authors from Meta, Google, Kioxia, Micron, Hynix, Solidigm,
>> Microship, Marvell, FADU, WDC, and Samsung.
>>
>> The fact that 2 of these companies are the ones starting to build the
>> Linux ecosystem should not surprise you, as it is the way things work
>> normally.
>
>That's not the point.  There is one company that drives entirely pointless
>marketing BS, and that one is pretty central here.  The same company
>that said FDP has absolutely no intent to work on Linux and fought my
>initial attempt to make the protocol not totally unusable in a layered
>system.  And no, that's not Samsung.

So you had an interaction in the working group, your feedback was not
taken into consideration by the authors, and the result is that FDP
cannot be supported in Linux as a consequence of that? Come on...

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-07 10:10                                 ` Javier González
@ 2024-10-08 10:06                                   ` Hans Holmberg
  2024-10-09 14:36                                     ` Javier Gonzalez
  2024-10-08 12:25                                   ` Christoph Hellwig
  1 sibling, 1 reply; 86+ messages in thread
From: Hans Holmberg @ 2024-10-08 10:06 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel,
	io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Mon, Oct 7, 2024 at 12:10 PM Javier González <[email protected]> wrote:
>
> On 04.10.2024 14:30, Christoph Hellwig wrote:
> >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
> >> So, considering that file systems _are_ able to use temperature hints and
> >> actually make them work, why don't we support FDP the same way we are
> >> supporting zones, so that people can use it in production?
> >
> >Because apparently no one has tried it.  It should be possible in theory,
> >but for example unless you have power of 2 reclaim unit size size it
> >won't work very well with XFS where the AGs/RTGs must be power of two
> >aligned in the LBA space, except by overprovisioning the LBA space vs
> >the capacity actually used.
>
> This is good. I think we should have at least a FS POC with data
> placement support to be able to drive conclusions on how the interface
> and requirements should be. Until we have that, we can support the
> use-cases that we know customers are asking for, i.e., block-level hints
> through the existing temperature API.
>
> >
> >> I agree that down the road, an interface that allows hints (many more
> >> than 5!) is needed. And in my opinion, this interface should not have
> >> semintics attached to it, just a hint ID, #hints, and enough space to
> >> put 100s of them to support storage node deployments. But this needs to
> >> come from the users of the hints / zones / streams / etc,  not from
> >> us vendors. We do not have neither details on how they deploy these
> >> features at scale, nor the workloads to validate the results. Anything
> >> else will probably just continue polluting the storage stack with more
> >> interfaces that are not used and add to the problem of data placement
> >> fragmentation.
> >
> >Please always mentioned what layer you are talking about.  At the syscall
> >level the temperatur hints are doing quite ok.  A full stream separation
> >would obviously be a lot better, as would be communicating the zone /
> >reclaim unit / etc size.
>
> I mean at the syscall level. But as mentioned above, we need to be very
> sure that we have a clear use-case for that. If we continue seeing hints
> being use in NVMe or other protocols, and the number increase
> significantly, we can deal with it later on.
>
> >
> >As an interface to a driver that doesn't natively speak temperature
> >hint on the other hand it doesn't work at all.
> >
> >> The issue is that the first series of this patch, which is as simple as
> >> it gets, hit the list in May. Since then we are down paths that lead
> >> nowhere. So the line between real technical feedback that leads to
> >> a feature being merged, and technical misleading to make people be a
> >> busy bee becomes very thin. In the whole data placement effort, we have
> >> been down this path many times, unfortunately...
> >
> >Well, the previous round was the first one actually trying to address the
> >fundamental issue after 4 month.  And then after a first round of feedback
> >it gets shutdown somehow out of nowhere.  As a maintainer and review that
> >is the kinda of contributors I have a hard time taking serious.
>
> I am not sure I understand what you mean in the last sentece, so I will
> not respond filling blanks with a bad interpretation.
>
> In summary, what we are asking for is to take the patches that cover the
> current use-case, and work together on what might be needed for better
> FS support. For this, it seems you and Hans have a good idea of what you
> want to have based on XFS. We can help review or do part of the work,
> but trying to guess our way will only delay existing customers using
> existing HW.

After reading the whole thread, I end up wondering why we need to rush the
support for a single use case through instead of putting the architecture
in place for properly supporting this new type of hardware from the start
throughout the stack.

Even for user space consumers of raw block devices, is the last version
of the patch set good enough?

* It severely cripples the data separation capabilities, as only a handful of
  data placement buckets are supported

* It just won't work if there is more than one user application per storage
  device, as different applications' data streams will be mixed at the nvme
  driver level.

While Christoph has already outlined what would be desirable from a
file system point of view, I don't have the answer to what would be the overall
best design for FDP. I would like to say that it looks to me like we need to
consider more than the early adoption use cases and make sure we
make the most of the hardware capabilities via logical abstractions that
would be compatible with a wider range of storage devices.

Figuring out the right way forward is tricky, but why not just let it take the time
that is needed to sort this out while early users explore how to use FDP
drives and share the results?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-07 10:10                                 ` Javier González
  2024-10-08 10:06                                   ` Hans Holmberg
@ 2024-10-08 12:25                                   ` Christoph Hellwig
  2024-10-08 14:44                                     ` Keith Busch
  1 sibling, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-08 12:25 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel,
	io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Mon, Oct 07, 2024 at 12:10:11PM +0200, Javier González wrote:
> In summary, what we are asking for is to take the patches that cover the
> current use-case, and work together on what might be needed for better
> FS support.

And I really do not think it is a good idea.  For one, it actually
works against the stated intent of the FDP spec.  Second, extending
the hints to per-I/O in the io_uring patch is actively breaking
the nice per-file I/O hint abstraction we have right now, and is
really unsuitable when actually used on a file and not just a block
device.  And if you are only on a block device, I think passthrough
of some form is still the far better option, despite the problems
with it mentioned by Keith.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-07 11:29                               ` Javier González
@ 2024-10-08 12:27                                 ` Christoph Hellwig
  0 siblings, 0 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-08 12:27 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Bart Van Assche, Martin K. Petersen,
	Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g

On Mon, Oct 07, 2024 at 01:29:31PM +0200, Javier González wrote:
>> That's not the point.  There is one company that drives entirely pointless
>> marketing BS, and that one is pretty central here.  The same company
>> that said FDP has absolutely no intent to work on Linux and fought my
>> initial attempt to make the protocol not totally unusable on any layered system.
>> And no, that's not Samsung.
>
> So you had an interaction in the working group, your feedback was not
> taken into consideration by the authors, and the result is that FDP
> cannot be supported in Linux as a consequence of that? Come on...

No, what I am saying is that the "small" FDP group that was doing the
development while keeping it away from the larger group insisted that
FDP is only for use in userspace drivers, and even getting the basics
in to properly make it suitable for a semi-multi-tenant setup like
Linux io_uring passthrough was not welcome and was actually fought tooth
and nail by one particular company.  It took a long talk with the head
of the NVMe board to even get this sorted out.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-08 12:25                                   ` Christoph Hellwig
@ 2024-10-08 14:44                                     ` Keith Busch
  2024-10-09  9:28                                       ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Keith Busch @ 2024-10-08 14:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Javier González, Jens Axboe, Martin K. Petersen,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel,
	io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Tue, Oct 08, 2024 at 02:25:35PM +0200, Christoph Hellwig wrote:
> On Mon, Oct 07, 2024 at 12:10:11PM +0200, Javier González wrote:
> > In summary, what we are asking for is to take the patches that cover the
> > current use-case, and work together on what might be needed for better
> > FS support.
> 
> And I really do not think it is a good idea.  For one, it actually
> works against the stated intent of the FDP spec.  Second, extending
> the hints to per-I/O in the io_uring patch is actively breaking
> the nice per-file I/O hint abstraction we have right now, and is
> really unsuitable when actually used on a file and not just a block
> device.  And if you are only on a block device, I think passthrough
> of some form is still the far better option, despite the problems
> with it mentioned by Keith.

Then let's just continue with patches 1 and 2. They introduce no new
user or kernel APIs, and people have already reported improvements using
it. Further, it is just a hint; it doesn't lock the kernel into anything
that may hinder future innovations and enhancements. So let's unblock
users and refocus *our* time on something more productive, please?

And to be honest, the per-io hints for generic read/write use are only
valuable for my users if metadata is also exposed to userspace. I know
Javier's team is working on that in parallel, so per-io hints are a
lower priority for me until that part settles.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-08 14:44                                     ` Keith Busch
@ 2024-10-09  9:28                                       ` Christoph Hellwig
  2024-10-09 15:06                                         ` Keith Busch
  2024-10-09 16:28                                         ` Nitesh Shetty
  0 siblings, 2 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-09  9:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Javier González, Jens Axboe,
	Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g

On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote:
> Then let's just continue with patches 1 and 2. They introduce no new
> user or kernel APIs, and people have already reported improvements using
> it.

They are still not in any way actually exposing the FDP functionality
in the standard though.  How is your application going to align
anything to the reclaim unit?  Or is this another of the cases where
as a hyperscaler you just "know" from the data sheet?

But also, given that the submitter completely disappeared and refuses
to even discuss his patches, I think they are simply abandonware at
this point anyway.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* RE: [PATCH v7 0/3] FDP and per-io hints
  2024-10-08 10:06                                   ` Hans Holmberg
@ 2024-10-09 14:36                                     ` Javier Gonzalez
  2024-10-10  6:40                                       ` Hans Holmberg
  0 siblings, 1 reply; 86+ messages in thread
From: Javier Gonzalez @ 2024-10-09 14:36 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch,
	Kanchan Joshi, [email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected]



> -----Original Message-----
> From: Hans Holmberg <[email protected]>
> Sent: Tuesday, October 8, 2024 12:07 PM
> To: Javier Gonzalez <[email protected]>
> Cc: Christoph Hellwig <[email protected]>; Jens Axboe <[email protected]>; Martin K.
> Petersen <[email protected]>; Keith Busch <[email protected]>;
> Kanchan Joshi <[email protected]>; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH v7 0/3] FDP and per-io hints
> 
> On Mon, Oct 7, 2024 at 12:10 PM Javier González <[email protected]>
> wrote:
> >
> > On 04.10.2024 14:30, Christoph Hellwig wrote:
> > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
> > >> So, considerign that file system _are_ able to use temperature hints and
> > >> actually make them work, why don't we support FDP the same way we are
> > >> supporting zones so that people can use it in production?
> > >
> > >Because apparently no one has tried it.  It should be possible in theory,
> > >but for example unless you have power of 2 reclaim unit size size it
> > >won't work very well with XFS where the AGs/RTGs must be power of two
> > >aligned in the LBA space, except by overprovisioning the LBA space vs
> > >the capacity actually used.
> >
> > This is good. I think we should have at least a FS POC with data
> > placement support to be able to drive conclusions on how the interface
> > and requirements should be. Until we have that, we can support the
> > use-cases that we know customers are asking for, i.e., block-level hints
> > through the existing temperature API.
> >
> > >
> > >> I agree that down the road, an interface that allows hints (many more
> > >> than 5!) is needed. And in my opinion, this interface should not have
> > >> semintics attached to it, just a hint ID, #hints, and enough space to
> > >> put 100s of them to support storage node deployments. But this needs to
> > >> come from the users of the hints / zones / streams / etc,  not from
> > >> us vendors. We do not have neither details on how they deploy these
> > >> features at scale, nor the workloads to validate the results. Anything
> > >> else will probably just continue polluting the storage stack with more
> > >> interfaces that are not used and add to the problem of data placement
> > >> fragmentation.
> > >
> > >Please always mentioned what layer you are talking about.  At the syscall
> > >level the temperatur hints are doing quite ok.  A full stream separation
> > >would obviously be a lot better, as would be communicating the zone /
> > >reclaim unit / etc size.
> >
> > I mean at the syscall level. But as mentioned above, we need to be very
> > sure that we have a clear use-case for that. If we continue seeing hints
> > being use in NVMe or other protocols, and the number increase
> > significantly, we can deal with it later on.
> >
> > >
> > >As an interface to a driver that doesn't natively speak temperature
> > >hint on the other hand it doesn't work at all.
> > >
> > >> The issue is that the first series of this patch, which is as simple as
> > >> it gets, hit the list in May. Since then we are down paths that lead
> > >> nowhere. So the line between real technical feedback that leads to
> > >> a feature being merged, and technical misleading to make people be a
> > >> busy bee becomes very thin. In the whole data placement effort, we have
> > >> been down this path many times, unfortunately...
> > >
> > >Well, the previous round was the first one actually trying to address the
> > >fundamental issue after 4 month.  And then after a first round of feedback
> > >it gets shutdown somehow out of nowhere.  As a maintainer and review that
> > >is the kinda of contributors I have a hard time taking serious.
> >
> > I am not sure I understand what you mean in the last sentece, so I will
> > not respond filling blanks with a bad interpretation.
> >
> > In summary, what we are asking for is to take the patches that cover the
> > current use-case, and work together on what might be needed for better
> > FS support. For this, it seems you and Hans have a good idea of what you
> > want to have based on XFS. We can help review or do part of the work,
> > but trying to guess our way will only delay existing customers using
> > existing HW.
> 
> After reading the whole thread, I end up wondering why we need to rush the
> support for a single use case through instead of putting the architecture
> in place for properly supporting this new type of hardware from the start
> throughout the stack.

This is not a rush. We have been supporting this use case through passthru for
over half a year, with code already upstream in Cachelib. This is mature enough
to move into the block layer, which is what the end user wants to do either way.

This is, though, a very good point. This is why we upstreamed passthru at the
time: so people can experiment, validate, and upstream only when there is a
clear path.

> 
> Even for user space consumers of raw block devices, is the last version
> of the patch set good enough?
> 
> * It severely cripples the data separation capabilities as only a handful of
>   data placement buckets are supported

I could understand from your presentation at LPC, and later looking at the code that
is available, that you have been successful at getting good results with the existing
interface in XFS. The mapping from the temperature semantics to zones (no semantics)
is the exact same as we are doing with FDP. Not having to change either in-kernel or
user-space structures is great.

> 
> * It just won't work if there is more than one user application per storage
>   device, as different applications' data streams will be mixed at the nvme
>   driver level.

For now this use-case is not clear. Folks working on it are using passthru. When we
have a clearer understanding of what is needed, we might need changes in the kernel.

If you see a need for this on the work that you are doing, by all means, please send patches.
As I said at LPC, we can work together on this.

> 
> While Christoph has already outlined what would be desirable from a
> file system point of view, I don't have the answer to what would be the overall
> best design for FDP. I would like to say that it looks to me like we need to
> consider more than the early adoption use cases and make sure we
> make the most of the hardware capabilities via logical abstractions that
> would be compatible with a wider range of storage devices.
> 
> Figuring the right way forward is tricky, but why not just let it take the time
> that is needed to sort this out while early users explore how to use FDP
> drives and share the results?

I agree that we might need a new interface to support more hints, beyond the temperatures.
Or maybe not. We will not know until someone comes with a use case. We have made the
mistake in the past of treating internal research as upstreamable work. I can now see that
this simply complicates the in-kernel and user-space APIs.

The existing API is usable and requires no changes. There is hardware. There are customers. 
There are applications with upstream support which have been tested with passthru (the 
early results you mention). And the wiring to NVMe is _very_ simple. There is no reason 
not to take this in, and then we will see what new interfaces we might need in the future.

I would much rather spend time discussing ideas with you and others on a potential
future API than arguing about the validity of an _existing_ one.



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-09  9:28                                       ` Christoph Hellwig
@ 2024-10-09 15:06                                         ` Keith Busch
       [not found]                                           ` <CGME20241010070738eucas1p2057209e5f669f37ca586ad4a619289ed@eucas1p2.samsung.com>
  2024-10-10  9:10                                           ` Christoph Hellwig
  2024-10-09 16:28                                         ` Nitesh Shetty
  1 sibling, 2 replies; 86+ messages in thread
From: Keith Busch @ 2024-10-09 15:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Javier González, Jens Axboe, Martin K. Petersen,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel,
	io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Wed, Oct 09, 2024 at 11:28:28AM +0200, Christoph Hellwig wrote:
> On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote:
> > Then let's just continue with patches 1 and 2. They introduce no new
> > user or kernel APIs, and people have already reported improvements using
> > it.
> 
> They are still not in any way actually exposing the FDP functionality
> in the standard though.  How is your application going to align
> anything to the reclaim unit?  Or is this another of the cases where
> as a hyperscaler you just "know" from the data sheet?

As far as I know, this is an inconsequential spec detail that is not
being considered by any applications testing this. And yet, the expected
improvements are still there, so I don't see a point in holding this up for
that reason.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-09  9:28                                       ` Christoph Hellwig
  2024-10-09 15:06                                         ` Keith Busch
@ 2024-10-09 16:28                                         ` Nitesh Shetty
  1 sibling, 0 replies; 86+ messages in thread
From: Nitesh Shetty @ 2024-10-09 16:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Javier González, Jens Axboe, Martin K. Petersen,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel,
	io-uring, linux-block, linux-aio, gost.dev, vishak.g

[-- Attachment #1: Type: text/plain, Size: 1051 bytes --]

On 09/10/24 11:28AM, Christoph Hellwig wrote:
>On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote:
>> Then let's just continue with patches 1 and 2. They introduce no new
>> user or kernel APIs, and people have already reported improvements using
>> it.
>
>They are still not in any way actually exposing the FDP functionality
>in the standard though.  How is your application going to align
>anything to the reclaim unit?  Or is this another of the cases where
>as a hyperscaler you just "know" from the data sheet?

I think Keith already[1] mentioned a couple of times in previous replies that
this is an optional feature; if experiments don't give better
results, FDP hints can be disabled.

>
>But also, given that the submitter completely disappeared and refuses
>to even discuss his patches, I think they are simply abandonware at
>this point anyway.

Kanchan is away due to personal reasons, he is expected to be back in a
week or so.

Regards,
Nitesh Shetty

[1] https://lore.kernel.org/all/[email protected]/



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-09 14:36                                     ` Javier Gonzalez
@ 2024-10-10  6:40                                       ` Hans Holmberg
  2024-10-10  7:13                                         ` Javier Gonzalez
  0 siblings, 1 reply; 86+ messages in thread
From: Hans Holmberg @ 2024-10-10  6:40 UTC (permalink / raw)
  To: Javier Gonzalez
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch,
	Kanchan Joshi, [email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected]

On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <[email protected]> wrote:
>
>
>
> > -----Original Message-----
> > From: Hans Holmberg <[email protected]>
> > Sent: Tuesday, October 8, 2024 12:07 PM
> > To: Javier Gonzalez <[email protected]>
> > Cc: Christoph Hellwig <[email protected]>; Jens Axboe <[email protected]>; Martin K.
> > Petersen <[email protected]>; Keith Busch <[email protected]>;
> > Kanchan Joshi <[email protected]>; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected]
> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints
> >
> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <[email protected]>
> > wrote:
> > >
> > > On 04.10.2024 14:30, Christoph Hellwig wrote:
> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
> > > >> So, considerign that file system _are_ able to use temperature hints and
> > > >> actually make them work, why don't we support FDP the same way we are
> > > >> supporting zones so that people can use it in production?
> > > >
> > > >Because apparently no one has tried it.  It should be possible in theory,
> > > >but for example unless you have power of 2 reclaim unit size size it
> > > >won't work very well with XFS where the AGs/RTGs must be power of two
> > > >aligned in the LBA space, except by overprovisioning the LBA space vs
> > > >the capacity actually used.
> > >
> > > This is good. I think we should have at least a FS POC with data
> > > placement support to be able to drive conclusions on how the interface
> > > and requirements should be. Until we have that, we can support the
> > > use-cases that we know customers are asking for, i.e., block-level hints
> > > through the existing temperature API.
> > >
> > > >
> > > >> I agree that down the road, an interface that allows hints (many more
> > > >> than 5!) is needed. And in my opinion, this interface should not have
> > > >> semintics attached to it, just a hint ID, #hints, and enough space to
> > > >> put 100s of them to support storage node deployments. But this needs to
> > > >> come from the users of the hints / zones / streams / etc,  not from
> > > >> us vendors. We do not have neither details on how they deploy these
> > > >> features at scale, nor the workloads to validate the results. Anything
> > > >> else will probably just continue polluting the storage stack with more
> > > >> interfaces that are not used and add to the problem of data placement
> > > >> fragmentation.
> > > >
> > > >Please always mentioned what layer you are talking about.  At the syscall
> > > >level the temperatur hints are doing quite ok.  A full stream separation
> > > >would obviously be a lot better, as would be communicating the zone /
> > > >reclaim unit / etc size.
> > >
> > > I mean at the syscall level. But as mentioned above, we need to be very
> > > sure that we have a clear use-case for that. If we continue seeing hints
> > > being use in NVMe or other protocols, and the number increase
> > > significantly, we can deal with it later on.
> > >
> > > >
> > > >As an interface to a driver that doesn't natively speak temperature
> > > >hint on the other hand it doesn't work at all.
> > > >
> > > >> The issue is that the first series of this patch, which is as simple as
> > > >> it gets, hit the list in May. Since then we are down paths that lead
> > > >> nowhere. So the line between real technical feedback that leads to
> > > >> a feature being merged, and technical misleading to make people be a
> > > >> busy bee becomes very thin. In the whole data placement effort, we have
> > > >> been down this path many times, unfortunately...
> > > >
> > > >Well, the previous round was the first one actually trying to address the
> > > >fundamental issue after 4 month.  And then after a first round of feedback
> > > >it gets shutdown somehow out of nowhere.  As a maintainer and review that
> > > >is the kinda of contributors I have a hard time taking serious.
> > >
> > > I am not sure I understand what you mean in the last sentece, so I will
> > > not respond filling blanks with a bad interpretation.
> > >
> > > In summary, what we are asking for is to take the patches that cover the
> > > current use-case, and work together on what might be needed for better
> > > FS support. For this, it seems you and Hans have a good idea of what you
> > > want to have based on XFS. We can help review or do part of the work,
> > > but trying to guess our way will only delay existing customers using
> > > existing HW.
> >
> > After reading the whole thread, I end up wondering why we need to rush the
> > support for a single use case through instead of putting the architecture
> > in place for properly supporting this new type of hardware from the start
> > throughout the stack.
>
> This is not a rush. We have been supporting this use case through passthru for
> over half a year, with code already upstream in Cachelib. This is mature enough
> to move into the block layer, which is what the end user wants to do either way.
>
> This is though a very good point. This is why we upstreamed passthru at the
> time; so people can experiment, validate, and upstream only when there is a
> clear path.
>
> >
> > Even for user space consumers of raw block devices, is the last version
> > of the patch set good enough?
> >
> > * It severely cripples the data separation capabilities as only a handful of
> >   data placement buckets are supported
>
> I could understand from your presentation at LPC, and later looking at the code that
> is available, that you have been successful at getting good results with the existing
> interface in XFS. The mapping from the temperature semantics to zones (no semantics)
> is the exact same as we are doing with FDP. Not having to change either in-kernel or
> user-space structures is great.

No, we don't map data directly to zones using lifetime hints. In fact,
lifetime hints contribute only a relatively small part (~10% extra write
amp reduction, see the rocksdb benchmark results). Segregating data by
file is the most important part of the data placement heuristic, at
least for this type of workload.

>
> >
> > * It just won't work if there is more than one user application per storage
> >   device as different applications data streams will be mixed at the nvme
> >   driver level..
>
> For now this use-case is not clear. Folks working on it are using passthru. When we
> have a more clear understanding of what is needed, we might need changes in the kernel.
>
> If you see a need for this on the work that you are doing, by all means, please send patches.
> As I said at LPC, we can work together on this.
>
> >
> > While Christoph has already outlined what would be desirable from a
> > file system point of view, I don't have the answer to what would be the overall
> > best design for FDP. I would like to say that it looks to me like we need to
> > consider more than more than the early adoption use cases and make sure we
> > make the most of the hardware capabilities via logical abstractions that
> > would be compatible with a wider range of storage devices.
> >
> > Figuring the right way forward is tricky, but why not just let it take the time
> > that is needed to sort this out while early users explore how to use FDP
> > drives and share the results?
>
> I agree that we might need a new interface to support more hints, beyond the temperatures.
> Or maybe not. We would not know until someone comes with a use case. We have made the
> mistake in the past of treating internal research as upstreamable work. I can now see that
> this simply complicates the in-kernel and user-space APIs.
>
> The existing API is usable and requires no changes. There is hardware. There are customers.
> There are applications with upstream support which have been tested with passthru (the
> early results you mention). And the wiring to NVMe is _very_ simple. There is no reason
> not to take this in, and then we will see what new interfaces we might need in the future.
>
> I would much rather spend time in discussing ideas with you and others on a potential
> future API than arguing about the validity of an _existing_ one.
>

Yes, but while FDP support could be improved later on (happy to help if
that'll be the case), I'm just afraid that doing less work now on
defining the way data placement is exposed is going to result in a
bigger mess later, when more use cases are considered.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
       [not found]                                           ` <CGME20241010070738eucas1p2057209e5f669f37ca586ad4a619289ed@eucas1p2.samsung.com>
@ 2024-10-10  7:07                                             ` Javier González
  2024-10-10  9:13                                               ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Javier González @ 2024-10-10  7:07 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Kanchan Joshi,
	hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g

On 09.10.2024 09:06, Keith Busch wrote:
>On Wed, Oct 09, 2024 at 11:28:28AM +0200, Christoph Hellwig wrote:
>> On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote:
>> > Then let's just continue with patches 1 and 2. They introduce no new
>> > user or kernel APIs, and people have already reported improvements using
>> > it.
>>
>> They are still not any way actually exposing the FDP functionality
>> in the standard though.  How is your application going to align
>> anything to the reclaim unit?  Or is this another of the cases where
>> as a hyperscaler you just "know" from the data sheet?
>
>As far as I know, this is an inconsequential spec detail that is not
>being considered by any applications testing this. And yet, the expected
>improvements are still there, so I don't see a point in holding this up
>for that reason.

I am re-reading the thread to find common ground. I realize that my
last email came across sharper than I intended. I apologize for that, Christoph.

Let me summarize and propose something constructive so that we can move
forward.

It is clear that you have a lot of insight into how an FS API should
look in order to unify the different data placement alternatives across
protocols. I have not seen code for it, but based on the technical
feedback you are providing, it seems to be fairly clear in your mind.

I think we should attempt to pursue that with an example in mind. XFS
seems to be the clear candidate. You have already done work to enable
SMR HDDs; it seems we can get FDP under that umbrella. This will,
however, take time to get right. We can help with development, testing,
and experimental evaluation of the WAF benefits for such an interface.

However, this work should not block existing hardware enabling an
existing use-case. The current patches are not intrusive. They do not
make changes to the API and merely wire up what is there to the driver.
Anyone using temperatures will be able to use FDP - this is a win
without a maintenance burden attached to it. The change to the FS /
application API will not require major changes either; I believe we all
agree that we cannot remove the temperatures, so all existing
temperature users will be able to continue using them; we will just add
an alternative for power users on the side.

So the proposal is to merge the patches as they are and commit to work
together on a new, better API for in-kernel users (FS), and for
applications (syscall, uring).

Christoph, would this be a viable way to move forward for you?

Thanks,
Javier


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-10  6:40                                       ` Hans Holmberg
@ 2024-10-10  7:13                                         ` Javier Gonzalez
  2024-10-10  9:20                                           ` Christoph Hellwig
  2024-10-10 10:46                                           ` Hans Holmberg
  0 siblings, 2 replies; 86+ messages in thread
From: Javier Gonzalez @ 2024-10-10  7:13 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch,
	Kanchan Joshi, [email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected]

On 10.10.2024 08:40, Hans Holmberg wrote:
>On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <[email protected]> wrote:
>>
>>
>>
>> > -----Original Message-----
>> > From: Hans Holmberg <[email protected]>
>> > Sent: Tuesday, October 8, 2024 12:07 PM
>> > To: Javier Gonzalez <[email protected]>
>> > Cc: Christoph Hellwig <[email protected]>; Jens Axboe <[email protected]>; Martin K.
>> > Petersen <[email protected]>; Keith Busch <[email protected]>;
>> > Kanchan Joshi <[email protected]>; [email protected]; [email protected];
>> > [email protected]; [email protected]; [email protected]; [email protected];
>> > [email protected]; [email protected]; [email protected];
>> > [email protected]; [email protected]; linux-
>> > [email protected]; [email protected]; [email protected];
>> > [email protected]; [email protected]; [email protected]
>> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints
>> >
>> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <[email protected]>
>> > wrote:
>> > >
>> > > On 04.10.2024 14:30, Christoph Hellwig wrote:
>> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
>> > > >> So, considering that file systems _are_ able to use temperature hints and
>> > > >> actually make them work, why don't we support FDP the same way we are
>> > > >> supporting zones so that people can use it in production?
>> > > >
>> > > >Because apparently no one has tried it.  It should be possible in theory,
>> > > >but for example unless you have a power of 2 reclaim unit size it
>> > > >won't work very well with XFS where the AGs/RTGs must be power of two
>> > > >aligned in the LBA space, except by overprovisioning the LBA space vs
>> > > >the capacity actually used.
>> > >
>> > > This is good. I think we should have at least a FS POC with data
>> > > placement support to be able to drive conclusions on how the interface
>> > > and requirements should be. Until we have that, we can support the
>> > > use-cases that we know customers are asking for, i.e., block-level hints
>> > > through the existing temperature API.
>> > >
>> > > >
>> > > >> I agree that down the road, an interface that allows hints (many more
>> > > >> than 5!) is needed. And in my opinion, this interface should not have
>> > > >> semantics attached to it, just a hint ID, #hints, and enough space to
>> > > >> put 100s of them to support storage node deployments. But this needs to
>> > > >> come from the users of the hints / zones / streams / etc,  not from
>> > > >> us vendors. We have neither details on how they deploy these
>> > > >> features at scale, nor the workloads to validate the results. Anything
>> > > >> else will probably just continue polluting the storage stack with more
>> > > >> interfaces that are not used and add to the problem of data placement
>> > > >> fragmentation.
>> > > >
>> > > >Please always mention what layer you are talking about.  At the syscall
>> > > >level the temperature hints are doing quite ok.  A full stream separation
>> > > >would obviously be a lot better, as would be communicating the zone /
>> > > >reclaim unit / etc size.
>> > >
>> > > I mean at the syscall level. But as mentioned above, we need to be very
>> > > sure that we have a clear use-case for that. If we continue seeing hints
>> > > being used in NVMe or other protocols, and the number increases
>> > > significantly, we can deal with it later on.
>> > >
>> > > >
>> > > >As an interface to a driver that doesn't natively speak temperature
>> > > >hint on the other hand it doesn't work at all.
>> > > >
>> > > >> The issue is that the first series of this patch, which is as simple as
>> > > >> it gets, hit the list in May. Since then we have gone down paths that lead
>> > > >> nowhere. So the line between real technical feedback that leads to
>> > > >> a feature being merged, and technical misdirection that keeps people
>> > > >> busy, becomes very thin. In the whole data placement effort, we have
>> > > >> been down this path many times, unfortunately...
>> > > >
>> > > >Well, the previous round was the first one actually trying to address the
>> > > >fundamental issue after 4 months.  And then after a first round of feedback
>> > > >it gets shut down somehow out of nowhere.  As a maintainer and reviewer,
>> > > >that is the kind of contribution I have a hard time taking seriously.
>> > >
>> > > I am not sure I understand what you mean in the last sentence, so I will
>> > > not respond filling blanks with a bad interpretation.
>> > >
>> > > In summary, what we are asking for is to take the patches that cover the
>> > > current use-case, and work together on what might be needed for better
>> > > FS support. For this, it seems you and Hans have a good idea of what you
>> > > want to have based on XFS. We can help review or do part of the work,
>> > > but trying to guess our way will only delay existing customers using
>> > > existing HW.
>> >
>> > After reading the whole thread, I end up wondering why we need to rush the
>> > support for a single use case through instead of putting the architecture
>> > in place for properly supporting this new type of hardware from the start
>> > throughout the stack.
>>
>> This is not a rush. We have been supporting this use case through passthru for
>> over half a year, with code already upstream in Cachelib. This is mature enough
>> to move into the block layer, which is what the end user wants to do either way.
>>
>> This is though a very good point. This is why we upstreamed passthru at the
>> time; so people can experiment, validate, and upstream only when there is a
>> clear path.
>>
>> >
>> > Even for user space consumers of raw block devices, is the last version
>> > of the patch set good enough?
>> >
>> > * It severely cripples the data separation capabilities as only a handful of
>> >   data placement buckets are supported
>>
>> I could understand from your presentation at LPC, and later looking at the code
>> that is available, that you have been successful at getting good results with the
>> existing interface in XFS. The mapping from the temperature semantics to zones (no
>> semantics) is exactly the same as we are doing with FDP. Not having to change
>> either in-kernel or user-space structures is great.
>
>No, we don't map data directly to zones using lifetime hints. In fact,
>lifetime hints contribute only a relatively small part (~10% extra write
>amp reduction, see the rocksdb benchmark results). Segregating data by
>file is the most important part of the data placement heuristic, at
>least for this type of workload.

Is this because RocksDB already does segregation per file itself? Are
you doing something specific in XFS, or using your knowledge of RocksDB
to map files with an "unwritten" protocol in the middle?

    In this context, we have collected data both using FDP natively in
    RocksDB and using the temperatures. Both look very good, because both
    are initiated by RocksDB, and the FS just passes the hints directly
    to the driver.

I ask this to understand whether this is the FS's responsibility or the
application's. Our work points more to letting applications use the
hints (as the use-cases are power users, like RocksDB). I agree with you
that a FS could potentially make an improvement for legacy applications
- we have not focused much on these though, so I trust your insights on
it.

>>
>> >
>> > * It just won't work if there is more than one user application per storage
>> >   device as different applications data streams will be mixed at the nvme
>> >   driver level..
>>
>> For now this use-case is not clear. Folks working on it are using passthru. When we
>> have a clearer understanding of what is needed, we might need changes in the kernel.
>>
>> If you see a need for this on the work that you are doing, by all means, please send patches.
>> As I said at LPC, we can work together on this.
>>
>> >
>> > While Christoph has already outlined what would be desirable from a
>> > file system point of view, I don't have the answer to what would be the overall
>> > best design for FDP. I would like to say that it looks to me like we need to
>> > consider more than just the early adoption use cases and make sure we
>> > make the most of the hardware capabilities via logical abstractions that
>> > would be compatible with a wider range of storage devices.
>> >
>> > Figuring the right way forward is tricky, but why not just let it take the time
>> > that is needed to sort this out while early users explore how to use FDP
>> > drives and share the results?
>>
>> I agree that we might need a new interface to support more hints, beyond the temperatures.
>> Or maybe not. We would not know until someone comes with a use case. We have made the
>> mistake in the past of treating internal research as upstreamable work. I can now see that
>> this simply complicates the in-kernel and user-space APIs.
>>
>> The existing API is usable and requires no changes. There is hardware. There are customers.
>> There are applications with upstream support which have been tested with passthru (the
>> early results you mention). And the wiring to NVMe is _very_ simple. There is no reason
>> not to take this in, and then we will see what new interfaces we might need in the future.
>>
>> I would much rather spend time in discussing ideas with you and others on a potential
>> future API than arguing about the validity of an _existing_ one.
>>
>
>Yes, but while FDP support could be improved later on (happy to help if
>that'll be the case), I'm just afraid that doing less work now on
>defining the way data placement is exposed is going to result in a
>bigger mess later, when more use cases are considered.

Please see the message I sent on the other thread. I hope it is a
way to move forward and actually work together on this.


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-09 15:06                                         ` Keith Busch
       [not found]                                           ` <CGME20241010070738eucas1p2057209e5f669f37ca586ad4a619289ed@eucas1p2.samsung.com>
@ 2024-10-10  9:10                                           ` Christoph Hellwig
  1 sibling, 0 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-10  9:10 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Javier González, Jens Axboe,
	Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g

On Wed, Oct 09, 2024 at 09:06:25AM -0600, Keith Busch wrote:
> > anything to the reclaim unit?  Or is this another of the cases where
> > as a hyperscaler you just "know" from the data sheet?
> 
> As far as I know, this is an inconsequential spec detail that is not
> being considered by any applications testing this. And yet, the expected
> improvements are still there, so I don't see a point in holding this up for
> that reason.

It was the whole point of the thing, and a major source of complexity.
Although not quite for this use case, Hans has numbers showing that aligning
application data tables / objects / etc to the underlying "erase unit"
absolutely matters.

And just to make it clear, the objection here is not that we have an
interface that doesn't take that into account at the syscall
level.  We already have that interface and might as well make use
of it.  The problem is that the series tries to directly expose
that to the driver, freezing us into this incomplete interface forever
(assuming we have users actually picking it up).

That's why I keep insisting like a broken record that we need to get
this lower interface right first.


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-10  7:07                                             ` Javier González
@ 2024-10-10  9:13                                               ` Christoph Hellwig
  2024-10-10 11:59                                                 ` Javier González
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-10  9:13 UTC (permalink / raw)
  To: Javier González
  Cc: Keith Busch, Christoph Hellwig, Jens Axboe, Martin K. Petersen,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel,
	io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Thu, Oct 10, 2024 at 09:07:36AM +0200, Javier González wrote:
> I think we should attempt to pursue that with an example in mind. XFS
> seems to be the clear candidate. You have already done work to enable
> SMR HDDs; it seems we can get FDP under that umbrella. This will,
> however, take time to get right. We can help with development, testing,
> and experimental evaluation of the WAF benefits for such an interface.

Or ZNS SSDs for that matter.

> However, this work should not block existing hardware enabling an
> existing use-case. The current patches are not intrusive. They do not
> make changes to the API and merely wire up what is there to the driver.
> Anyone using temperatures will be able to use FDP - this is a win
> without a maintenance burden attached to it. The change to the FS /
> application API will not require major changes either; I believe we all
> agree that we cannot remove the temperatures, so all existing
> temperature users will be able to continue using them; we will just add
> an alternative for power users on the side.

As mentioned probably close to a dozen times over this thread and its
predecessors:  Keeping the per-file I/O hint API and mapping that to
FDP is fine.  Exposing the overly-simplistic hints to the NVMe driver
through the entire I/O stack and locking us into that is not.

> So the proposal is to merge the patches as they are and commit to work
> together on a new, better API for in-kernel users (FS), and for
> applications (syscall, uring).

And because of that, the whole merge it now and fix it later approach unfortunately
doesn't work, as a proper implementation will regress behavior for at
least some users that only have the I/O hints API but try to emulate
real stream separation with it.  With the per-I/O hints in io_uring that
is in fact almost guaranteed.



* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-10  7:13                                         ` Javier Gonzalez
@ 2024-10-10  9:20                                           ` Christoph Hellwig
  2024-10-10 12:22                                             ` Javier Gonzalez
  2024-10-10 10:46                                           ` Hans Holmberg
  1 sibling, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-10  9:20 UTC (permalink / raw)
  To: Javier Gonzalez
  Cc: Hans Holmberg, Christoph Hellwig, Jens Axboe, Martin K. Petersen,
	Keith Busch, Kanchan Joshi, [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected], [email protected]

On Thu, Oct 10, 2024 at 09:13:27AM +0200, Javier Gonzalez wrote:
> Is this because RocksDB already does segregation per file itself? Are
> you doing something specific in XFS, or using your knowledge of RocksDB
> to map files with an "unwritten" protocol in the middle?

XFS doesn't really do anything smart at all except for grouping files
with similar temperatures, but Hans can probably explain it in more
detail.  So yes, this relies on the application doing the data separation
and using the most logical vehicle for it: files.

>
>    In this context, we have collected data both using FDP natively in
>    RocksDB and using the temperatures. Both look very good, because both
>    are initiated by RocksDB, and the FS just passes the hints directly
>    to the driver.
>
> I ask this to understand whether this is the FS's responsibility or the
> application's. Our work points more to letting applications use the
> hints (as the use-cases are power users, like RocksDB). I agree with you
> that a FS could potentially make an improvement for legacy applications
> - we have not focused much on these though, so I trust your insights on
> it.

As mentioned multiple times before in this thread this absolutely
depends on the abstraction level of the application.  If the application
works on a raw device without a file system it obviously needs very
low-level control.  And in my opinion passthrough is by far the best
interface for that level of control.  If the application is using a
file system there is no better basic level abstraction than a file,
which can then be enhanced with relatively small amount of additional
information going both ways: the file system telling the application
what good file sizes and write patterns are, and the application telling
the file system what files are good candidates to merge into the same
write stream if the file system has to merge multiple actively written
to files into a write stream.  Trying to do low-level per I/O hints
on top of a file system is a recipe for trouble because you now have
two entities fighting over placement control.



* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-10  7:13                                         ` Javier Gonzalez
  2024-10-10  9:20                                           ` Christoph Hellwig
@ 2024-10-10 10:46                                           ` Hans Holmberg
       [not found]                                             ` <CGME20241010122734eucas1p1e20a5263a4d69db81b50b8b03608fad1@eucas1p1.samsung.com>
  1 sibling, 1 reply; 86+ messages in thread
From: Hans Holmberg @ 2024-10-10 10:46 UTC (permalink / raw)
  To: Javier Gonzalez
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch,
	Kanchan Joshi, [email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected]

On Thu, Oct 10, 2024 at 9:13 AM Javier Gonzalez <[email protected]> wrote:
>
> On 10.10.2024 08:40, Hans Holmberg wrote:
> >On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <[email protected]> wrote:
> >>
> >>
> >>
> >> > -----Original Message-----
> >> > From: Hans Holmberg <[email protected]>
> >> > Sent: Tuesday, October 8, 2024 12:07 PM
> >> > To: Javier Gonzalez <[email protected]>
> >> > Cc: Christoph Hellwig <[email protected]>; Jens Axboe <[email protected]>; Martin K.
> >> > Petersen <[email protected]>; Keith Busch <[email protected]>;
> >> > Kanchan Joshi <[email protected]>; [email protected]; [email protected];
> >> > [email protected]; [email protected]; [email protected]; [email protected];
> >> > [email protected]; [email protected]; [email protected];
> >> > [email protected]; [email protected]; linux-
> >> > [email protected]; [email protected]; [email protected];
> >> > [email protected]; [email protected]; [email protected]
> >> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints
> >> >
> >> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <[email protected]>
> >> > wrote:
> >> > >
> >> > > On 04.10.2024 14:30, Christoph Hellwig wrote:
> >> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
> >> > > >> So, considering that file systems _are_ able to use temperature hints and
> >> > > >> actually make them work, why don't we support FDP the same way we are
> >> > > >> supporting zones so that people can use it in production?
> >> > > >
> >> > > >Because apparently no one has tried it.  It should be possible in theory,
> >> > > >but for example unless you have a power of 2 reclaim unit size it
> >> > > >won't work very well with XFS where the AGs/RTGs must be power of two
> >> > > >aligned in the LBA space, except by overprovisioning the LBA space vs
> >> > > >the capacity actually used.
> >> > >
> >> > > This is good. I think we should have at least a FS POC with data
> >> > > placement support to be able to drive conclusions on how the interface
> >> > > and requirements should be. Until we have that, we can support the
> >> > > use-cases that we know customers are asking for, i.e., block-level hints
> >> > > through the existing temperature API.
> >> > >
> >> > > >
> >> > > >> I agree that down the road, an interface that allows hints (many more
> >> > > >> than 5!) is needed. And in my opinion, this interface should not have
> >> > > >> semantics attached to it, just a hint ID, #hints, and enough space to
> >> > > >> put 100s of them to support storage node deployments. But this needs to
> >> > > >> come from the users of the hints / zones / streams / etc,  not from
> >> > > >> us vendors. We have neither details on how they deploy these
> >> > > >> features at scale, nor the workloads to validate the results. Anything
> >> > > >> else will probably just continue polluting the storage stack with more
> >> > > >> interfaces that are not used and add to the problem of data placement
> >> > > >> fragmentation.
> >> > > >
> >> > > >Please always mention what layer you are talking about.  At the syscall
> >> > > >level the temperature hints are doing quite ok.  A full stream separation
> >> > > >would obviously be a lot better, as would be communicating the zone /
> >> > > >reclaim unit / etc size.
> >> > >
> >> > > I mean at the syscall level. But as mentioned above, we need to be very
> >> > > sure that we have a clear use-case for that. If we continue seeing hints
> >> > > being used in NVMe or other protocols, and the number increases
> >> > > significantly, we can deal with it later on.
> >> > >
> >> > > >
> >> > > >As an interface to a driver that doesn't natively speak temperature
> >> > > >hint on the other hand it doesn't work at all.
> >> > > >
> >> > > >> The issue is that the first series of this patch, which is as simple as
> >> > > >> it gets, hit the list in May. Since then we have gone down paths that lead
> >> > > >> nowhere. So the line between real technical feedback that leads to
> >> > > >> a feature being merged, and technical misdirection that keeps people
> >> > > >> busy, becomes very thin. In the whole data placement effort, we have
> >> > > >> been down this path many times, unfortunately...
> >> > > >
> >> > > >Well, the previous round was the first one actually trying to address the
> >> > > >fundamental issue after 4 months.  And then after a first round of feedback
> >> > > >it gets shut down somehow out of nowhere.  As a maintainer and reviewer,
> >> > > >that is the kind of contribution I have a hard time taking seriously.
> >> > >
> >> > > I am not sure I understand what you mean in the last sentence, so I will
> >> > > not respond filling blanks with a bad interpretation.
> >> > >
> >> > > In summary, what we are asking for is to take the patches that cover the
> >> > > current use-case, and work together on what might be needed for better
> >> > > FS support. For this, it seems you and Hans have a good idea of what you
> >> > > want to have based on XFS. We can help review or do part of the work,
> >> > > but trying to guess our way will only delay existing customers using
> >> > > existing HW.
> >> >
> >> > After reading the whole thread, I end up wondering why we need to rush the
> >> > support for a single use case through instead of putting the architecture
> >> > in place for properly supporting this new type of hardware from the start
> >> > throughout the stack.
> >>
> >> This is not a rush. We have been supporting this use case through passthru for
> >> over half a year, with code already upstream in Cachelib. This is mature enough
> >> to move into the block layer, which is what the end user wants to do either way.
> >>
> >> This is though a very good point. This is why we upstreamed passthru at the
> >> time; so people can experiment, validate, and upstream only when there is a
> >> clear path.
> >>
> >> >
> >> > Even for user space consumers of raw block devices, is the last version
> >> > of the patch set good enough?
> >> >
> >> > * It severely cripples the data separation capabilities as only a handful of
> >> >   data placement buckets are supported
> >>
> >> I could understand from your presentation at LPC, and later looking at the code
> >> that is available, that you have been successful at getting good results with the
> >> existing interface in XFS. The mapping from the temperature semantics to zones (no
> >> semantics) is exactly the same as we are doing with FDP. Not having to change
> >> either in-kernel or user-space structures is great.
> >
> >No, we don't map data directly to zones using lifetime hints. In fact,
> >lifetime hints contribute only a relatively small part (~10% extra write
> >amp reduction, see the rocksdb benchmark results). Segregating data by
> >file is the most important part of the data placement heuristic, at
> >least for this type of workload.
>
> Is this because RocksDB already does segregation per file itself? Are
> you doing something specific in XFS, or using your knowledge of RocksDB
> to map files with an "unwritten" protocol in the middle?

Data placement by file is based on the observation that the lifetimes
of a file's data blocks are strongly correlated: when a file is
deleted, all its blocks become reclaimable at that point. This requires
knowledge about the data placement buckets and works really well
without any hints provided.
The lifetime-hint heuristic I added on top is based on rocksdb
statistics, but is designed to be generic enough to work for a wider
range of workloads (still need to validate this though - more work to
be done).
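
As a rough illustration of the file-first heuristic described above
(all names, the unit count, and the modulo mapping are hypothetical,
not the actual XFS code): place each file in its own unit, and fall
back to grouping by lifetime hint only when concurrently written files
outnumber the available placement units:

```c
/* Hypothetical sketch of file-first data placement: blocks of one file
 * tend to be freed together, so each file maps to its own placement
 * unit; the lifetime hint is only a fallback grouping when there are
 * more concurrently written files than units. Not the real XFS code. */
#include <stdint.h>

enum { NR_PLACEMENT_UNITS = 8 };    /* e.g. zones or FDP reclaim units */

struct wfile {
    uint64_t ino;                   /* identifies the file */
    uint8_t  life_hint;             /* 0..5, RWH_WRITE_LIFE_* scale */
};

static int pick_bucket(const struct wfile *f, int nr_open_writers)
{
    if (nr_open_writers <= NR_PLACEMENT_UNITS)
        return (int)(f->ino % NR_PLACEMENT_UNITS);  /* one file, one unit */
    /* Too many writers: group by expected lifetime instead. */
    return f->life_hint % NR_PLACEMENT_UNITS;
}
```

The point of the sketch is the ordering of the two criteria: the hint
only refines placement once per-file separation is no longer possible,
matching the ~10% incremental benefit mentioned above.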

>
>     In this context, we have collected data both using FDP natively in
>     RocksDB and using the temperatures. Both look very good, because both
>     are initiated by RocksDB, and the FS just passes the hints directly
>     to the driver.
>
> I ask this to understand whether this is the FS's responsibility or the
> application's. Our work points more to letting applications use the
> hints (as the use-cases are power users, like RocksDB). I agree with you
> that a FS could potentially make an improvement for legacy applications
> - we have not focused much on these though, so I trust your insights on
> it.

The big problem as I see it is that if applications are going to work
well together on the same media, we need a common placement
implementation somewhere, and it seems pretty natural to me to make it
part of the filesystem.


>
> >>
> >> >
> >> > * It just won't work if there is more than one user application per storage
> >> >   device, as different applications' data streams will be mixed at the nvme
> >> >   driver level.
> >>
> >> For now this use-case is not clear. Folks working on it are using passthru. When we
> >> have a clearer understanding of what is needed, we might need changes in the kernel.
> >>
> >> If you see a need for this in the work that you are doing, by all means, please send patches.
> >> As I said at LPC, we can work together on this.
> >>
> >> >
> >> > While Christoph has already outlined what would be desirable from a
> >> > file system point of view, I don't have the answer to what would be the overall
> >> > best design for FDP. I would like to say that it looks to me like we need to
> >> > consider more than the early adoption use cases and make sure we
> >> > make the most of the hardware capabilities via logical abstractions that
> >> > would be compatible with a wider range of storage devices.
> >> >
> >> > Figuring out the right way forward is tricky, but why not just let it take the time
> >> > that is needed to sort this out while early users explore how to use FDP
> >> > drives and share the results?
> >>
> >> I agree that we might need a new interface to support more hints, beyond the temperatures.
> >> Or maybe not. We would not know until someone comes with a use case. We have made the
> >> mistake in the past of treating internal research as upstreamable work. I can now see that
> >> this simply complicates the in-kernel and user-space APIs.
> >>
> >> The existing API is usable and requires no changes. There is hardware. There are customers.
> >> There are applications with upstream support which have been tested with passthru (the
> >> early results you mention). And the wiring to NVMe is _very_ simple. There is no reason
> >> not to take this in, and then we will see what new interfaces we might need in the future.
> >>
> >> I would much rather spend time in discussing ideas with you and others on a potential
> >> future API than arguing about the validity of an _existing_ one.
> >>
> >
> >Yes, but while FDP support could be improved later on (happy to help if
> >that'll be the case), I'm just afraid that less work now defining the
> >way data placement is exposed is going to result in a bigger mess later
> >when more use cases are considered.
>
> Please see the message I responded to on the other thread. I hope it is a
> way to move forward and actually work together on this.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-10  9:13                                               ` Christoph Hellwig
@ 2024-10-10 11:59                                                 ` Javier González
  2024-10-11  9:02                                                   ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Javier González @ 2024-10-10 11:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Jens Axboe, Martin K. Petersen, Kanchan Joshi, hare,
	sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g

On 10.10.2024 11:13, Christoph Hellwig wrote:
>On Thu, Oct 10, 2024 at 09:07:36AM +0200, Javier González wrote:
>> I think we should attempt to pursue that with an example in mind. XFS
>> seems the clear candidate. You have done work already in enabling SMR
>> HDDs; it seems we can get FDP under that umbrella. This will however
>> take time to get right. We can help with development, testing, and
>> experimental evaluation of the WAF benefits for such an interface.
>
>Or ZNS SSDs for that matter.

Maybe. I do not see much movement on this.

>
>> However, this work should not block existing hardware enabling an
>> existing use-case. The current patches are not intrusive. They do not
>> make changes to the API and merely wire up what is there to the driver.
>> Anyone using temperatures will be able to use FDP - this is a win without
>> a maintenance burden attached to it. The change to the FS / application
>> API will not require major changes either; I believe we all agree that
>> we cannot remove the temperatures, so all existing temperature users
>> will be able to continue using them; we will just add an alternative for
>> power users on the side.
>
>As mentioned probably close to a dozen times over this thread and its
>predecessors:  Keeping the per-file I/O hint API and mapping that to
>FDP is fine.  Exposing the overly-simplistic hints to the NVMe driver
>through the entire I/O stack and locking us into that is not.

I don't understand the "locking us into that" part.

>> So the proposal is to merge the patches as they are and commit to work
>> together on a new, better API for in-kernel users (FS), and for
>> applications (syscall, uring).
>
>And because of that the whole "merge it now and fix it later" approach
>unfortunately doesn't work, as a proper implementation will regress
>behavior for at least some users that only have the I/O hints API but
>try to emulate real stream separation with it.  With the per-I/O hints
>in io_uring that is in fact almost guaranteed.

I do not think this argument is valid. These patches are not merge now,
fix later. It is rather: use the current interface, and work together on
a new one, if needed.

The new interface you are talking about is not clear to me at all. If
you could share some patches showing what that would look like, it would
give a much clearer idea of what you have in mind. The whole idea of
not covering an existing use-case because we might be able to do
something in the future is not very tangible.


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-10  9:20                                           ` Christoph Hellwig
@ 2024-10-10 12:22                                             ` Javier Gonzalez
  2024-10-11  8:56                                               ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Javier Gonzalez @ 2024-10-10 12:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Hans Holmberg, Jens Axboe, Martin K. Petersen, Keith Busch,
	Kanchan Joshi, [email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected]

On 10.10.2024 11:20, Christoph Hellwig wrote:
>On Thu, Oct 10, 2024 at 09:13:27AM +0200, Javier Gonzalez wrote:
>> Is this because RocksDB already does segregation per file itself? Are
>> you doing something specific on XFS, or using your knowledge of RocksDB
>> to map files with an "unwritten" protocol in the middle?
>
>XFS doesn't really do anything smart at all except for grouping files
>with similar temperatures, but Hans can probably explain it in more
>detail.  So yes, this relies on the application doing the data separation
>and using the most logical vehicle for it: files.

This makes sense. Agree.

>
>>
>>    In this context, we have collected data both using FDP natively in
>>    RocksDB and using the temperatures. Both look very good, because both
>>    are initiated by RocksDB, and the FS just passes the hints directly
>>    to the driver.
>>
>> I ask this to understand if this is the FS responsibility or the
>> application's one. Our work points more to letting applications use the
>> hints (as the use-cases are power users, like RocksDB). I agree with you
>> that a FS could potentially make an improvement for legacy applications
>> - we have not focused much on these though, so I trust you insights on
>> it.
>
>As mentioned multiple times before in this thread this absolutely
>depends on the abstraction level of the application.  If the application
>works on a raw device without a file system it obviously needs very
>low-level control.  And in my opinion passthrough is by far the best
>interface for that level of control. 

Passthru is great for prototyping and getting insights on end-to-end
applicability. We see though that it is difficult to get a full solution
based on it, unless people implement a user-space layer tailored to their
use-case (e.g., a version of SPDK's bdev). After the POC phase, most folks
that can use passthru prefer to move to block - with a validated
use-case it should be easier to get things upstream.

This is exactly where we are now.

>If the application is using a
>file system there is no better basic level abstraction than a file,
>which can then be enhanced with relatively small amount of additional
>information going both ways: the file system telling the application
>what good file sizes and write patterns are, and the application telling
>the file system what files are good candidates to merge into the same
>write stream if the file system has to merge multiple actively written
>to files into a write stream.  Trying to do low-level per-I/O hints
>on top of a file system is a recipe for trouble, because you now have
>two entities fighting over placement control.

For file, I agree with you.

If you saw Christian's comments on the inode space, there are a
few plumbing challenges. Do you have any patches we could look at?





* Re: [PATCH v7 0/3] FDP and per-io hints
       [not found]                                             ` <CGME20241010122734eucas1p1e20a5263a4d69db81b50b8b03608fad1@eucas1p1.samsung.com>
@ 2024-10-10 12:27                                               ` Javier Gonzalez
  2024-10-11  8:59                                                 ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Javier Gonzalez @ 2024-10-10 12:27 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch,
	Kanchan Joshi, [email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected]

On 10.10.2024 12:46, Hans Holmberg wrote:
>On Thu, Oct 10, 2024 at 9:13 AM Javier Gonzalez <[email protected]> wrote:
>>
>> On 10.10.2024 08:40, Hans Holmberg wrote:
>> >On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <[email protected]> wrote:
>> >>
>> >>
>> >>
>> >> > -----Original Message-----
>> >> > From: Hans Holmberg <[email protected]>
>> >> > Sent: Tuesday, October 8, 2024 12:07 PM
>> >> > To: Javier Gonzalez <[email protected]>
>> >> > Cc: Christoph Hellwig <[email protected]>; Jens Axboe <[email protected]>; Martin K.
>> >> > Petersen <[email protected]>; Keith Busch <[email protected]>;
>> >> > Kanchan Joshi <[email protected]>; [email protected]; [email protected];
>> >> > [email protected]; [email protected]; [email protected]; [email protected];
>> >> > [email protected]; [email protected]; [email protected];
>> >> > [email protected]; [email protected]; linux-
>> >> > [email protected]; [email protected]; [email protected];
>> >> > [email protected]; [email protected]; [email protected]
>> >> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints
>> >> >
>> >> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <[email protected]>
>> >> > wrote:
>> >> > >
>> >> > > On 04.10.2024 14:30, Christoph Hellwig wrote:
>> >> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
>> >> > > >> So, considering that file systems _are_ able to use temperature hints and
>> >> > > >> actually make them work, why don't we support FDP the same way we are
>> >> > > >> supporting zones, so that people can use it in production?
>> >> > > >
>> >> > > >Because apparently no one has tried it.  It should be possible in theory,
>> >> > > >but for example unless you have a power-of-2 reclaim unit size it
>> >> > > >won't work very well with XFS, where the AGs/RTGs must be power-of-two
>> >> > > >aligned in the LBA space, except by overprovisioning the LBA space vs
>> >> > > >the capacity actually used.
>> >> > >
>> >> > > This is good. I think we should have at least a FS POC with data
>> >> > > placement support to be able to drive conclusions on how the interface
>> >> > > and requirements should be. Until we have that, we can support the
>> >> > > use-cases that we know customers are asking for, i.e., block-level hints
>> >> > > through the existing temperature API.
>> >> > >
>> >> > > >
>> >> > > >> I agree that down the road, an interface that allows hints (many more
>> >> > > >> than 5!) is needed. And in my opinion, this interface should not have
>> >> > > >> semantics attached to it, just a hint ID, #hints, and enough space to
>> >> > > >> put 100s of them to support storage node deployments. But this needs to
>> >> > > >> come from the users of the hints / zones / streams / etc, not from
>> >> > > >> us vendors. We have neither details on how they deploy these
>> >> > > >> features at scale, nor the workloads to validate the results. Anything
>> >> > > >> else will probably just continue polluting the storage stack with more
>> >> > > >> interfaces that are not used and add to the problem of data placement
>> >> > > >> fragmentation.
>> >> > > >
>> >> > > >Please always mention what layer you are talking about.  At the syscall
>> >> > > >level the temperature hints are doing quite ok.  A full stream separation
>> >> > > >would obviously be a lot better, as would be communicating the zone /
>> >> > > >reclaim unit / etc size.
>> >> > >
>> >> > > I mean at the syscall level. But as mentioned above, we need to be very
>> >> > > sure that we have a clear use-case for that. If we continue seeing hints
>> >> > > being used in NVMe or other protocols, and the number increases
>> >> > > significantly, we can deal with it later on.
>> >> > >
>> >> > > >
>> >> > > >As an interface to a driver that doesn't natively speak temperature
>> >> > > >hint on the other hand it doesn't work at all.
>> >> > > >
>> >> > > >> The issue is that the first series of this patch, which is as simple as
>> >> > > >> it gets, hit the list in May. Since then we have gone down paths that lead
>> >> > > >> nowhere. So the line between real technical feedback that leads to
>> >> > > >> a feature being merged, and technical misleading that just keeps people
>> >> > > >> busy, becomes very thin. In the whole data placement effort, we have
>> >> > > >> been down this path many times, unfortunately...
>> >> > > >
>> >> > > >Well, the previous round was the first one actually trying to address the
>> >> > > >fundamental issue after 4 months.  And then after a first round of feedback
>> >> > > >it gets shut down somehow out of nowhere.  As a maintainer and reviewer that
>> >> > > >is the kind of contribution I have a hard time taking seriously.
>> >> > >
>> >> > > I am not sure I understand what you mean in the last sentence, so I will
>> >> > > not respond filling in the blanks with a bad interpretation.
>> >> > >
>> >> > > In summary, what we are asking for is to take the patches that cover the
>> >> > > current use-case, and work together on what might be needed for better
>> >> > > FS support. For this, it seems you and Hans have a good idea of what you
>> >> > > want to have based on XFS. We can help review or do part of the work,
>> >> > > but trying to guess our way will only delay existing customers using
>> >> > > existing HW.
>> >> >
>> >> > After reading the whole thread, I end up wondering why we need to rush the
>> >> > support for a single use case through instead of putting the architecture
>> >> > in place for properly supporting this new type of hardware from the start
>> >> > throughout the stack.
>> >>
>> >> This is not a rush. We have been supporting this use case through passthru for
>> >> over half a year, with code already upstream in Cachelib. This is mature enough
>> >> to move into the block layer, which is what the end user wants to do either way.
>> >>
>> >> This is though a very good point. This is why we upstreamed passthru at the
>> >> time; so people can experiment, validate, and upstream only when there is a
>> >> clear path.
>> >>
>> >> >
>> >> > Even for user space consumers of raw block devices, is the last version
>> >> > of the patch set good enough?
>> >> >
>> >> > * It severely cripples the data separation capabilities as only a handful of
>> >> >   data placement buckets are supported
>> >>
>> >> I could understand from your presentation at LPC, and later looking at the code that
>> >> is available, that you have been successful at getting good results with the existing
>> >> interface in XFS. The mapping from the temperature semantics to zones (no semantics)
>> >> is the exact same as we are doing with FDP. Not having to change either in-kernel or
>> >> user-space structures is great.
>> >
>> >No, we don't map data directly to zones using lifetime hints. In fact,
>> >lifetime hints contribute only a relatively small part (~10% extra
>> >write amp reduction, see the rocksdb benchmark results).
>> >Segregating data by file is the most important part of the data
>> >placement heuristic, at least for this type of workload.
>>
>> Is this because RocksDB already does segregation per file itself? Are
>> you doing something specific on XFS, or using your knowledge of RocksDB
>> to map files with an "unwritten" protocol in the middle?
>
>Per-file data placement is based on the observation that the lifetimes
>of a file's data blocks are strongly correlated: when a file is
>deleted, all of its blocks become reclaimable at that point. This
>requires knowledge about the data placement buckets and works really
>well without any hints provided.

But we need hints to put files together. I believe you do this already,
as no placement protocol gives you unlimited separation.

>The lifetime hint heuristic I added on top is based on rocksdb
>statistics, but is designed to be generic enough to work for a wider
>range of workloads (still need to validate this though - more work to
>be done).

Maybe you can post some patches for the parts dedicated to the VFS level
and the user-space API (syscall or uring)?

Following on the comment to Christoph, it would be good to have
something tangible to work together on for the next stage of this
support.

>
>>
>>     In this context, we have collected data both using FDP natively in
>>     RocksDB and using the temperatures. Both look very good, because both
>>     are initiated by RocksDB, and the FS just passes the hints directly
>>     to the driver.
>>
>> I ask this to understand if this is the FS responsibility or the
>> application's one. Our work points more to letting applications use the
>> hints (as the use-cases are power users, like RocksDB). I agree with you
>> that a FS could potentially make an improvement for legacy applications
>> - we have not focused much on these though, so I trust you insights on
>> it.
>
>The big problem as I see it is that if applications are going to work
>well together on the same media we need a common placement
>implementation somewhere, and it seems pretty natural to make it part
>of filesystems to me.

For FS users, this makes a lot of sense. But we still need to cover
applications using raw block devices.



* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-10 12:22                                             ` Javier Gonzalez
@ 2024-10-11  8:56                                               ` Christoph Hellwig
  2024-10-11 12:21                                                 ` Javier Gonzalez
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-11  8:56 UTC (permalink / raw)
  To: Javier Gonzalez
  Cc: Christoph Hellwig, Hans Holmberg, Jens Axboe, Martin K. Petersen,
	Keith Busch, Kanchan Joshi, [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected], [email protected]

On Thu, Oct 10, 2024 at 02:22:32PM +0200, Javier Gonzalez wrote:
> Passthru is great for prototyping and getting insights on end-to-end
> applicability. We see though that it is difficult to get a full solution
> based on it, unless people implement a use-space layer tailored to their
> use-case (e.g., a version SPDK's bdev). After the POC phase, most folks
> that can use passthru prefer to move to block - with a validated
> use-case it should be easier to get things upstream.
>
> This is exactly where we are now.

That's a lot of marketing babble :)    What exact thing is missing
from the passthrough interface when using, say, SPDK over io_uring?

> If you saw the comments from Christian on the inode space, there are a
> few plumbing challenges. Do you have any patches we could look at?

I'm not sure what you refer to here.



* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-10 12:27                                               ` Javier Gonzalez
@ 2024-10-11  8:59                                                 ` Christoph Hellwig
  0 siblings, 0 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-11  8:59 UTC (permalink / raw)
  To: Javier Gonzalez
  Cc: Hans Holmberg, Christoph Hellwig, Jens Axboe, Martin K. Petersen,
	Keith Busch, Kanchan Joshi, [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected], [email protected]

On Thu, Oct 10, 2024 at 02:27:33PM +0200, Javier Gonzalez wrote:

[full quote snipped; it would be really helpful to only quote what you
actually reference, per the usual email rules]

>> Data placement by-file is based on that the lifetime of a file's data
>> blocks are strongly correlated. When a file is deleted, all its blocks
>> will be reclaimable at that point. This requires knowledge about the
>> data placement buckets and works really well without any hints
>> provided.
>
> But we need hints to put files together. I believe you do this already,
> as no placement protocol gives you unlimited separation.

The per-file temperature hints do a reasonable job for that.  A fully
developed version of the separate write streams submitted by Kanchan
would probably do even better, but for now the per-file hints seem
to be enough.

> Maybe you can post some patches on the parts dedicated to the VFS level
> and user-space API (syscall or uring)?

It's just using the existing temperature hints.
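For context, "the existing temperature hints" here means the F_SET_RW_HINT/F_GET_RW_HINT fcntl pair operating on the RWH_WRITE_LIFE_* values (merged in Linux 4.13). A minimal sketch; the fallback numeric values below mirror include/uapi/linux/fcntl.h in case the libc headers are older:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Fallback definitions matching include/uapi/linux/fcntl.h */
#ifndef F_SET_RW_HINT
#define F_LINUX_SPECIFIC_BASE	1024
#define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
#define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
#define RWH_WRITE_LIFE_SHORT	2
#endif

/*
 * Tag an open file with a per-inode lifetime hint; the filesystem (or
 * the block device, for raw access) may use it to separate write
 * streams. Returns 0 on success, -1 if the kernel lacks support.
 */
static int set_short_lived(int fd)
{
	uint64_t hint = RWH_WRITE_LIFE_SHORT;

	return fcntl(fd, F_SET_RW_HINT, &hint);
}
```

Per the cover letter, Kanchan's series keeps this per-inode interface intact and maps the existing lifetime classes onto FDP in the nvme driver.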



* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-10 11:59                                                 ` Javier González
@ 2024-10-11  9:02                                                   ` Christoph Hellwig
  2024-10-11 17:08                                                     ` Jens Axboe
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-11  9:02 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Martin K. Petersen,
	Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl,
	dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel,
	io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Thu, Oct 10, 2024 at 01:59:14PM +0200, Javier González wrote:
>> As mentioned probably close to a dozen times over this thread and its
>> predecessors:  Keeping the per-file I/O hint API and mapping that to
>> FDP is fine.  Exposing the overly-simplistic hints to the NVMe driver
>> through the entire I/O stack and locking us into that is not.
>
> I don't understand the "locking us into that" part.

The patches as submitted do the following two things:

 1) interpret the simple temperature hints to map to FDP reclaim handles
 2) add a new interface to set the temperature hints per I/O

and also rely on an annoying existing implementation detail where the I/O
hints set on random files get automatically propagated to the block
device without file system involvement.

This means we can't easily make the nvme driver actually use smarter
hints that expose the actual FDP capabilities without breaking users
that relied on the existing behavior, especially the per-I/O hints that
counteract any kind of file system based data placement.



* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-11  8:56                                               ` Christoph Hellwig
@ 2024-10-11 12:21                                                 ` Javier Gonzalez
  2024-10-11 16:59                                                   ` Keith Busch
  0 siblings, 1 reply; 86+ messages in thread
From: Javier Gonzalez @ 2024-10-11 12:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Hans Holmberg, Jens Axboe, Christian Brauner, Martin K. Petersen,
	Keith Busch, Kanchan Joshi, [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected]

On 11.10.2024 10:56, Christoph Hellwig wrote:
>On Thu, Oct 10, 2024 at 02:22:32PM +0200, Javier Gonzalez wrote:
>> Passthru is great for prototyping and getting insights on end-to-end
>> applicability. We see though that it is difficult to get a full solution
>> based on it, unless people implement a use-space layer tailored to their
>> use-case (e.g., a version SPDK's bdev). After the POC phase, most folks
>> that can use passthru prefer to move to block - with a validated
>> use-case it should be easier to get things upstream.
>>
>> This is exactly where we are now.
>
>That's a lot of marketing babble :)    What exact thing is missing
>from the passthrough interface when using, say, SPDK over io_uring?

The block layer provides a lot of functionality that passthru cannot
provide. A simple example would be splits. You know this :)

I am sure Jens and Keith can give you more specifics on their particular
reasons.

>
>> If you saw the comments from Christian on the inode space, there are a
>> few plumbing challenges. Do you have any patches we could look at?
>
>I'm not sure what you refer to here.

This from Christian:
   https://lore.kernel.org/all/20240903-erfassen-bandmitglieder-32dfaeee66b2@brauner/


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-11 12:21                                                 ` Javier Gonzalez
@ 2024-10-11 16:59                                                   ` Keith Busch
  0 siblings, 0 replies; 86+ messages in thread
From: Keith Busch @ 2024-10-11 16:59 UTC (permalink / raw)
  To: Javier Gonzalez
  Cc: Christoph Hellwig, Hans Holmberg, Jens Axboe, Christian Brauner,
	Martin K. Petersen, Kanchan Joshi, [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected]

On Fri, Oct 11, 2024 at 02:21:02PM +0200, Javier Gonzalez wrote:
> > That's a lot of marketing babble :)    What exact thing is missing
> > from the passthrough interface when using say spdx over io_uring?
> 
> The block layer provides a lot of functionality that passthru cannot
> provide. A simple example would be splits. You know this :)
> 
> I am sure Jens and Keith can give you more specifics on their particular
> reasons.

Splitting, merging, cgroups, iostats, error retries, failover.

We have a recent change staged in 6.13 to try to count iostats on
passthrough, but it doesn't do discards.

Passthrough to partitions requires root access and the slow CAP_SYS_ADMIN
check on every IO.

Page cache. Though that may not readily work with this io_uring proposal
as-is either.

The generic IO path is safe across format changes; passthrough isn't.

And it's just generally more work to port existing applications to use
the passthrough interface. It ends up duplicating solutions to scenarios
the kernel already handles.


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-11  9:02                                                   ` Christoph Hellwig
@ 2024-10-11 17:08                                                     ` Jens Axboe
  2024-10-14  6:21                                                       ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Jens Axboe @ 2024-10-11 17:08 UTC (permalink / raw)
  To: Christoph Hellwig, Javier González
  Cc: Keith Busch, Martin K. Petersen, Kanchan Joshi, hare, sagi,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g

On 10/11/24 3:02 AM, Christoph Hellwig wrote:
>>> As mentioned probably close to a dozen times over this thread and it's
>>> predecessors:  Keeping the per-file I/O hint API and mapping that to
>>> FDP is fine.  Exposing the overly-simplistic hints to the NVMe driver
>>> through the entire I/O stack and locking us into that is not.
>>
>> I don't understand the "locking us into that" part.
> 
> The patches as submitted do the following two things:
> 
>  1) interpret the simple temperature hints to map to FDP reclaim handles
>  2) add a new interface to set the temperature hints per I/O
> 
> and also rely on an annoying existing implementation detail where the I/O
> hints set on random files get automatically propagated to the block
> device without file system involvement.
> 
> This means we can't easily make the nvme driver actually use smarter
> hints that expose the actual FDP capabilities without breaking users
> that relied on the existing behavior, especially the per-I/O hints that
> counteract any kind of file system based data placement.
> 

I think that last argument is a straw man - for any kind of interface
like this, we've ALWAYS just had the rule that any per-whatever
overrides the generic setting. Per-IO hints would do the same thing. Is
this a mess if a user has assigned a file hint and uses differing per-IO
hints on that very same file? Certainly, but that's
really just the user/app being dumb. Or maybe being smart, as this
particular block is always hot in the file (yeah, a contrived case).

The point is, if you mix and match per-io hints and per-file hints, then
you're probably being pretty silly and nobody should do that. I don't
think this is a practical concern at all.
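The precedence rule described above (a per-I/O hint, when present, wins over the per-file one) boils down to a one-liner. Illustrative sketch only; WRITE_LIFE_NOT_SET mirrors the kernel's "no hint set" value of 0:

```c
#include <assert.h>
#include <stdint.h>

#define WRITE_LIFE_NOT_SET 0	/* mirrors RWH_WRITE_LIFE_NOT_SET */

/*
 * Resolve the hint for a single write: a per-I/O hint, when set,
 * takes precedence over whatever was set on the inode.
 */
static inline uint8_t effective_hint(uint8_t per_io, uint8_t per_inode)
{
	return per_io != WRITE_LIFE_NOT_SET ? per_io : per_inode;
}
```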

-- 
Jens Axboe


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-11 17:08                                                     ` Jens Axboe
@ 2024-10-14  6:21                                                       ` Christoph Hellwig
  2024-10-14  7:02                                                         ` Javier Gonzalez
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-14  6:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Javier González, Keith Busch,
	Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g

On Fri, Oct 11, 2024 at 11:08:26AM -0600, Jens Axboe wrote:
> 
> I think that last argument is a straw man - for any kind of interface
> like this, we've ALWAYS just had the rule that any per-whatever
> overrides the generic setting.

And exactly that is the problem.  For file systems we can't support
that sanely.  So IFF you absolutely want the per-I/O hints we need
an opt in by the file operations.  I've said that at least twice
in this discussion before, but as everyone likes to have political
discussions instead of technical ones no one replied to that.



* RE: [PATCH v7 0/3] FDP and per-io hints
  2024-10-14  6:21                                                       ` Christoph Hellwig
@ 2024-10-14  7:02                                                         ` Javier Gonzalez
  2024-10-14  7:47                                                           ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Javier Gonzalez @ 2024-10-14  7:02 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Martin K. Petersen, Kanchan Joshi, [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected], [email protected]

> 
> On Fri, Oct 11, 2024 at 11:08:26AM -0600, Jens Axboe wrote:
> >
> > I think that last argument is a straw man - for any kind of interface
> > like this, we've ALWAYS just had the rule that any per-whatever
> > overrides the generic setting.
> 
> And exactly that is the problem.  For file systems we can't support
> that sanely.  So IFF you absolutely want the per-I/O hints we need
> an opt in by the file operations.  I've said that at least twice
> in this discussion before, but as everyone likes to have political
> discussions instead of technical ones no one replied to that.

Is it a way forward to add this in a new spin of the series, keeping the
temperature mapping on the NVMe side?

If not, what would be acceptable for a first version, before getting into adding
a new interface to expose agnostic hints?



* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-14  7:02                                                         ` Javier Gonzalez
@ 2024-10-14  7:47                                                           ` Christoph Hellwig
  2024-10-14  9:08                                                             ` Javier Gonzalez
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-14  7:47 UTC (permalink / raw)
  To: Javier Gonzalez
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Martin K. Petersen,
	Kanchan Joshi, [email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected]

On Mon, Oct 14, 2024 at 07:02:11AM +0000, Javier Gonzalez wrote:
> > And exactly that is the problem.  For file systems we can't support
> > that sanely.  So IFF you absolutely want the per-I/O hints we need
> > an opt in by the file operations.  I've said that at least twice
> > in this discussion before, but as everyone likes to have political
> > discussions instead of technical ones no one replied to that.
> 
> Is it a way forward to add this in a new spin of the series - keeping the 
> temperature mapping on the NVMe side?

What do you gain from that?  NVMe does not understand data temperatures,
so why make up that claim?  Especially as it directly undermines any
file system work to actually make use of it.

> If not, what would be acceptable for a first version, before getting into adding
> a new interface to expose agnostic hints?

Just iterate on Kanchan's series for a block layer (and possible user level,
but that's less urgent) interface for stream separation?


* RE: [PATCH v7 0/3] FDP and per-io hints
  2024-10-14  7:47                                                           ` Christoph Hellwig
@ 2024-10-14  9:08                                                             ` Javier Gonzalez
  2024-10-14 11:50                                                               ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Javier Gonzalez @ 2024-10-14  9:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Martin K. Petersen, Kanchan Joshi,
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected]

> 
> On Mon, Oct 14, 2024 at 07:02:11AM +0000, Javier Gonzalez wrote:
> > > And exactly that is the problem.  For file systems we can't support
> > > that sanely.  So IFF you absolutely want the per-I/O hints we need
> > > an opt in by the file operations.  I've said that at least twice
> > > in this discussion before, but as everyone likes to have political
> > > discussions instead of technical ones no one replied to that.
> >
> > Is it a way forward to add this in a new spin of the series - keeping the
> > temperature mapping on the NVMe side?
> 
> What do you gain from that?  NVMe does not understand data temperatures,
> so why make up that claim?  

I expressed this a couple of times in this thread. It is no problem to map temperatures
to a protocol that does not understand the semantics. It is not perfect, and in time
I agree that we would benefit from exposing the raw hints without semantics from
the block layer upwards.

But the temperatures are already there. We are not adding anything new. And thanks
to this, we can enable existing users with _minimal_ changes to user-space. As pointed
out in the XFS zoned discussions, this is very similar to what you guys are doing (exactly to
re-use what is already there), and it works. On top of this, applications that already
understand temperatures (e.g., RocksDB) will be able to leverage FDP SSDs without changes.

> Especially as it directly undermines any file system work to actually make use of it.

I do not think it does. If a FS wants to use the temperatures, then it would be able to leverage
FDP besides SCSI. And if we come up with a better interface later on, we can make the changes then.
I really do not see the issue. If we were adding a temperature abstraction now, I would agree with
you that we would need to cover the use-case you mention for FSs from the beginning, but this
is already here. Seems like a fair compromise to support current users.


> 
> > If not, what would be acceptable for a first version, before getting into adding
> > a new interface to expose agnostic hints?
> 
> Just iterate on Kanchan's series for a block layer (and possible user level,
> but that's less urgent) interface for stream separation?

We can work on this later, but it is not that easy. Look at the mail I forwarded
from Christian. We can now store hints in the same structure as the temperatures.
I believe that we would be able to support 128 hints. If we are serious about supporting
hints as a general interface, 128 hints is not nearly enough. Moreover:

  - How do we convince VFS folks to give us more space for hints at this point?
  - How do we make agnostic hints and temperature co-exist? Do we use a bit to select 
     and create a protocol? This seems over-complicated
  - When we are exposing general hints to the block layer, how do we support N:N
     FS:block-device mappings? And DMs? This is not obvious, and it goes well beyond what we
     need today, but we probably need to think about it.
 
And these are just questions that we know of. You see that this is a big lift (which we
can work together on!), but it will delay current users by months. And honestly, I do
not feel comfortable doing this alone; we need application and FS folks like you to
help guide this so that it is really usable.

I say it again: your idea about properly supporting hints in FSs is good, and I think we should
work on it. But I do not believe that the current patches (with added things like opt-in file ops)
get in the way of this. So if the disagreement is whether these patches are harmful, let's talk about
this explicitly and try to agree; we do not need to go over the need for a new hint interface.
We already agree on this.



* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-14  9:08                                                             ` Javier Gonzalez
@ 2024-10-14 11:50                                                               ` Christoph Hellwig
  2024-10-15  3:07                                                                 ` Javier Gonzalez
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-14 11:50 UTC (permalink / raw)
  To: Javier Gonzalez
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Martin K. Petersen,
	Kanchan Joshi, [email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected]

On Mon, Oct 14, 2024 at 09:08:24AM +0000, Javier Gonzalez wrote:

[can you fix your mailer please, no full quotes, and especially not
quotes of the mail headers]

> > What do you gain from that?  NVMe does not understand data temperatures,
> > so why make up that claim?  
> 
> I expressed this a couple of times in this thread. It is no problem to
> map temperatures to a protocol that does not understand the semantics.

And I've agreed every time with you.  But the important point is that we
must not do it in the driver where all context is lost.

> > Especially as it directly undermindes any file system work to actually make use of it.
> 
> I do not think it does. If a FS wants to use the temperatures, then they
> would be able to leverage FDP besides SCSI.

What do you mean by that?  This is a bit too much whitepaper vocabulary.

We have code in XFS that can make use of the temperature hint.  But to
make them work it actually needs to do real stream separation on the
device.  I.e. the file system consumes the temperature hints.


> And if we come up with a better interface later on, we can make the changes then.
> I really do not see the issue. If we were adding a temperature abstraction now, I would agree with
> You that we would need to cover the use-case you mention for FSs from the beginning, but this
> Is already here. Seems like a fair compromise to support current users.

Again, I think the temperature hints at the syscall level aren't all
bad.  There are definitely a few things I'd like to do better in hindsight,
but that's not the point.  The problem is trying to turn them into
stream separation all the way down in the driver, which is fundamentally
broken.

>   - How do we convince VFS folks to give us more space for hints at this point?

What space from VFS folks do you need for hints?  And why does it
matter?



* RE: [PATCH v7 0/3] FDP and per-io hints
  2024-10-14 11:50                                                               ` Christoph Hellwig
@ 2024-10-15  3:07                                                                 ` Javier Gonzalez
  2024-10-15  5:30                                                                   ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Javier Gonzalez @ 2024-10-15  3:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Martin K. Petersen, Kanchan Joshi,
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected]

> On Mon, Oct 14, 2024 at 09:08:24AM +0000, Javier Gonzalez wrote:
> > > Especially as it directly undermines any file system work to actually make use
> of it.
> >
> > I do not think it does. If a FS wants to use the temperatures, then they
> > would be able to leverage FDP besides SCSI.
> 
> What do you mean by that?  This is a bit too much whitepaper vocabulary.
> 
> We have code in XFS that can make use of the temperature hint.  But to
> make them work it actually needs to do real stream separation on the
> device.  I.e. the file system consumes the temperature hints.

The device can guarantee the stream separation without knowing the temperature.

> > And if we come up with a better interface later on, we can make the changes
> then.
> > I really do not see the issue. If we were adding a temperature abstraction now, I
> would agree with
> > You that we would need to cover the use-case you mention for FSs from the
> beginning, but this
> > Is already here. Seems like a fair compromise to support current users.
> 
> Again, I think the temperature hints at the syscall level aren't all
> bad.  There are definitely a few things I'd like to do better in hindsight,
> but that's not the point.  The problem is trying to turn them into
> stream separation all the way down in the driver, which is fundamentally
> broken.
> 
> >   - How do we convince VFS folks to give us more space for hints at this point?
> 
> What space from VFS folks do you need for hints?  And why does it
> matter?

We need space in the inode to store the hint ID.

Look, this feels like going in circles. All this gaslighting is what makes it difficult to 
push patches when you just do not like the feature. It is the 3rd time I have proposed
a way forward, and you simply cannot provide any specific technical feedback - in the
previous email I posted several questions about the interface you seem to be talking
about, and you explicitly ignored them.




* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-15  3:07                                                                 ` Javier Gonzalez
@ 2024-10-15  5:30                                                                   ` Christoph Hellwig
  0 siblings, 0 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-15  5:30 UTC (permalink / raw)
  To: Javier Gonzalez
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Martin K. Petersen,
	Kanchan Joshi, [email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected]

On Tue, Oct 15, 2024 at 03:07:57AM +0000, Javier Gonzalez wrote:
> > On Mon, Oct 14, 2024 at 09:08:24AM +0000, Javier Gonzalez wrote:
> > > > Especially as it directly undermines any file system work to actually make use
> > of it.
> > >
> > > I do not think it does. If a FS wants to use the temperatures, then they
> > > would be able to leverage FDP besides SCSI.
> > 
> > > What do you mean by that?  This is a bit too much whitepaper vocabulary.
> > 
> > We have code in XFS that can make use of the temperature hint.  But to
> > make them work it actually needs to do real stream separation on the
> > device.  I.e. the file system consumes the temperature hints.
> 
> The device can guarantee the stream separation without knowing the temperature.

Of course.  But that's entirely not the point.

> > What space from VFS folks do you need for hints?  And why does it
> > matter?
> 
> We need space in the inode to store the hint ID.
> 
> Look, this feels like going in circles. All this gaslighting is what makes it difficult to 

I'm so fucking tired of this.  You are not even listening to my arguments
at all, even when every detail is explained to you again and again.  And then
you accuse me of "gaslighting".

> push patches when you just do not like the feature. It is the 3rd time I propose you 
> a way forward and you simply cannot provide any specific technical feedback - in the 
> past email I posted several questions about the interface you seem to be talking 
> about and you explicitly omit that.

???

You are throwing random buzzwords and not even replying to all the
low level technical points.  If you think I missed something please
just ask again instead of this crap.


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-09-30 18:13 ` [PATCH v7 0/3] FDP and per-io hints Kanchan Joshi
                     ` (4 preceding siblings ...)
  2024-10-03  0:20   ` Bart Van Assche
@ 2024-10-15  5:50   ` Christoph Hellwig
  2024-10-15 15:09     ` Keith Busch
  2024-10-17 14:35     ` Kanchan Joshi
  5 siblings, 2 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-15  5:50 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

As the current discussion is probably too tiring for anyone who is not a
professional mud fighter I'll summarize my objections and comments here
once again in a way that even aspiring middle managers should be able
to understand.

1) While the current per-file temperature hints interface is not perfect,
it is okay and makes sense to reuse until we need something more fancy.
We make good use of it in f2fs and the upcoming zoned xfs code to help
with data placement, and have numbers to show that it helps.

2) A per-I/O interface to set these temperature hints conflicts badly
with how placement works in file systems.  If we have an urgent need
for it on the block device, it needs to be opt-in via the file operations
so it can be enabled on block devices but not on file systems by
default.  If a given file system finds a way to implement it, it can
still opt into implementing it of course.

3) Mapping from temperature hints to separate write streams needs to
happen above the block layer, because file systems need to be in
control of it to do intelligent placement.  That means if you want to
map from temperature hints to stream separation it needs to be
implemented at the file operation layer, not in the device driver.
The mapping implemented in this series is probably only useful for
block devices.  Maybe if dumb file systems want to adopt it, it could
be split into library code for reuse, but as usual that's probably
best done only when actually needed.

4) To support this, the block layer, that is bios and requests, needs
to support a notion of stream separation.  Kanchan's previous series
had most of the bits for that; it just needs to be iterated on.

All of this could probably have been done easily in the time spent on
this discussion.


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-15  5:50   ` Christoph Hellwig
@ 2024-10-15 15:09     ` Keith Busch
  2024-10-15 15:22       ` Christoph Hellwig
  2024-10-17 14:35     ` Kanchan Joshi
  1 sibling, 1 reply; 86+ messages in thread
From: Keith Busch @ 2024-10-15 15:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On Tue, Oct 15, 2024 at 07:50:06AM +0200, Christoph Hellwig wrote:
> 1) While the current per-file temperature hints interface is not perfect
> it is okay and makes sense to reuse until we need something more fancy.
> We make good use of it in f2fs and the upcoming zoned xfs code to help
> with data placement and have numbers to show that it helps.

So we're okay to proceed with patch 1?
 
> 2) A per-I/O interface to set these temperature hints conflicts badly
> with how placement works in file systems.  If we have an urgent need
> for it on the block device it needs to be opt-in by the file operations
> so it can be enabled on block device, but not on file systems by
> default.  This way you can implement it for block device, but not
> provide it on file systems by default.  If a given file system finds
> a way to implement it it can still opt into implementing it of course.

If we add a new fop_flag that only block fops enables, then it's okay?

> 3) Mapping from temperature hints to separate write streams needs to
> happen above the block layer, because file systems need to be in
> control of it to do intelligent placement.  That means if you want to
> map from temperature hints to stream separation it needs to be
> implemented at the file operation layer, not in the device driver.
> The mapping implemented in this series is probably only useful for
> block devices.  Maybe if dumb file systems want to adopt it, it could
> be split into library code for reuse, but as usual that's probably
> best done only when actually needed.

IMO, I don't even think the io_uring per-io hint needs to be limited to
the fcntl lifetime values. It could just be a u16 value opaque to the
block layer that just gets forwarded to the device.
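
A minimal sketch of that opaque-forwarding idea (everything here is hypothetical and illustrative; the posted series may map or clamp differently):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the opaque-u16 idea: the block layer does not
 * interpret the value; the driver folds it into however many placement
 * handles the device actually exposes. Nothing here is taken from the
 * posted patches. */
static uint16_t hint_to_placement_id(uint16_t io_hint, uint16_t nr_handles)
{
	if (io_hint == 0 || nr_handles == 0)
		return 0;			/* default placement */
	return (io_hint - 1) % nr_handles;	/* fold into supported range */
}
```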

> 4) To support this the block layer, that is bios and requests need
> to support a notion of stream separation.   Kanchan's previous series
> had most of the bits for that, it just needs to be iterated on.
> 
> All of this could probably have been done easily in the time spent on
> this discussion.


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-15 15:09     ` Keith Busch
@ 2024-10-15 15:22       ` Christoph Hellwig
  0 siblings, 0 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-15 15:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Kanchan Joshi, axboe, hare, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Tue, Oct 15, 2024 at 09:09:20AM -0600, Keith Busch wrote:
> On Tue, Oct 15, 2024 at 07:50:06AM +0200, Christoph Hellwig wrote:
> > 1) While the current per-file temperature hints interface is not perfect
> > it is okay and makes sense to reuse until we need something more fancy.
> > We make good use of it in f2fs and the upcoming zoned xfs code to help
> > with data placement and have numbers to show that it helps.
> 
> So we're okay to proceed with patch 1?

No, see points 3 and 4 below for why.  We'll need something like the
interface suggested by me in point 4 and by you in reply to point 3
in the block layer, and then block/fops.c can implement the mapping on
top of that for drivers supporting it.

>  
> > 2) A per-I/O interface to set these temperature hints conflicts badly
> > with how placement works in file systems.  If we have an urgent need
> > for it on the block device it needs to be opt-in by the file operations
> > so it can be enabled on block device, but not on file systems by
> > default.  This way you can implement it for block device, but not
> > provide it on file systems by default.  If a given file system finds
> > a way to implement it it can still opt into implementing it of course.
> 
> If we add a new fop_flag that only block fops enables, then it's okay?

The flag is just one part of it.  Of course it needs to be discoverable
from userspace in one way or another, and the marshalling of the flag
needs to be controlled by the file system / fops instance.

> > 3) Mapping from temperature hints to separate write streams needs to
> > happen above the block layer, because file systems need to be in
> > control of it to do intelligent placement.  That means if you want to
> > map from temperature hints to stream separation it needs to be
> > implemented at the file operation layer, not in the device driver.
> > The mapping implemented in this series is probably only useful for
> > block devices.  Maybe if dumb file systems want to adopt it, it could
> > be split into library code for reuse, but as usual that's probably
> > best done only when actually needed.
> 
> IMO, I don't even think the io_uring per-io hint needs to be limited to
> the fcntl lifetime values. It could just be a u16 value opaque to the
> block layer that just gets forwarded to the device.

Well, that's what I've been arguing for all the time, and what Kanchan's
previous series was working towards.  It's not quite that trivial, as
we need a bit more than just the stream, e.g. a way to discover how many
of them exist.

> > 4) To support this the block layer, that is bios and requests need
> > to support a notion of stream separation.   Kanchan's previous series
> > had most of the bits for that, it just needs to be iterated on.
> > 
> > All of this could probably have been done easily in the time spent on
> > this discussion.
---end quoted text---


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-15  5:50   ` Christoph Hellwig
  2024-10-15 15:09     ` Keith Busch
@ 2024-10-17 14:35     ` Kanchan Joshi
  2024-10-17 15:23       ` Christoph Hellwig
  1 sibling, 1 reply; 86+ messages in thread
From: Kanchan Joshi @ 2024-10-17 14:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, hare, sagi, martin.petersen, brauner, viro, jack,
	jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz

Seems per-I/O hints are not getting the love they deserve.
Apart from the block device, the use case is when all I/Os of a VM (or 
container) are to be grouped together or placed differently.

Per-IO hints are fine-grained (enough for userspace to build 
per-process/vm/file/whatever etc.) and add the flexibility we have 
lacked so far.
As for conflict, I doubt that one exists. Please see below:

> 2) A per-I/O interface to set these temperature hints conflicts badly
> with how placement works in file systems.  If we have an urgent need
> for it on the block device it needs to be opt-in by the file operations
> so it can be enabled on block device, but not on file systems by
> default.  This way you can implement it for block device, but not
> provide it on file systems by default.  If a given file system finds
> a way to implement it it can still opt into implementing it of course.

Why do you see this as something that is so different across filesystems 
that they would need to "find a way to implement"?
Both per-file and per-io hints are supplied by userspace. Inode and 
kiocb only happen to be the means to receive the hint information.
FS is free to use this information (iff it wants) or simply forward this 
down.

Per-file hint just gets stored (within inode) without individual FS 
involvement. Per-io hint follows the same model (i.e., it is set by 
upper layer like io_uring/aio) and uses kiocb to store the hint. It does 
not alter the stored inode hint value!

After patch #3, filesystems have both per-file and per-io information 
available. And as before, they can use that hint info (from kiocb or 
inode) and/or simply forward.

The generic code (like fs/direct-io.c, fs/iomap/direct-io.c etc.,) 
already forwards the incoming hints, without any intelligence.
And we need just that because with user-passed hints, the onus of 
intelligence is on the userspace. This is how hints work on 
xfs/ext4/btrfs files at this point.

The owner of the file (filesystem, block, whatever) can still use the 
incoming hints to do any custom/extra feature. Either from inode or from 
kiocb depending on what information (per-file or per-io hint) it finds 
more useful for that custom feature. For example, F2FS is using 
per-inode information to do something custom and that part has not been 
changed by these patches.

Overall, I do not see the conflict. It's all user-driven. No?

For the currently uncommon case when the FS decides to generate its own 
hints (as opposed to consuming/forwarding userspace hints), it can 
directly put the hint value in the bio.
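
The forwarding model above can be sketched with simplified stand-in types (illustrative only; these are toy structs, not the kernel's inode/kiocb/bio):

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins to illustrate the flow described above: generic code
 * forwards the per-io hint if set, otherwise the per-inode hint, and
 * never alters the stored inode value. Not kernel code. */
struct toy_inode { uint8_t write_hint; };
struct toy_kiocb { const struct toy_inode *inode; uint8_t write_hint; };
struct toy_bio   { uint8_t write_hint; };

static void fill_bio_hint(const struct toy_kiocb *iocb, struct toy_bio *bio)
{
	bio->write_hint = iocb->write_hint ? iocb->write_hint
					   : iocb->inode->write_hint;
}
```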


* Re: [PATCH v7 3/3] io_uring: enable per-io hinting capability
  2024-10-02 14:26       ` Pavel Begunkov
@ 2024-10-17 14:58         ` Kanchan Joshi
  0 siblings, 0 replies; 86+ messages in thread
From: Kanchan Joshi @ 2024-10-17 14:58 UTC (permalink / raw)
  To: Pavel Begunkov, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Nitesh Shetty

On 10/2/2024 7:56 PM, Pavel Begunkov wrote:
> On 9/30/24 19:13, Kanchan Joshi wrote:
>> With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and
>> all the subsequent writes on the file pass that hint value down.
>> This can be limiting for large files (and for block device) as all the
>> writes can be tagged with only one lifetime hint value.
>> Concurrent writes (with different hint values) are hard to manage.
>> Per-IO hinting solves that problem.
>>
>> Allow userspace to pass additional metadata in the SQE.
>> The type of passed metadata is expressed by a new field
>>
>>     __u16 meta_type;
> 
> The new layout looks nicer, but let me elaborate on the previous
> comment. I don't believe we should be restricting to only one
> attribute per IO. What if someone wants to pass a lifetime hint
> together with integrity information?

For exactly that reason I made meta_type accept multiple bit values. 
META_TYPE_LIFETIME_HINT and a new META_TYPE_INTEGRITY can coexist. 
Overall, 16 meta types can coexist.

> Instead, we might need something more extensible like an ability
> to pass a list / array of typed attributes / meta information / hints
> etc. An example from networking I gave last time was control messages,
> i.e. cmsg. In a basic oversimplified form the API from the user
> perspective could look like:
> 
> struct meta_attr {
>      u16 type;
>      u64 data;
> };
> 
> struct meta_attr attr[] = {{HINT, hint_value}, {INTEGRITY, ptr}};
> sqe->meta_attrs = attr;
> sqe->meta_nr = 2;


I did not want to add a pointer (and incur the copy_from_user cost) for 
integrity. Currently integrity uses space in the second SQE, which seems 
fine [*].
Down the line, if meta-types increase and we are on the verge of running 
out of SQE space, we can resort to adding an indirect reference.

[*] 
https://lore.kernel.org/linux-nvme/[email protected]/

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-17 14:35     ` Kanchan Joshi
@ 2024-10-17 15:23       ` Christoph Hellwig
  2024-10-17 15:44         ` Keith Busch
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-17 15:23 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Christoph Hellwig, axboe, kbusch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g, javier.gonz

On Thu, Oct 17, 2024 at 08:05:38PM +0530, Kanchan Joshi wrote:
> Seems per-I/O hints are not getting the love they deserve.
> Apart from the block device, the usecase is when all I/Os of VM (or 
> container) are to be grouped together or placed differently.

But that assumes the file system could actually support it.  Which
is hard unless you assume the file system is simply a passthrough
entity, which will not give you great results.

> > 2) A per-I/O interface to set these temperature hint conflicts badly
> > with how placement works in file systems.  If we have an urgent need
> > for it on the block device it needs to be opt-in by the file operations
> > so it can be enabled on block device, but not on file systems by
> > default.  This way you can implement it for block device, but not
> > provide it on file systems by default.  If a given file system finds
> > a way to implement it it can still opt into implementing it of course.
> 
> Why do you see this as something that is so different across filesystems 
> that they would need to "find a way to implement"?

If you want to do useful stream separation you need to write data
sequentially into the stream.  Now with streams or FDP that does not
actually imply sequentially in LBA space, but if you want the file
system to not actually deal with fragmentation from hell, and be
easily track what is grouped together you really want it sequentially
in the LBA space as well.  In other words, any kind of write placement
needs to be intimately tied to the file system block allocator.

> Both per-file and per-io hints are supplied by userspace. Inode and 
> kiocb only happen to be the means to receive the hint information.
> FS is free to use this information (iff it wants) or simply forward this 
> down.

As mentioned above just passing it down is not actually very useful.
It might give you nice benchmark numbers when you basically reimplement
space management in userspace on a fully preallocated file, but for that
you're better off just using the block device.  If you actually want
to treat the files as files you need full file system involvement.

> Per-file hint just gets stored (within inode) without individual FS 
> involvement. Per-io hint follows the same model (i.e., it is set by 
> upper layer like io_uring/aio) and uses kiocb to store the hint. It does 
> not alter the stored inode hint value!

Yes, and now you'll get complaints that the file system ignores it
when it can't properly support it.  This is why we need a per-fop
opt-in.

> The generic code (like fs/direct-io.c, fs/iomap/direct-io.c etc.,) 
> already forwards the incoming hints, without any intelligence.

Yes, and that is a problem.  We stopped doing that, but Samsung sneaked
some of this back in recently as I noticed.

> Overall, I do not see the conflict. It's all user-driven. No?

I have the gut feeling that you've just run benchmarks on image files
emulating block devices and not actually tried real file system workloads
based on this unfortunately.


* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-17 15:23       ` Christoph Hellwig
@ 2024-10-17 15:44         ` Keith Busch
  2024-10-17 15:46           ` Christoph Hellwig
  2024-10-17 16:15           ` Bart Van Assche
  0 siblings, 2 replies; 86+ messages in thread
From: Keith Busch @ 2024-10-17 15:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On Thu, Oct 17, 2024 at 05:23:37PM +0200, Christoph Hellwig wrote:
> If you want to do useful stream separation you need to write data
> sequentially into the stream.  Now with streams or FDP that does not
> actually imply sequentially in LBA space, but if you want the file
> system to not actually deal with fragmentation from hell, and to
> easily track what is grouped together, you really want it sequentially
> in the LBA space as well.  In other words, any kind of write placement
> needs to be intimately tied to the file system block allocator.

I'm replying just to make sure I understand what you're saying:

If we send per IO hints on a file, we could have interleaved hot and
cold pages at various offsets of that file, so the filesystem needs an
efficient way to allocate extents and track these so that it doesn't
interleave these in LBA space. I think that makes sense.

We can add a fop_flags and block/fops.c can be the first one to turn it
on since that LBA access is entirely user driven.

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-17 15:44         ` Keith Busch
@ 2024-10-17 15:46           ` Christoph Hellwig
  2024-10-17 16:06             ` Keith Busch
  2024-10-17 16:15           ` Bart Van Assche
  1 sibling, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-17 15:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Kanchan Joshi, axboe, hare, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
	linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Thu, Oct 17, 2024 at 09:44:39AM -0600, Keith Busch wrote:
> I'm replying just to make sure I understand what you're saying:
> 
> If we send per IO hints on a file, we could have interleaved hot and
> cold pages at various offsets of that file, so the filesystem needs an
> efficient way to allocate extents and track these so that it doesn't
> interleave these in LBA space. I think that makes sense.

That's a little simplified, but yes.

> We can add a fop_flags and block/fops.c can be the first one to turn it
> on since that LBA access is entirely user driven.

Yes, that's my main request on the per-I/O hint interface.

Now we just need to not dumb down the bio level interface to five
temperature levels, and instead expose the different write streams, and
we can all have a happy Kumbaya.

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-17 15:46           ` Christoph Hellwig
@ 2024-10-17 16:06             ` Keith Busch
  0 siblings, 0 replies; 86+ messages in thread
From: Keith Busch @ 2024-10-17 16:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

On Thu, Oct 17, 2024 at 05:46:49PM +0200, Christoph Hellwig wrote:
> On Thu, Oct 17, 2024 at 09:44:39AM -0600, Keith Busch wrote:
> 
> > We can add a fop_flags and block/fops.c can be the first one to turn it
> > on since that LBA access is entirely user driven.
> 
> Yes, that's my main request on the per-I/O hint interface.
> 
> Now we just need to not dumb down the bio level interface to five
> temperature levels, and instead expose the different write streams, and
> we can all have a happy Kumbaya.

I'm sending a revised version of this patch set that attempts to address
these issues.

The io_uring hint is weird to me, so that part is simplified just to get
the point across.

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-17 15:44         ` Keith Busch
  2024-10-17 15:46           ` Christoph Hellwig
@ 2024-10-17 16:15           ` Bart Van Assche
  2024-10-17 16:23             ` Keith Busch
  1 sibling, 1 reply; 86+ messages in thread
From: Bart Van Assche @ 2024-10-17 16:15 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz

On 10/17/24 8:44 AM, Keith Busch wrote:
> On Thu, Oct 17, 2024 at 05:23:37PM +0200, Christoph Hellwig wrote:
>> If you want to do useful stream separation you need to write data
>> sequentially into the stream.  Now with streams or FDP that does not
>> actually imply sequentially in LBA space, but if you want the file
>> system to not actually deal with fragmentation from hell, and to
>> easily track what is grouped together, you really want it sequentially
>> in the LBA space as well.  In other words, any kind of write placement
>> needs to be intimately tied to the file system block allocator.
> 
> I'm replying just to make sure I understand what you're saying:
> 
> If we send per IO hints on a file, we could have interleaved hot and
> cold pages at various offsets of that file, so the filesystem needs an
> efficient way to allocate extents and track these so that it doesn't
> interleave these in LBA space. I think that makes sense.
> 
> We can add a fop_flags and block/fops.c can be the first one to turn it
> on since that LBA access is entirely user driven.

Does anyone care about buffered I/O to block devices? When using
buffered I/O, the write_hint information from the inode is used and the 
per I/O write_hint information is ignored.

Thanks,

Bart.

* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-17 16:15           ` Bart Van Assche
@ 2024-10-17 16:23             ` Keith Busch
  0 siblings, 0 replies; 86+ messages in thread
From: Keith Busch @ 2024-10-17 16:23 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Kanchan Joshi, axboe, hare, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
	linux-aio, gost.dev, vishak.g, javier.gonz

On Thu, Oct 17, 2024 at 09:15:21AM -0700, Bart Van Assche wrote:
> On 10/17/24 8:44 AM, Keith Busch wrote:
> > On Thu, Oct 17, 2024 at 05:23:37PM +0200, Christoph Hellwig wrote:
> > > If you want to do useful stream separation you need to write data
> > > sequentially into the stream.  Now with streams or FDP that does not
> > > actually imply sequentially in LBA space, but if you want the file
> > > system to not actually deal with fragmentation from hell, and to
> > > easily track what is grouped together, you really want it sequentially
> > > in the LBA space as well.  In other words, any kind of write placement
> > > needs to be intimately tied to the file system block allocator.
> > 
> > I'm replying just to make sure I understand what you're saying:
> > 
> > If we send per IO hints on a file, we could have interleaved hot and
> > cold pages at various offsets of that file, so the filesystem needs an
> > efficient way to allocate extents and track these so that it doesn't
> > interleave these in LBA space. I think that makes sense.
> > 
> > We can add a fop_flags and block/fops.c can be the first one to turn it
> > on since that LBA access is entirely user driven.
> 
> Does anyone care about buffered I/O to block devices? When using
> buffered I/O, the write_hint information from the inode is used and the per
> I/O write_hint information is ignored.

I'm pretty sure there are applications that use buffered IO on raw block
(ex: postgresql), but it's a moot point: the block file_operations that
provide the fops_flags also provide the callbacks for O_DIRECT, which is
where this matters.

We can't really use per-io write_hints on buffered-io. At least not yet,
and maybe never. I'm not sure if it makes sense for raw block because
the page writes won't necessarily match writes to storage.

end of thread, other threads:[~2024-10-17 16:24 UTC | newest]

Thread overview: 86+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20240930182052epcas5p37edefa7556b87c3fbb543275756ac736@epcas5p3.samsung.com>
2024-09-30 18:13 ` [PATCH v7 0/3] FDP and per-io hints Kanchan Joshi
     [not found]   ` <CGME20240930182056epcas5p33f823c00caadf9388b509bafcad86f3d@epcas5p3.samsung.com>
2024-09-30 18:13     ` [PATCH v7 1/3] nvme: enable FDP support Kanchan Joshi
2024-10-02 18:37       ` Bart Van Assche
2024-10-03 12:55         ` Christoph Hellwig
     [not found]   ` <CGME20240930182100epcas5p31a010c225f3c76aa4dc54fced32abd2a@epcas5p3.samsung.com>
2024-09-30 18:13     ` [PATCH v7 2/3] block, fs: restore kiocb based write hint processing Kanchan Joshi
     [not found]   ` <CGME20240930182103epcas5p4c9e91ca3cdf20e900b1425ae45fef81d@epcas5p4.samsung.com>
2024-09-30 18:13     ` [PATCH v7 3/3] io_uring: enable per-io hinting capability Kanchan Joshi
2024-10-02 14:26       ` Pavel Begunkov
2024-10-17 14:58         ` Kanchan Joshi
2024-10-02 18:29       ` Bart Van Assche
2024-10-01  9:20   ` [PATCH v7 0/3] FDP and per-io hints Christoph Hellwig
2024-10-01 15:58     ` James R. Bergsten
2024-10-01 16:18     ` Jens Axboe
2024-10-02  7:51       ` Christoph Hellwig
2024-10-02 15:03         ` Jens Axboe
2024-10-02 15:13           ` Christoph Hellwig
2024-10-02 15:17             ` Keith Busch
2024-10-02 15:19               ` Christoph Hellwig
2024-10-02 15:33                 ` Keith Busch
2024-10-03 12:51                   ` Christoph Hellwig
2024-10-02 15:47                 ` Martin K. Petersen
2024-10-02 18:34                   ` Bart Van Assche
2024-10-03 12:55                     ` Christoph Hellwig
2024-10-03 21:48                       ` Keith Busch
2024-10-03 22:00                         ` Bart Van Assche
2024-10-03 22:12                           ` Jens Axboe
2024-10-03 22:17                           ` Keith Busch
2024-10-04  6:21                       ` Javier González
2024-10-04  6:24                         ` Christoph Hellwig
2024-10-04  6:59                           ` Javier González
2024-10-04 12:32                             ` Christoph Hellwig
2024-10-07 11:29                               ` Javier González
2024-10-08 12:27                                 ` Christoph Hellwig
2024-10-03 12:54                   ` Christoph Hellwig
2024-10-03 22:14                     ` Jens Axboe
2024-10-04  5:31                       ` Christoph Hellwig
2024-10-04  6:18                         ` Javier González
2024-10-04  6:27                           ` Christoph Hellwig
2024-10-04  6:52                             ` Javier González
2024-10-04 12:30                               ` Christoph Hellwig
2024-10-07 10:10                                 ` Javier González
2024-10-08 10:06                                   ` Hans Holmberg
2024-10-09 14:36                                     ` Javier Gonzalez
2024-10-10  6:40                                       ` Hans Holmberg
2024-10-10  7:13                                         ` Javier Gonzalez
2024-10-10  9:20                                           ` Christoph Hellwig
2024-10-10 12:22                                             ` Javier Gonzalez
2024-10-11  8:56                                               ` Christoph Hellwig
2024-10-11 12:21                                                 ` Javier Gonzalez
2024-10-11 16:59                                                   ` Keith Busch
2024-10-10 10:46                                           ` Hans Holmberg
     [not found]                                             ` <CGME20241010122734eucas1p1e20a5263a4d69db81b50b8b03608fad1@eucas1p1.samsung.com>
2024-10-10 12:27                                               ` Javier Gonzalez
2024-10-11  8:59                                                 ` Christoph Hellwig
2024-10-08 12:25                                   ` Christoph Hellwig
2024-10-08 14:44                                     ` Keith Busch
2024-10-09  9:28                                       ` Christoph Hellwig
2024-10-09 15:06                                         ` Keith Busch
     [not found]                                           ` <CGME20241010070738eucas1p2057209e5f669f37ca586ad4a619289ed@eucas1p2.samsung.com>
2024-10-10  7:07                                             ` Javier González
2024-10-10  9:13                                               ` Christoph Hellwig
2024-10-10 11:59                                                 ` Javier González
2024-10-11  9:02                                                   ` Christoph Hellwig
2024-10-11 17:08                                                     ` Jens Axboe
2024-10-14  6:21                                                       ` Christoph Hellwig
2024-10-14  7:02                                                         ` Javier Gonzalez
2024-10-14  7:47                                                           ` Christoph Hellwig
2024-10-14  9:08                                                             ` Javier Gonzalez
2024-10-14 11:50                                                               ` Christoph Hellwig
2024-10-15  3:07                                                                 ` Javier Gonzalez
2024-10-15  5:30                                                                   ` Christoph Hellwig
2024-10-10  9:10                                           ` Christoph Hellwig
2024-10-09 16:28                                         ` Nitesh Shetty
2024-10-02 15:22             ` Jens Axboe
2024-10-01 16:23     ` Keith Busch
2024-10-02  7:49       ` Christoph Hellwig
2024-10-02 14:56         ` Keith Busch
2024-10-02 15:00           ` Jens Axboe
2024-10-03  0:20   ` Bart Van Assche
2024-10-15  5:50   ` Christoph Hellwig
2024-10-15 15:09     ` Keith Busch
2024-10-15 15:22       ` Christoph Hellwig
2024-10-17 14:35     ` Kanchan Joshi
2024-10-17 15:23       ` Christoph Hellwig
2024-10-17 15:44         ` Keith Busch
2024-10-17 15:46           ` Christoph Hellwig
2024-10-17 16:06             ` Keith Busch
2024-10-17 16:15           ` Bart Van Assche
2024-10-17 16:23             ` Keith Busch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox