public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCH v6 0/3] per-io hints and FDP
       [not found] <CGME20240924093247epcas5p4807fe5e531a2b7b2d6961d23bc989c80@epcas5p4.samsung.com>
@ 2024-09-24  9:24 ` Kanchan Joshi
       [not found]   ` <CGME20240924093250epcas5p39259624b9ebabdef15081ea9bd663d41@epcas5p3.samsung.com>
                     ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Kanchan Joshi @ 2024-09-24  9:24 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, brauner, viro, jack,
	jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi

Another spin to incorporate the feedback from LPC and previous
iteration. The series adds two capabilities:
- FDP support at NVMe level (patch #1)
- Per-io hinting via io_uring (patch #3)
Patch #2 is needed to do per-io hints.

The motivation and interface details are present in the commit
descriptions.

Testing:
Done with fcntl and liburing based custom applications.
On raw block device, ext4, xfs, btrfs and F2FS.
Checked that no regression occurs for application that use per-inode
hints.
Checked that per-io hints, when passed, take the precedence over per-inode
hints.

Changes since v5:
- Drop placement hints
- Add per-io hint interface

Changes since v4:
- Retain the size/type checking on the enum (Bart)
- Use the name "*_lifetime_hint" rather than "*_life_hint" (Bart)

Changes since v3:
- 4 new patches to introduce placement hints
- Make nvme patch use the placement hints rather than lifetime hints

Changes since v2:
- Base it on nvme-6.11 and resolve a merge conflict

Changes since v1:
- Reduce the fetched plids from 128 to 6 (Keith)
- Use struct_size for a calculation (Keith)
- Handle robot/sparse warning

Kanchan Joshi (3):
  nvme: enable FDP support
  block, fs: restore kiocb based write hint processing
  io_uring: enable per-io hinting capability

 block/fops.c                  |  6 +--
 drivers/nvme/host/core.c      | 70 +++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h      |  4 ++
 fs/aio.c                      |  1 +
 fs/cachefiles/io.c            |  1 +
 fs/direct-io.c                |  2 +-
 fs/fcntl.c                    | 22 -----------
 fs/iomap/direct-io.c          |  2 +-
 include/linux/fs.h            |  8 ++++
 include/linux/nvme.h          | 19 ++++++++++
 include/linux/rw_hint.h       | 24 ++++++++++++
 include/uapi/linux/io_uring.h | 10 +++++
 io_uring/rw.c                 | 20 ++++++++++
 13 files changed, 162 insertions(+), 27 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v6 1/3] nvme: enable FDP support
       [not found]   ` <CGME20240924093250epcas5p39259624b9ebabdef15081ea9bd663d41@epcas5p3.samsung.com>
@ 2024-09-24  9:24     ` Kanchan Joshi
  2024-09-24  9:39       ` Christoph Hellwig
  2024-09-25  5:48       ` Hannes Reinecke
  0 siblings, 2 replies; 13+ messages in thread
From: Kanchan Joshi @ 2024-09-24  9:24 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, brauner, viro, jack,
	jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi, Hui Qi,
	Nitesh Shetty

Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
to control the placement of logical blocks so as to reduce the SSD WAF.

Userspace can send the data lifetime information using the write hints.
The SCSI driver (sd) can already pass this information to the SCSI
devices. This patch does the same for NVMe.

Fetch the placement-identifiers if the device supports FDP.
The incoming write-hint is mapped to a placement-identifier, which in
turn is set in the DSPEC field of the write command.

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Hui Qi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
---
 drivers/nvme/host/core.c | 70 ++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  4 +++
 include/linux/nvme.h     | 19 +++++++++++
 3 files changed, 93 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ca9959a8fb9e..7fb3ed4fe9c0 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -44,6 +44,20 @@ struct nvme_ns_info {
 	bool is_removed;
 };
 
+struct nvme_fdp_ruh_status_desc {
+	u16 pid;
+	u16 ruhid;
+	u32 earutr;
+	u64 ruamw;
+	u8  rsvd16[16];
+};
+
+struct nvme_fdp_ruh_status {
+	u8  rsvd0[14];
+	__le16 nruhsd;
+	struct nvme_fdp_ruh_status_desc ruhsd[];
+};
+
 unsigned int admin_timeout = 60;
 module_param(admin_timeout, uint, 0644);
 MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
@@ -959,6 +973,19 @@ static bool nvme_valid_atomic_write(struct request *req)
 	return true;
 }
 
+static inline void nvme_assign_placement_id(struct nvme_ns *ns,
+					struct request *req,
+					struct nvme_command *cmd)
+{
+	enum rw_hint h = req->write_hint;
+
+	if (h >= ns->head->nr_plids)
+		return;
+
+	cmd->rw.control |= cpu_to_le16(NVME_RW_DTYPE_DPLCMT);
+	cmd->rw.dsmgmt |= cpu_to_le32(ns->head->plids[h] << 16);
+}
+
 static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 		struct request *req, struct nvme_command *cmnd,
 		enum nvme_opcode op)
@@ -1078,6 +1105,8 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
 		break;
 	case REQ_OP_WRITE:
 		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
+		if (!ret && ns->head->nr_plids)
+			nvme_assign_placement_id(ns, req, cmd);
 		break;
 	case REQ_OP_ZONE_APPEND:
 		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
@@ -2114,6 +2143,40 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
 	return ret;
 }
 
+static int nvme_fetch_fdp_plids(struct nvme_ns *ns, u32 nsid)
+{
+	struct nvme_command c = {};
+	struct nvme_fdp_ruh_status *ruhs;
+	struct nvme_fdp_ruh_status_desc *ruhsd;
+	int size, ret, i;
+
+	size = struct_size(ruhs, ruhsd, NVME_MAX_PLIDS);
+	ruhs = kzalloc(size, GFP_KERNEL);
+	if (!ruhs)
+		return -ENOMEM;
+
+	c.imr.opcode = nvme_cmd_io_mgmt_recv;
+	c.imr.nsid = cpu_to_le32(nsid);
+	c.imr.mo = 0x1;
+	c.imr.numd =  cpu_to_le32((size >> 2) - 1);
+
+	ret = nvme_submit_sync_cmd(ns->queue, &c, ruhs, size);
+	if (ret)
+		goto out;
+
+	ns->head->nr_plids = le16_to_cpu(ruhs->nruhsd);
+	ns->head->nr_plids =
+		min_t(u16, ns->head->nr_plids, NVME_MAX_PLIDS);
+
+	for (i = 0; i < ns->head->nr_plids; i++) {
+		ruhsd = &ruhs->ruhsd[i];
+		ns->head->plids[i] = le16_to_cpu(ruhsd->pid);
+	}
+out:
+	kfree(ruhs);
+	return ret;
+}
+
 static int nvme_update_ns_info_block(struct nvme_ns *ns,
 		struct nvme_ns_info *info)
 {
@@ -2205,6 +2268,13 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 		if (ret && !nvme_first_scan(ns->disk))
 			goto out;
 	}
+	if (ns->ctrl->ctratt & NVME_CTRL_ATTR_FDPS) {
+		ret = nvme_fetch_fdp_plids(ns, info->nsid);
+		if (ret)
+			dev_warn(ns->ctrl->device,
+				"FDP failure status:0x%x\n", ret);
+	}
+
 
 	ret = 0;
 out:
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 313a4f978a2c..a959a9859e8b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -454,6 +454,8 @@ struct nvme_ns_ids {
 	u8	csi;
 };
 
+#define NVME_MAX_PLIDS   (WRITE_LIFE_EXTREME + 1)
+
 /*
  * Anchor structure for namespaces.  There is one for each namespace in a
  * NVMe subsystem that any of our controllers can see, and the namespace
@@ -490,6 +492,8 @@ struct nvme_ns_head {
 	struct device		cdev_device;
 
 	struct gendisk		*disk;
+	u16			nr_plids;
+	u16			plids[NVME_MAX_PLIDS];
 #ifdef CONFIG_NVME_MULTIPATH
 	struct bio_list		requeue_list;
 	spinlock_t		requeue_lock;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b58d9405d65e..a954eaee5b0f 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -275,6 +275,7 @@ enum nvme_ctrl_attr {
 	NVME_CTRL_ATTR_HID_128_BIT	= (1 << 0),
 	NVME_CTRL_ATTR_TBKAS		= (1 << 6),
 	NVME_CTRL_ATTR_ELBAS		= (1 << 15),
+	NVME_CTRL_ATTR_FDPS		= (1 << 19),
 };
 
 struct nvme_id_ctrl {
@@ -843,6 +844,7 @@ enum nvme_opcode {
 	nvme_cmd_resv_register	= 0x0d,
 	nvme_cmd_resv_report	= 0x0e,
 	nvme_cmd_resv_acquire	= 0x11,
+	nvme_cmd_io_mgmt_recv	= 0x12,
 	nvme_cmd_resv_release	= 0x15,
 	nvme_cmd_zone_mgmt_send	= 0x79,
 	nvme_cmd_zone_mgmt_recv	= 0x7a,
@@ -864,6 +866,7 @@ enum nvme_opcode {
 		nvme_opcode_name(nvme_cmd_resv_register),	\
 		nvme_opcode_name(nvme_cmd_resv_report),		\
 		nvme_opcode_name(nvme_cmd_resv_acquire),	\
+		nvme_opcode_name(nvme_cmd_io_mgmt_recv),	\
 		nvme_opcode_name(nvme_cmd_resv_release),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_send),	\
 		nvme_opcode_name(nvme_cmd_zone_mgmt_recv),	\
@@ -1015,6 +1018,7 @@ enum {
 	NVME_RW_PRINFO_PRCHK_GUARD	= 1 << 12,
 	NVME_RW_PRINFO_PRACT		= 1 << 13,
 	NVME_RW_DTYPE_STREAMS		= 1 << 4,
+	NVME_RW_DTYPE_DPLCMT		= 2 << 4,
 	NVME_WZ_DEAC			= 1 << 9,
 };
 
@@ -1102,6 +1106,20 @@ struct nvme_zone_mgmt_recv_cmd {
 	__le32			cdw14[2];
 };
 
+struct nvme_io_mgmt_recv_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__le64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__u8			mo;
+	__u8			rsvd11;
+	__u16			mos;
+	__le32			numd;
+	__le32			cdw12[4];
+};
+
 enum {
 	NVME_ZRA_ZONE_REPORT		= 0,
 	NVME_ZRASF_ZONE_REPORT_ALL	= 0,
@@ -1822,6 +1840,7 @@ struct nvme_command {
 		struct nvmf_auth_receive_command auth_receive;
 		struct nvme_dbbuf dbbuf;
 		struct nvme_directive_cmd directive;
+		struct nvme_io_mgmt_recv_cmd imr;
 	};
 };
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v6 2/3] block, fs: restore kiocb based write hint processing
       [not found]   ` <CGME20240924093254epcas5p491d7f7cb62dbbf05fe29e0e75d44bff5@epcas5p4.samsung.com>
@ 2024-09-24  9:24     ` Kanchan Joshi
  2024-09-25  5:49       ` Hannes Reinecke
  0 siblings, 1 reply; 13+ messages in thread
From: Kanchan Joshi @ 2024-09-24  9:24 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, brauner, viro, jack,
	jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi, Nitesh Shetty

struct kiocb has a 2 bytes hole that developed post commit 41d36a9f3e53
("fs: remove kiocb.ki_hint").
But write hint has made a comeback with commit 449813515d3e ("block, fs:
Restore the per-bio/request data lifetime fields").

This patch uses the leftover space in kiocb to carve 1 byte field
ki_write_hint.
Restore the code that operates on kiocb to use ki_write_hint instead of
inode hint value.

This does not bring any behavior change, but needed to enable per-io
hints (by another patch).

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
---
 block/fops.c         | 6 +++---
 fs/aio.c             | 1 +
 fs/cachefiles/io.c   | 1 +
 fs/direct-io.c       | 2 +-
 fs/iomap/direct-io.c | 2 +-
 include/linux/fs.h   | 8 ++++++++
 io_uring/rw.c        | 1 +
 7 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index e69deb6d8635..3b8c0858a4fe 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -74,7 +74,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 		bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
 	}
 	bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio.bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio.bi_write_hint = iocb->ki_write_hint;
 	bio.bi_ioprio = iocb->ki_ioprio;
 	if (iocb->ki_flags & IOCB_ATOMIC)
 		bio.bi_opf |= REQ_ATOMIC;
@@ -203,7 +203,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 
 	for (;;) {
 		bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-		bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+		bio->bi_write_hint = iocb->ki_write_hint;
 		bio->bi_private = dio;
 		bio->bi_end_io = blkdev_bio_end_io;
 		bio->bi_ioprio = iocb->ki_ioprio;
@@ -319,7 +319,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	dio->flags = 0;
 	dio->iocb = iocb;
 	bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio->bi_write_hint = iocb->ki_write_hint;
 	bio->bi_end_io = blkdev_bio_end_io_async;
 	bio->bi_ioprio = iocb->ki_ioprio;
 
diff --git a/fs/aio.c b/fs/aio.c
index e8920178b50f..db618817e670 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1517,6 +1517,7 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb, int rw_type)
 	req->ki_flags = req->ki_filp->f_iocb_flags | IOCB_AIO_RW;
 	if (iocb->aio_flags & IOCB_FLAG_RESFD)
 		req->ki_flags |= IOCB_EVENTFD;
+	req->ki_write_hint = file_write_hint(req->ki_filp);
 	if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
 		/*
 		 * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c
index 6a821a959b59..c3db102ae64e 100644
--- a/fs/cachefiles/io.c
+++ b/fs/cachefiles/io.c
@@ -309,6 +309,7 @@ int __cachefiles_write(struct cachefiles_object *object,
 	ki->iocb.ki_pos		= start_pos;
 	ki->iocb.ki_flags	= IOCB_DIRECT | IOCB_WRITE;
 	ki->iocb.ki_ioprio	= get_current_ioprio();
+	ki->iocb.ki_write_hint  = file_write_hint(file);
 	ki->object		= object;
 	ki->start		= start_pos;
 	ki->len			= len;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index bbd05f1a2145..73629e26becb 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -409,7 +409,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 		bio->bi_end_io = dio_bio_end_io;
 	if (dio->is_pinned)
 		bio_set_flag(bio, BIO_PAGE_PINNED);
-	bio->bi_write_hint = file_inode(dio->iocb->ki_filp)->i_write_hint;
+	bio->bi_write_hint = dio->iocb->ki_write_hint;
 
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..583189796f0c 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -380,7 +380,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 					  GFP_KERNEL);
 		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
-		bio->bi_write_hint = inode->i_write_hint;
+		bio->bi_write_hint = dio->iocb->ki_write_hint;
 		bio->bi_ioprio = dio->iocb->ki_ioprio;
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0df3e5f0dd2b..00c7b05a2496 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -370,6 +370,7 @@ struct kiocb {
 	void			*private;
 	int			ki_flags;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	enum rw_hint		ki_write_hint;
 	union {
 		/*
 		 * Only used for async buffered reads, where it denotes the
@@ -2336,12 +2337,18 @@ static inline bool HAS_UNMAPPED_ID(struct mnt_idmap *idmap,
 	       !vfsgid_valid(i_gid_into_vfsgid(idmap, inode));
 }
 
+static inline enum rw_hint file_write_hint(struct file *filp)
+{
+	return file_inode(filp)->i_write_hint;
+}
+
 static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 {
 	*kiocb = (struct kiocb) {
 		.ki_filp = filp,
 		.ki_flags = filp->f_iocb_flags,
 		.ki_ioprio = get_current_ioprio(),
+		.ki_write_hint = file_write_hint(filp),
 	};
 }
 
@@ -2352,6 +2359,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 		.ki_filp = filp,
 		.ki_flags = kiocb_src->ki_flags,
 		.ki_ioprio = kiocb_src->ki_ioprio,
+		.ki_write_hint = kiocb_src->ki_write_hint,
 		.ki_pos = kiocb_src->ki_pos,
 	};
 }
diff --git a/io_uring/rw.c b/io_uring/rw.c
index f023ff49c688..510123d3d837 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1023,6 +1023,7 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	if (unlikely(ret))
 		return ret;
 	req->cqe.res = iov_iter_count(&io->iter);
+	rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
 
 	if (force_nonblock) {
 		/* If the file doesn't support async, just async punt */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v6 3/3] io_uring: enable per-io hinting capability
       [not found]   ` <CGME20240924093257epcas5p174955ae79ae2d08a886eeb45a6976d53@epcas5p1.samsung.com>
@ 2024-09-24  9:24     ` Kanchan Joshi
  2024-09-24  9:40       ` Christoph Hellwig
  2024-09-25  5:57       ` Hannes Reinecke
  0 siblings, 2 replies; 13+ messages in thread
From: Kanchan Joshi @ 2024-09-24  9:24 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, brauner, viro, jack,
	jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi, Nitesh Shetty

With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and
all the subsequent writes on the file pass that hint value down.
This can be limiting for large files (and for block device) as all the
writes can be tagged with only one lifetime hint value.
Concurrent writes (with different hint values) are hard to manage.
Per-IO hinting solves that problem.

Allow userspace to pass the write hint type and its value in the SQE.
Two new fields are carved in the leftover space of SQE:
	__u8 hint_type;
	__u64 hint_val;

Adding the hint_type helps in keeping the interface extensible for future
use.
At this point only one type TYPE_WRITE_LIFETIME_HINT is supported. With
this type, user can pass the lifetime hint values that are currently
supported by F_SET_RW_HINT fcntl.

The write handlers (io_prep_rw, io_write) process the hint type/value
and hint value is passed to lower-layer using kiocb. This is good for
supporting direct IO, but not when kiocb is not available (e.g.,
buffered IO).

In general, per-io hints take the precedence on per-inode hints.
Three cases to consider:

Case 1: When hint_type is 0 (explicitly, or implicitly as SQE fields are
initialized to 0), this means user did not send any hint. The per-inode
hint values are set in the kiocb (as before).

Case 2: When hint_type is TYPE_WRITE_LIFETIME_HINT, the hint_value is
set into the kiocb after sanity checking.

Case 3: When hint_type is anything else, this is flagged as an error
and write is failed.

Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
---
 fs/fcntl.c                    | 22 ----------------------
 include/linux/rw_hint.h       | 24 ++++++++++++++++++++++++
 include/uapi/linux/io_uring.h | 10 ++++++++++
 io_uring/rw.c                 | 21 ++++++++++++++++++++-
 4 files changed, 54 insertions(+), 23 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 081e5e3d89ea..2eb78035a350 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -334,28 +334,6 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
 }
 #endif
 
-static bool rw_hint_valid(u64 hint)
-{
-	BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET);
-	BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE);
-	BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT);
-	BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM);
-	BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG);
-	BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME);
-
-	switch (hint) {
-	case RWH_WRITE_LIFE_NOT_SET:
-	case RWH_WRITE_LIFE_NONE:
-	case RWH_WRITE_LIFE_SHORT:
-	case RWH_WRITE_LIFE_MEDIUM:
-	case RWH_WRITE_LIFE_LONG:
-	case RWH_WRITE_LIFE_EXTREME:
-		return true;
-	default:
-		return false;
-	}
-}
-
 static long fcntl_get_rw_hint(struct file *file, unsigned int cmd,
 			      unsigned long arg)
 {
diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h
index 309ca72f2dfb..f4373a71ffed 100644
--- a/include/linux/rw_hint.h
+++ b/include/linux/rw_hint.h
@@ -21,4 +21,28 @@ enum rw_hint {
 static_assert(sizeof(enum rw_hint) == 1);
 #endif
 
+#define	WRITE_LIFE_INVALID	(RWH_WRITE_LIFE_EXTREME + 1)
+
+static inline bool rw_hint_valid(u64 hint)
+{
+	BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET);
+	BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE);
+	BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT);
+	BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM);
+	BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG);
+	BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME);
+
+	switch (hint) {
+	case RWH_WRITE_LIFE_NOT_SET:
+	case RWH_WRITE_LIFE_NONE:
+	case RWH_WRITE_LIFE_SHORT:
+	case RWH_WRITE_LIFE_MEDIUM:
+	case RWH_WRITE_LIFE_LONG:
+	case RWH_WRITE_LIFE_EXTREME:
+		return true;
+	default:
+		return false;
+	}
+}
+
 #endif /* _LINUX_RW_HINT_H */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 1fe79e750470..e21a74dd0c49 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -98,6 +98,11 @@ struct io_uring_sqe {
 			__u64	addr3;
 			__u64	__pad2[1];
 		};
+		struct {
+			/* To send per-io hint type/value with write command */
+			__u64	hint_val;
+			__u8	hint_type;
+		};
 		__u64	optval;
 		/*
 		 * If the ring is initialized with IORING_SETUP_SQE128, then
@@ -107,6 +112,11 @@ struct io_uring_sqe {
 	};
 };
 
+enum hint_type {
+	/* this type covers the values supported by F_SET_RW_HINT fcntl */
+	TYPE_WRITE_LIFETIME_HINT = 1,
+};
+
 /*
  * If sqe->file_index is set to this for opcodes that instantiate a new
  * direct descriptor (like openat/openat2/accept), then io_uring will allocate
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 510123d3d837..f78ad0ddeef5 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -269,6 +269,20 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		rw->kiocb.ki_ioprio = get_current_ioprio();
 	}
 	rw->kiocb.dio_complete = NULL;
+	if (ddir == ITER_SOURCE) {
+		u8 htype = READ_ONCE(sqe->hint_type);
+
+		rw->kiocb.ki_write_hint = WRITE_LIFE_INVALID;
+		if (htype) {
+			u64 hval = READ_ONCE(sqe->hint_val);
+
+			if (htype != TYPE_WRITE_LIFETIME_HINT ||
+			    !rw_hint_valid(hval))
+				return -EINVAL;
+
+			rw->kiocb.ki_write_hint = hval;
+		}
+	}
 
 	rw->addr = READ_ONCE(sqe->addr);
 	rw->len = READ_ONCE(sqe->len);
@@ -1023,7 +1037,12 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	if (unlikely(ret))
 		return ret;
 	req->cqe.res = iov_iter_count(&io->iter);
-	rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
+	/*
+	 * Use per-file hint only if per-io hint is not set.
+	 * We need per-io hint to get precedence.
+	 */
+	if (rw->kiocb.ki_write_hint == WRITE_LIFE_INVALID)
+		rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
 
 	if (force_nonblock) {
 		/* If the file doesn't support async, just async punt */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 1/3] nvme: enable FDP support
  2024-09-24  9:24     ` [PATCH v6 1/3] nvme: enable FDP support Kanchan Joshi
@ 2024-09-24  9:39       ` Christoph Hellwig
  2024-09-25  5:48       ` Hannes Reinecke
  1 sibling, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2024-09-24  9:39 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, kbusch, hch, sagi, martin.petersen, brauner, viro, jack,
	jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz, Hui Qi, Nitesh Shetty

As far as I can tell this regresseѕ back to before when you tried
to actually make forward progress and is the same as the old version
again :(

For that:

Nacked-by: Christoph Hellwig <[email protected]>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 3/3] io_uring: enable per-io hinting capability
  2024-09-24  9:24     ` [PATCH v6 3/3] io_uring: enable per-io hinting capability Kanchan Joshi
@ 2024-09-24  9:40       ` Christoph Hellwig
  2024-09-25  5:57       ` Hannes Reinecke
  1 sibling, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2024-09-24  9:40 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, kbusch, hch, sagi, martin.petersen, brauner, viro, jack,
	jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme,
	linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
	vishak.g, javier.gonz, Nitesh Shetty

Per-I/O hints really can't work for file system interfaes.  They
will have to be a clear opt in to support for regular files.

I'll do a more detailed review when I'm back next week.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 1/3] nvme: enable FDP support
  2024-09-24  9:24     ` [PATCH v6 1/3] nvme: enable FDP support Kanchan Joshi
  2024-09-24  9:39       ` Christoph Hellwig
@ 2024-09-25  5:48       ` Hannes Reinecke
  1 sibling, 0 replies; 13+ messages in thread
From: Hannes Reinecke @ 2024-09-25  5:48 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, sagi, martin.petersen, brauner,
	viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Hui Qi, Nitesh Shetty

On 9/24/24 11:24, Kanchan Joshi wrote:
> Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
> to control the placement of logical blocks so as to reduce the SSD WAF.
> 
> Userspace can send the data lifetime information using the write hints.
> The SCSI driver (sd) can already pass this information to the SCSI
> devices. This patch does the same for NVMe.
> 
> Fetch the placement-identifiers if the device supports FDP.
> The incoming write-hint is mapped to a placement-identifier, which in
> turn is set in the DSPEC field of the write command.
> 
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Hui Qi <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
> ---
>   drivers/nvme/host/core.c | 70 ++++++++++++++++++++++++++++++++++++++++
>   drivers/nvme/host/nvme.h |  4 +++
>   include/linux/nvme.h     | 19 +++++++++++
>   3 files changed, 93 insertions(+)
> 
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
[email protected]                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 2/3] block, fs: restore kiocb based write hint processing
  2024-09-24  9:24     ` [PATCH v6 2/3] block, fs: restore kiocb based write hint processing Kanchan Joshi
@ 2024-09-25  5:49       ` Hannes Reinecke
  0 siblings, 0 replies; 13+ messages in thread
From: Hannes Reinecke @ 2024-09-25  5:49 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, sagi, martin.petersen, brauner,
	viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Nitesh Shetty

On 9/24/24 11:24, Kanchan Joshi wrote:
> struct kiocb has a 2 bytes hole that developed post commit 41d36a9f3e53
> ("fs: remove kiocb.ki_hint").
> But write hint has made a comeback with commit 449813515d3e ("block, fs:
> Restore the per-bio/request data lifetime fields").
> 
> This patch uses the leftover space in kiocb to carve 1 byte field
> ki_write_hint.
> Restore the code that operates on kiocb to use ki_write_hint instead of
> inode hint value.
> 
> This does not bring any behavior change, but needed to enable per-io
> hints (by another patch).
> 
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
> ---
>   block/fops.c         | 6 +++---
>   fs/aio.c             | 1 +
>   fs/cachefiles/io.c   | 1 +
>   fs/direct-io.c       | 2 +-
>   fs/iomap/direct-io.c | 2 +-
>   include/linux/fs.h   | 8 ++++++++
>   io_uring/rw.c        | 1 +
>   7 files changed, 16 insertions(+), 5 deletions(-)
> 
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
[email protected]                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 3/3] io_uring: enable per-io hinting capability
  2024-09-24  9:24     ` [PATCH v6 3/3] io_uring: enable per-io hinting capability Kanchan Joshi
  2024-09-24  9:40       ` Christoph Hellwig
@ 2024-09-25  5:57       ` Hannes Reinecke
  2024-09-25 11:09         ` Kanchan Joshi
  1 sibling, 1 reply; 13+ messages in thread
From: Hannes Reinecke @ 2024-09-25  5:57 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, sagi, martin.petersen, brauner,
	viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Nitesh Shetty

On 9/24/24 11:24, Kanchan Joshi wrote:
> With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and
> all the subsequent writes on the file pass that hint value down.
> This can be limiting for large files (and for block device) as all the
> writes can be tagged with only one lifetime hint value.
> Concurrent writes (with different hint values) are hard to manage.
> Per-IO hinting solves that problem.
> 
> Allow userspace to pass the write hint type and its value in the SQE.
> Two new fields are carved in the leftover space of SQE:
> 	__u8 hint_type;
> 	__u64 hint_val;
> 
> Adding the hint_type helps in keeping the interface extensible for future
> use.
> At this point only one type TYPE_WRITE_LIFETIME_HINT is supported. With
> this type, user can pass the lifetime hint values that are currently
> supported by F_SET_RW_HINT fcntl.
> 
> The write handlers (io_prep_rw, io_write) process the hint type/value
> and hint value is passed to lower-layer using kiocb. This is good for
> supporting direct IO, but not when kiocb is not available (e.g.,
> buffered IO).
> 
> In general, per-io hints take the precedence on per-inode hints.
> Three cases to consider:
> 
> Case 1: When hint_type is 0 (explicitly, or implicitly as SQE fields are
> initialized to 0), this means user did not send any hint. The per-inode
> hint values are set in the kiocb (as before).
> 
> Case 2: When hint_type is TYPE_WRITE_LIFETIME_HINT, the hint_value is
> set into the kiocb after sanity checking.
> 
> Case 3: When hint_type is anything else, this is flagged as an error
> and write is failed.
> 
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
> ---
>   fs/fcntl.c                    | 22 ----------------------
>   include/linux/rw_hint.h       | 24 ++++++++++++++++++++++++
>   include/uapi/linux/io_uring.h | 10 ++++++++++
>   io_uring/rw.c                 | 21 ++++++++++++++++++++-
>   4 files changed, 54 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 081e5e3d89ea..2eb78035a350 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -334,28 +334,6 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
>   }
>   #endif
>   
> -static bool rw_hint_valid(u64 hint)
> -{
> -	BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET);
> -	BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE);
> -	BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT);
> -	BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM);
> -	BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG);
> -	BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME);
> -
> -	switch (hint) {
> -	case RWH_WRITE_LIFE_NOT_SET:
> -	case RWH_WRITE_LIFE_NONE:
> -	case RWH_WRITE_LIFE_SHORT:
> -	case RWH_WRITE_LIFE_MEDIUM:
> -	case RWH_WRITE_LIFE_LONG:
> -	case RWH_WRITE_LIFE_EXTREME:
> -		return true;
> -	default:
> -		return false;
> -	}
> -}
> -
>   static long fcntl_get_rw_hint(struct file *file, unsigned int cmd,
>   			      unsigned long arg)
>   {
> diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h
> index 309ca72f2dfb..f4373a71ffed 100644
> --- a/include/linux/rw_hint.h
> +++ b/include/linux/rw_hint.h
> @@ -21,4 +21,28 @@ enum rw_hint {
>   static_assert(sizeof(enum rw_hint) == 1);
>   #endif
>   
> +#define	WRITE_LIFE_INVALID	(RWH_WRITE_LIFE_EXTREME + 1)
> +
> +static inline bool rw_hint_valid(u64 hint)
> +{
> +	BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET);
> +	BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE);
> +	BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT);
> +	BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM);
> +	BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG);
> +	BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME);
> +
> +	switch (hint) {
> +	case RWH_WRITE_LIFE_NOT_SET:
> +	case RWH_WRITE_LIFE_NONE:
> +	case RWH_WRITE_LIFE_SHORT:
> +	case RWH_WRITE_LIFE_MEDIUM:
> +	case RWH_WRITE_LIFE_LONG:
> +	case RWH_WRITE_LIFE_EXTREME:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
>   #endif /* _LINUX_RW_HINT_H */
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 1fe79e750470..e21a74dd0c49 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -98,6 +98,11 @@ struct io_uring_sqe {
>   			__u64	addr3;
>   			__u64	__pad2[1];
>   		};
> +		struct {
> +			/* To send per-io hint type/value with write command */
> +			__u64	hint_val;
> +			__u8	hint_type;
> +		};
Why is 'hint_val' 64 bits? Everything else is 8 bytes, so wouldn't it
be better to shorten that? As it stands the new struct will introduce
a hole of 24 bytes after 'hint_type'.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
[email protected]                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 3/3] io_uring: enable per-io hinting capability
  2024-09-25  5:57       ` Hannes Reinecke
@ 2024-09-25 11:09         ` Kanchan Joshi
  2024-09-25 12:23           ` Pavel Begunkov
  0 siblings, 1 reply; 13+ messages in thread
From: Kanchan Joshi @ 2024-09-25 11:09 UTC (permalink / raw)
  To: Hannes Reinecke, axboe, kbusch, hch, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche,
	asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Nitesh Shetty

On 9/25/2024 11:27 AM, Hannes Reinecke wrote:
>> @@ -98,6 +98,11 @@ struct io_uring_sqe {
>>               __u64    addr3;
>>               __u64    __pad2[1];
>>           };
>> +        struct {
>> +            /* To send per-io hint type/value with write command */
>> +            __u64    hint_val;
>> +            __u8    hint_type;
>> +        };
> Why is 'hint_val' 64 bits? Everything else is 8 bytes, so wouldn't it
> be better to shorten that? 

Right, within kernel hint is stored as 8bits value.
But I chose not because how kernel stores hint internally (which may 
change at any time) but how the existing F_SET_RW_HINT interface exposed 
this to user space. It expects u64.

If we do 8bits interface here, application needs to learn that for the 
same lifetime hint it needs u64 for fcntl interface, but u8 for io_uring 
interface. That seems a bit confusing.

Also, in future if we do support another hint type, we may be able to 
pass hint_val beyond what can be supported by u8.

As it stands the new struct will introduce
> a hole of 24 bytes after 'hint_type'.

This gets implicitly padded at this point [1][2], and overall size is 
still capped by largest struct (which is of 16 bytes, placed just above 
this).

[1] On 64bit
»       union {
»       »       struct {
»       »       »       __u64      addr3;                /*    48     8 */
»       »       »       __u64      __pad2[1];            /*    56     8 */
»       »       };                                       /*    48    16 */
»       »       struct {
»       »       »       __u64      hint_val;             /*    48     8 */
»       »       »       __u8       hint_type;            /*    56     1 */
»       »       };                                       /*    48    16 */
»       »       __u64              optval;               /*    48     8 */
»       »       __u8               cmd[0];               /*    48     0 */
»       };                                               /*    48    16 */

»       /* size: 64, cachelines: 1, members: 13 */

[2] On 32bit

»       union {
»       »       struct {
»       »       »       __u64      addr3;                /*    48     8 */
»       »       »       __u64      __pad2[1];            /*    56     8 */
»       »       };                                       /*    48    16 */
»       »       struct {
»       »       »       __u64      hint_val;             /*    48     8 */
»       »       »       __u8       hint_type;            /*    56     1 */
»       »       };                                       /*    48    12 */
»       »       __u64              optval;               /*    48     8 */
»       »       __u8               cmd[0];               /*    48     0 */
»       };                                               /*    48    16 */

»       /* size: 64, cachelines: 1, members: 13 */
};

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 3/3] io_uring: enable per-io hinting capability
  2024-09-25 11:09         ` Kanchan Joshi
@ 2024-09-25 12:23           ` Pavel Begunkov
  2024-09-25 13:21             ` Kanchan Joshi
  0 siblings, 1 reply; 13+ messages in thread
From: Pavel Begunkov @ 2024-09-25 12:23 UTC (permalink / raw)
  To: Kanchan Joshi, Hannes Reinecke, axboe, kbusch, hch, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Nitesh Shetty

On 9/25/24 12:09, Kanchan Joshi wrote:
> On 9/25/2024 11:27 AM, Hannes Reinecke wrote:
...
> As it stands the new struct will introduce
>> a hole of 24 bytes after 'hint_type'.
> 
> This gets implicitly padded at this point [1][2], and overall size is
> still capped by largest struct (which is of 16 bytes, placed just above
> this).

For me it's about having hardly usable in the future by anyone else
7 bytes of space or how much that will be. Try to add another field
using those bytes and endianess will start messing with you. And 7
bytes is not that convenient.

I have same problem with how commands were merged while I was not
looking. There was no explicit padding, and it split u64 into u32
and implicit padding, so no apps can use the space to put a pointer
anymore while there was a much better option of using one of existing
4B fields.


> [1] On 64bit
> »       union {
> »       »       struct {
> »       »       »       __u64      addr3;                /*    48     8 */
> »       »       »       __u64      __pad2[1];            /*    56     8 */
> »       »       };                                       /*    48    16 */
> »       »       struct {
> »       »       »       __u64      hint_val;             /*    48     8 */
> »       »       »       __u8       hint_type;            /*    56     1 */
> »       »       };                                       /*    48    16 */
> »       »       __u64              optval;               /*    48     8 */
> »       »       __u8               cmd[0];               /*    48     0 */
> »       };                                               /*    48    16 */
> 
> »       /* size: 64, cachelines: 1, members: 13 */
> 
> [2] On 32bit
> 
> »       union {
> »       »       struct {
> »       »       »       __u64      addr3;                /*    48     8 */
> »       »       »       __u64      __pad2[1];            /*    56     8 */
> »       »       };                                       /*    48    16 */
> »       »       struct {
> »       »       »       __u64      hint_val;             /*    48     8 */
> »       »       »       __u8       hint_type;            /*    56     1 */
> »       »       };                                       /*    48    12 */
> »       »       __u64              optval;               /*    48     8 */
> »       »       __u8               cmd[0];               /*    48     0 */
> »       };                                               /*    48    16 */
> 
> »       /* size: 64, cachelines: 1, members: 13 */
> };

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 3/3] io_uring: enable per-io hinting capability
  2024-09-25 12:23           ` Pavel Begunkov
@ 2024-09-25 13:21             ` Kanchan Joshi
  2024-09-26 20:09               ` Pavel Begunkov
  0 siblings, 1 reply; 13+ messages in thread
From: Kanchan Joshi @ 2024-09-25 13:21 UTC (permalink / raw)
  To: Pavel Begunkov, Hannes Reinecke, axboe, kbusch, hch, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Nitesh Shetty

On 9/25/2024 5:53 PM, Pavel Begunkov wrote:
> On 9/25/24 12:09, Kanchan Joshi wrote:
>> On 9/25/2024 11:27 AM, Hannes Reinecke wrote:
> ...
>> As it stands the new struct will introduce
>>> a hole of 24 bytes after 'hint_type'.
>>
>> This gets implicitly padded at this point [1][2], and overall size is
>> still capped by largest struct (which is of 16 bytes, placed just above
>> this).
> 
> For me it's about having hardly usable in the future by anyone else
> 7 bytes of space or how much that will be. Try to add another field
> using those bytes and endianess will start messing with you. And 7
> bytes is not that convenient.
> 
> I have same problem with how commands were merged while I was not
> looking. There was no explicit padding, and it split u64 into u32
> and implicit padding, so no apps can use the space to put a pointer
> anymore while there was a much better option of using one of existing
> 4B fields.

How would you prefer it. Explicit padding (7 bytes), hint_type as u16 or 
anything else?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v6 3/3] io_uring: enable per-io hinting capability
  2024-09-25 13:21             ` Kanchan Joshi
@ 2024-09-26 20:09               ` Pavel Begunkov
  0 siblings, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2024-09-26 20:09 UTC (permalink / raw)
  To: Kanchan Joshi, Hannes Reinecke, axboe, kbusch, hch, sagi,
	martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
	bvanassche
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Nitesh Shetty

On 9/25/24 14:21, Kanchan Joshi wrote:
> On 9/25/2024 5:53 PM, Pavel Begunkov wrote:
>> On 9/25/24 12:09, Kanchan Joshi wrote:
>>> On 9/25/2024 11:27 AM, Hannes Reinecke wrote:
>> ...
>>> As it stands the new struct will introduce
>>>> a hole of 24 bytes after 'hint_type'.
>>>
>>> This gets implicitly padded at this point [1][2], and overall size is
>>> still capped by largest struct (which is of 16 bytes, placed just above
>>> this).
>>
>> For me it's about having hardly usable in the future by anyone else
>> 7 bytes of space or how much that will be. Try to add another field
>> using those bytes and endianess will start messing with you. And 7
>> bytes is not that convenient.
>>
>> I have same problem with how commands were merged while I was not
>> looking. There was no explicit padding, and it split u64 into u32
>> and implicit padding, so no apps can use the space to put a pointer
>> anymore while there was a much better option of using one of existing
>> 4B fields.
> 
> How would you prefer it. Explicit padding (7 bytes), hint_type as u16 or
> anything else?

Explicit padding is better than the current version. Ideally,
I'd like the new fields gone (e.g. if it goes in the direction
of per file hints) or prefer to minimise the size and make the
leftover padding reusable, but that depends on what the feature
needs to be extendable.

And what hint types do we expect in the future? Another question,
don't we want an apui that allows to pass multiple hints? Quite
similar to what I asked about "meta" rw, and it might actually
make a lot of sense to combine them into common infra, like what
cmsg is for networking.

meta[] = [ {INTEGRITY, integrity_params},
            {write_hint, ...},
            ...];

Even though an actual impl would need to be a bit more elaborated.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-09-26 20:08 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <CGME20240924093247epcas5p4807fe5e531a2b7b2d6961d23bc989c80@epcas5p4.samsung.com>
2024-09-24  9:24 ` [PATCH v6 0/3] per-io hints and FDP Kanchan Joshi
     [not found]   ` <CGME20240924093250epcas5p39259624b9ebabdef15081ea9bd663d41@epcas5p3.samsung.com>
2024-09-24  9:24     ` [PATCH v6 1/3] nvme: enable FDP support Kanchan Joshi
2024-09-24  9:39       ` Christoph Hellwig
2024-09-25  5:48       ` Hannes Reinecke
     [not found]   ` <CGME20240924093254epcas5p491d7f7cb62dbbf05fe29e0e75d44bff5@epcas5p4.samsung.com>
2024-09-24  9:24     ` [PATCH v6 2/3] block, fs: restore kiocb based write hint processing Kanchan Joshi
2024-09-25  5:49       ` Hannes Reinecke
     [not found]   ` <CGME20240924093257epcas5p174955ae79ae2d08a886eeb45a6976d53@epcas5p1.samsung.com>
2024-09-24  9:24     ` [PATCH v6 3/3] io_uring: enable per-io hinting capability Kanchan Joshi
2024-09-24  9:40       ` Christoph Hellwig
2024-09-25  5:57       ` Hannes Reinecke
2024-09-25 11:09         ` Kanchan Joshi
2024-09-25 12:23           ` Pavel Begunkov
2024-09-25 13:21             ` Kanchan Joshi
2024-09-26 20:09               ` Pavel Begunkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox