* [PATCH v7 0/3] FDP and per-io hints
       [not found] <CGME20240930182052epcas5p37edefa7556b87c3fbb543275756ac736@epcas5p3.samsung.com>
@ 2024-09-30 18:13 ` Kanchan Joshi
       [not found] ` <CGME20240930182056epcas5p33f823c00caadf9388b509bafcad86f3d@epcas5p3.samsung.com>
                     ` (5 more replies)
  0 siblings, 6 replies; 86+ messages in thread
From: Kanchan Joshi @ 2024-09-30 18:13 UTC (permalink / raw)
  To: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi

Another spin to incorporate the feedback from LPC and previous
iterations.

The series adds two capabilities:
- FDP support at the NVMe level (patch #1)
- Per-IO hinting via io_uring (patch #3)

Patch #2 is needed to enable per-IO hints. The motivation and interface
details are present in the commit descriptions.

Testing: done with fcntl- and liburing-based custom applications, on a
raw block device and on ext4, XFS, Btrfs, and F2FS. Checked that no
regression occurs for applications that use per-inode hints. Checked
that per-io hints, when passed, take precedence over per-inode hints.
Changes since v6:
- Change io_uring interface to pass hints as SQE metadata (Pavel, Hannes)

Changes since v5:
- Drop placement hints
- Add per-io hint interface

Changes since v4:
- Retain the size/type checking on the enum (Bart)
- Use the name "*_lifetime_hint" rather than "*_life_hint" (Bart)

Changes since v3:
- 4 new patches to introduce placement hints
- Make nvme patch use the placement hints rather than lifetime hints

Changes since v2:
- Base it on nvme-6.11 and resolve a merge conflict

Changes since v1:
- Reduce the fetched plids from 128 to 6 (Keith)
- Use struct_size for a calculation (Keith)
- Handle robot/sparse warning

Kanchan Joshi (3):
  nvme: enable FDP support
  block, fs: restore kiocb based write hint processing
  io_uring: enable per-io hinting capability

 block/fops.c                  |  6 +--
 drivers/nvme/host/core.c      | 70 +++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h      |  4 ++
 fs/aio.c                      |  1 +
 fs/cachefiles/io.c            |  1 +
 fs/direct-io.c                |  2 +-
 fs/fcntl.c                    | 22 -----------
 fs/iomap/direct-io.c          |  2 +-
 include/linux/fs.h            |  8 ++++
 include/linux/nvme.h          | 19 ++++++++++
 include/linux/rw_hint.h       | 24 ++++++++++++
 include/uapi/linux/io_uring.h | 19 ++++++++++
 io_uring/rw.c                 | 24 ++++++++++++
 13 files changed, 175 insertions(+), 27 deletions(-)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 86+ messages in thread
* [PATCH v7 1/3] nvme: enable FDP support
       [not found] ` <CGME20240930182056epcas5p33f823c00caadf9388b509bafcad86f3d@epcas5p3.samsung.com>
@ 2024-09-30 18:13   ` Kanchan Joshi
  2024-10-02 18:37     ` Bart Van Assche
  0 siblings, 1 reply; 86+ messages in thread
From: Kanchan Joshi @ 2024-09-30 18:13 UTC (permalink / raw)
  To: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi, Hui Qi,
	Nitesh Shetty

Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
to control the placement of logical blocks so as to reduce the SSD
write amplification factor (WAF).

Userspace can send data-lifetime information using write hints. The
SCSI driver (sd) can already pass this information to SCSI devices.
This patch does the same for NVMe.

Fetch the placement identifiers if the device supports FDP. The
incoming write hint is mapped to a placement identifier, which in turn
is set in the DSPEC field of the write command.
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Hui Qi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Nacked-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
---
 drivers/nvme/host/core.c | 70 ++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  4 +++
 include/linux/nvme.h     | 19 +++++++++++
 3 files changed, 93 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ba6508455e18..ad5cc1ec8c4f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -44,6 +44,20 @@ struct nvme_ns_info {
 	bool is_removed;
 };
 
+struct nvme_fdp_ruh_status_desc {
+	u16 pid;
+	u16 ruhid;
+	u32 earutr;
+	u64 ruamw;
+	u8  rsvd16[16];
+};
+
+struct nvme_fdp_ruh_status {
+	u8  rsvd0[14];
+	__le16 nruhsd;
+	struct nvme_fdp_ruh_status_desc ruhsd[];
+};
+
 unsigned int admin_timeout = 60;
 module_param(admin_timeout, uint, 0644);
 MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
@@ -959,6 +973,19 @@ static bool nvme_valid_atomic_write(struct request *req)
 	return true;
 }
 
+static inline void nvme_assign_placement_id(struct nvme_ns *ns,
+					    struct request *req,
+					    struct nvme_command *cmd)
+{
+	enum rw_hint h = req->write_hint;
+
+	if (h >= ns->head->nr_plids)
+		return;
+
+	cmd->rw.control |= cpu_to_le16(NVME_RW_DTYPE_DPLCMT);
+	cmd->rw.dsmgmt |= cpu_to_le32(ns->head->plids[h] << 16);
+}
+
 static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 		struct request *req, struct nvme_command *cmnd,
 		enum nvme_opcode op)
@@ -1078,6 +1105,8 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req)
 		break;
 	case REQ_OP_WRITE:
 		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
+		if (!ret && ns->head->nr_plids)
+			nvme_assign_placement_id(ns, req, cmd);
 		break;
 	case REQ_OP_ZONE_APPEND:
 		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
@@ -2114,6 +2143,40 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
 	return ret;
 }
 
+static int
+nvme_fetch_fdp_plids(struct nvme_ns *ns, u32 nsid)
+{
+	struct nvme_command c = {};
+	struct nvme_fdp_ruh_status *ruhs;
+	struct nvme_fdp_ruh_status_desc *ruhsd;
+	int size, ret, i;
+
+	size = struct_size(ruhs, ruhsd, NVME_MAX_PLIDS);
+	ruhs = kzalloc(size, GFP_KERNEL);
+	if (!ruhs)
+		return -ENOMEM;
+
+	c.imr.opcode = nvme_cmd_io_mgmt_recv;
+	c.imr.nsid = cpu_to_le32(nsid);
+	c.imr.mo = 0x1;
+	c.imr.numd = cpu_to_le32((size >> 2) - 1);
+
+	ret = nvme_submit_sync_cmd(ns->queue, &c, ruhs, size);
+	if (ret)
+		goto out;
+
+	ns->head->nr_plids = le16_to_cpu(ruhs->nruhsd);
+	ns->head->nr_plids =
+		min_t(u16, ns->head->nr_plids, NVME_MAX_PLIDS);
+
+	for (i = 0; i < ns->head->nr_plids; i++) {
+		ruhsd = &ruhs->ruhsd[i];
+		ns->head->plids[i] = le16_to_cpu(ruhsd->pid);
+	}
+out:
+	kfree(ruhs);
+	return ret;
+}
+
 static int nvme_update_ns_info_block(struct nvme_ns *ns,
 		struct nvme_ns_info *info)
 {
@@ -2205,6 +2268,13 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 		if (ret && !nvme_first_scan(ns->disk))
 			goto out;
 	}
+	if (ns->ctrl->ctratt & NVME_CTRL_ATTR_FDPS) {
+		ret = nvme_fetch_fdp_plids(ns, info->nsid);
+		if (ret)
+			dev_warn(ns->ctrl->device,
+				"FDP failure status:0x%x\n", ret);
+	}
+
 	ret = 0;
 out:
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 313a4f978a2c..a959a9859e8b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -454,6 +454,8 @@ struct nvme_ns_ids {
 	u8 csi;
 };
 
+#define NVME_MAX_PLIDS	(WRITE_LIFE_EXTREME + 1)
+
 /*
  * Anchor structure for namespaces. There is one for each namespace in a
  * NVMe subsystem that any of our controllers can see, and the namespace
@@ -490,6 +492,8 @@ struct nvme_ns_head {
 	struct device		cdev_device;
 
 	struct gendisk		*disk;
+	u16 nr_plids;
+	u16 plids[NVME_MAX_PLIDS];
 #ifdef CONFIG_NVME_MULTIPATH
 	struct bio_list		requeue_list;
 	spinlock_t		requeue_lock;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b58d9405d65e..a954eaee5b0f 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -275,6 +275,7 @@ enum nvme_ctrl_attr {
 	NVME_CTRL_ATTR_HID_128_BIT	= (1 << 0),
 	NVME_CTRL_ATTR_TBKAS		= (1 << 6),
 	NVME_CTRL_ATTR_ELBAS		= (1 << 15),
+	NVME_CTRL_ATTR_FDPS		= (1 << 19),
 };
 
 struct nvme_id_ctrl {
@@ -843,6 +844,7 @@ enum nvme_opcode {
 	nvme_cmd_resv_register	= 0x0d,
 	nvme_cmd_resv_report	= 0x0e,
 	nvme_cmd_resv_acquire	= 0x11,
+	nvme_cmd_io_mgmt_recv	= 0x12,
 	nvme_cmd_resv_release	= 0x15,
 	nvme_cmd_zone_mgmt_send	= 0x79,
 	nvme_cmd_zone_mgmt_recv	= 0x7a,
@@ -864,6 +866,7 @@ enum nvme_opcode {
 	nvme_opcode_name(nvme_cmd_resv_register),	\
 	nvme_opcode_name(nvme_cmd_resv_report),		\
 	nvme_opcode_name(nvme_cmd_resv_acquire),	\
+	nvme_opcode_name(nvme_cmd_io_mgmt_recv),	\
 	nvme_opcode_name(nvme_cmd_resv_release),	\
 	nvme_opcode_name(nvme_cmd_zone_mgmt_send),	\
 	nvme_opcode_name(nvme_cmd_zone_mgmt_recv),	\
@@ -1015,6 +1018,7 @@ enum {
 	NVME_RW_PRINFO_PRCHK_GUARD	= 1 << 12,
 	NVME_RW_PRINFO_PRACT		= 1 << 13,
 	NVME_RW_DTYPE_STREAMS		= 1 << 4,
+	NVME_RW_DTYPE_DPLCMT		= 2 << 4,
 	NVME_WZ_DEAC			= 1 << 9,
 };
 
@@ -1102,6 +1106,20 @@ struct nvme_zone_mgmt_recv_cmd {
 	__le32			cdw14[2];
 };
 
+struct nvme_io_mgmt_recv_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__le64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__u8			mo;
+	__u8			rsvd11;
+	__u16			mos;
+	__le32			numd;
+	__le32			cdw12[4];
+};
+
 enum {
 	NVME_ZRA_ZONE_REPORT		= 0,
 	NVME_ZRASF_ZONE_REPORT_ALL	= 0,
@@ -1822,6 +1840,7 @@ struct nvme_command {
 		struct nvmf_auth_receive_command auth_receive;
 		struct nvme_dbbuf dbbuf;
 		struct nvme_directive_cmd directive;
+		struct nvme_io_mgmt_recv_cmd imr;
 	};
 };
-- 
2.25.1
* Re: [PATCH v7 1/3] nvme: enable FDP support
  2024-09-30 18:13   ` [PATCH v7 1/3] nvme: enable FDP support Kanchan Joshi
@ 2024-10-02 18:37     ` Bart Van Assche
  2024-10-03 12:55       ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Bart Van Assche @ 2024-10-02 18:37 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Hui Qi, Nitesh Shetty

On 9/30/24 11:13 AM, Kanchan Joshi wrote:
> Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
> to control the placement of logical blocks so as to reduce the SSD WAF.
>
> Userspace can send the data lifetime information using the write hints.
> The SCSI driver (sd) can already pass this information to the SCSI
> devices. This patch does the same for NVMe.
>
> Fetch the placement-identifiers if the device supports FDP.
> The incoming write-hint is mapped to a placement-identifier, which in
> turn is set in the DSPEC field of the write command.

Is the description of this patch correct? The above description suggests
that FDP information is similar in nature to SCSI data lifetime
information. I don't think that's correct.

Thanks,

Bart.
* Re: [PATCH v7 1/3] nvme: enable FDP support
  2024-10-02 18:37     ` Bart Van Assche
@ 2024-10-03 12:55       ` Christoph Hellwig
  0 siblings, 0 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:55 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Kanchan Joshi, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Hui Qi, Nitesh Shetty

On Wed, Oct 02, 2024 at 11:37:11AM -0700, Bart Van Assche wrote:
> Is the description of this patch correct? The above description suggests
> that FDP information is similar in nature to SCSI data lifetime
> information. I don't think that's correct.

As you correctly point out, it is entirely incorrect.
* [PATCH v7 2/3] block, fs: restore kiocb based write hint processing
       [not found] ` <CGME20240930182100epcas5p31a010c225f3c76aa4dc54fced32abd2a@epcas5p3.samsung.com>
@ 2024-09-30 18:13   ` Kanchan Joshi
  0 siblings, 0 replies; 86+ messages in thread
From: Kanchan Joshi @ 2024-09-30 18:13 UTC (permalink / raw)
  To: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi, Nitesh Shetty

struct kiocb has a 2-byte hole that appeared after commit 41d36a9f3e53
("fs: remove kiocb.ki_hint"). But the write hint has made a comeback
with commit 449813515d3e ("block, fs: Restore the per-bio/request data
lifetime fields").

This patch uses the leftover space in struct kiocb to carve out a
1-byte field, ki_write_hint. Restore the code that operates on a kiocb
to use ki_write_hint instead of the inode hint value.

This does not bring any behavior change, but it is needed to enable
per-io hints (by another patch).
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
---
 block/fops.c         | 6 +++---
 fs/aio.c             | 1 +
 fs/cachefiles/io.c   | 1 +
 fs/direct-io.c       | 2 +-
 fs/iomap/direct-io.c | 2 +-
 include/linux/fs.h   | 8 ++++++++
 io_uring/rw.c        | 1 +
 7 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index e696ae53bf1e..85b9b97d372c 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -74,7 +74,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 		bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
 	}
 	bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio.bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio.bi_write_hint = iocb->ki_write_hint;
 	bio.bi_ioprio = iocb->ki_ioprio;
 	if (iocb->ki_flags & IOCB_ATOMIC)
 		bio.bi_opf |= REQ_ATOMIC;
@@ -203,7 +203,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 
 	for (;;) {
 		bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-		bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+		bio->bi_write_hint = iocb->ki_write_hint;
 		bio->bi_private = dio;
 		bio->bi_end_io = blkdev_bio_end_io;
 		bio->bi_ioprio = iocb->ki_ioprio;
@@ -319,7 +319,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	dio->flags = 0;
 	dio->iocb = iocb;
 	bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio->bi_write_hint = iocb->ki_write_hint;
 	bio->bi_end_io = blkdev_bio_end_io_async;
 	bio->bi_ioprio = iocb->ki_ioprio;
 
diff --git a/fs/aio.c b/fs/aio.c
index e8920178b50f..db618817e670 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1517,6 +1517,7 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb, int rw_type)
 	req->ki_flags = req->ki_filp->f_iocb_flags | IOCB_AIO_RW;
 	if (iocb->aio_flags & IOCB_FLAG_RESFD)
 		req->ki_flags |= IOCB_EVENTFD;
+	req->ki_write_hint = file_write_hint(req->ki_filp);
 	if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
 		/*
 		 * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c
index 6a821a959b59..c3db102ae64e 100644
--- a/fs/cachefiles/io.c
+++ b/fs/cachefiles/io.c
@@ -309,6 +309,7 @@ int __cachefiles_write(struct cachefiles_object *object,
 	ki->iocb.ki_pos = start_pos;
 	ki->iocb.ki_flags = IOCB_DIRECT | IOCB_WRITE;
 	ki->iocb.ki_ioprio = get_current_ioprio();
+	ki->iocb.ki_write_hint = file_write_hint(file);
 	ki->object = object;
 	ki->start = start_pos;
 	ki->len = len;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index bbd05f1a2145..73629e26becb 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -409,7 +409,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	bio->bi_end_io = dio_bio_end_io;
 	if (dio->is_pinned)
 		bio_set_flag(bio, BIO_PAGE_PINNED);
-	bio->bi_write_hint = file_inode(dio->iocb->ki_filp)->i_write_hint;
+	bio->bi_write_hint = dio->iocb->ki_write_hint;
 
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f637aa0706a3..fff43f121ee6 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -397,7 +397,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 					  GFP_KERNEL);
 		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
-		bio->bi_write_hint = inode->i_write_hint;
+		bio->bi_write_hint = dio->iocb->ki_write_hint;
 		bio->bi_ioprio = dio->iocb->ki_ioprio;
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3c603d01337..3dfe6de7b611 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -370,6 +370,7 @@ struct kiocb {
 	void			*private;
 	int			ki_flags;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	enum rw_hint		ki_write_hint;
 	union {
 		/*
 		 * Only used for async buffered reads, where it denotes the
@@ -2337,12 +2338,18 @@ static inline bool HAS_UNMAPPED_ID(struct mnt_idmap *idmap,
 	       !vfsgid_valid(i_gid_into_vfsgid(idmap, inode));
 }
 
+static inline enum rw_hint file_write_hint(struct file *filp)
+{
+	return file_inode(filp)->i_write_hint;
+}
+
 static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 {
 	*kiocb = (struct kiocb) {
 		.ki_filp = filp,
 		.ki_flags = filp->f_iocb_flags,
 		.ki_ioprio = get_current_ioprio(),
+		.ki_write_hint = file_write_hint(filp),
 	};
 }
 
@@ -2353,6 +2360,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 		.ki_filp = filp,
 		.ki_flags = kiocb_src->ki_flags,
 		.ki_ioprio = kiocb_src->ki_ioprio,
+		.ki_write_hint = kiocb_src->ki_write_hint,
 		.ki_pos = kiocb_src->ki_pos,
 	};
 }
diff --git a/io_uring/rw.c b/io_uring/rw.c
index f023ff49c688..510123d3d837 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1023,6 +1023,7 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	if (unlikely(ret))
 		return ret;
 	req->cqe.res = iov_iter_count(&io->iter);
+	rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
 
 	if (force_nonblock) {
 		/* If the file doesn't support async, just async punt */
-- 
2.25.1
* [PATCH v7 3/3] io_uring: enable per-io hinting capability
       [not found] ` <CGME20240930182103epcas5p4c9e91ca3cdf20e900b1425ae45fef81d@epcas5p4.samsung.com>
@ 2024-09-30 18:13   ` Kanchan Joshi
  2024-10-02 14:26     ` Pavel Begunkov
  2024-10-02 18:29     ` Bart Van Assche
  0 siblings, 2 replies; 86+ messages in thread
From: Kanchan Joshi @ 2024-09-30 18:13 UTC (permalink / raw)
  To: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Kanchan Joshi, Nitesh Shetty

With the F_SET_RW_HINT fcntl, the user can set a hint on the file
inode, and all subsequent writes on the file pass that hint value down.
This can be limiting for large files (and for block devices) as all the
writes can be tagged with only one lifetime hint value. Concurrent
writes (with different hint values) are hard to manage. Per-IO hinting
solves that problem.

Allow userspace to pass additional metadata in the SQE. The type of the
passed metadata is expressed by a new field:

	__u16 meta_type;

At this point one type, META_TYPE_LIFETIME_HINT, is supported. With
this type, the user can pass lifetime hint values in the new field:

	__u64 lifetime_val;

This accepts all lifetime hint values that are possible with the
F_SET_RW_HINT fcntl.

The write handlers (io_prep_rw, io_write) send the hint value to the
lower layer using the kiocb. This is good for supporting direct IO, but
not when a kiocb is not available (e.g., buffered IO).

When per-io hints are not passed, the per-inode hint values are set in
the kiocb (as before). Otherwise, the per-io hints take precedence over
the per-inode hints.
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
---
 fs/fcntl.c                    | 22 ----------------------
 include/linux/rw_hint.h       | 24 ++++++++++++++++++++++++
 include/uapi/linux/io_uring.h | 19 +++++++++++++++++++
 io_uring/rw.c                 | 25 ++++++++++++++++++++++++-
 4 files changed, 67 insertions(+), 23 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 22dd9dcce7ec..a390a05f4ef8 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -334,28 +334,6 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
 }
 #endif
 
-static bool rw_hint_valid(u64 hint)
-{
-	BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET);
-	BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE);
-	BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT);
-	BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM);
-	BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG);
-	BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME);
-
-	switch (hint) {
-	case RWH_WRITE_LIFE_NOT_SET:
-	case RWH_WRITE_LIFE_NONE:
-	case RWH_WRITE_LIFE_SHORT:
-	case RWH_WRITE_LIFE_MEDIUM:
-	case RWH_WRITE_LIFE_LONG:
-	case RWH_WRITE_LIFE_EXTREME:
-		return true;
-	default:
-		return false;
-	}
-}
-
 static long fcntl_get_rw_hint(struct file *file, unsigned int cmd,
 			      unsigned long arg)
 {
diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h
index 309ca72f2dfb..f4373a71ffed 100644
--- a/include/linux/rw_hint.h
+++ b/include/linux/rw_hint.h
@@ -21,4 +21,28 @@ enum rw_hint {
 static_assert(sizeof(enum rw_hint) == 1);
 #endif
 
+#define WRITE_LIFE_INVALID	(RWH_WRITE_LIFE_EXTREME + 1)
+
+static inline bool rw_hint_valid(u64 hint)
+{
+	BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET);
+	BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE);
+	BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT);
+	BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM);
+	BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG);
+	BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME);
+
+	switch (hint) {
+	case RWH_WRITE_LIFE_NOT_SET:
+	case RWH_WRITE_LIFE_NONE:
+	case RWH_WRITE_LIFE_SHORT:
+	case RWH_WRITE_LIFE_MEDIUM:
+	case RWH_WRITE_LIFE_LONG:
+	case RWH_WRITE_LIFE_EXTREME:
+		return true;
+	default:
+		return false;
+	}
+}
+
 #endif /* _LINUX_RW_HINT_H */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 86cb385fe0b5..951e35226229 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -92,12 +92,23 @@ struct io_uring_sqe {
 			__u16	addr_len;
 			__u16	__pad3[1];
 		};
+		struct {
+			/* Bit field to express 16 meta types */
+			__u16	meta_type;
+			__u16	__pad4[1];
+		};
 	};
 	union {
 		struct {
 			__u64	addr3;
 			__u64	__pad2[1];
 		};
+		struct {
+			/* First meta type specific fields */
+			__u64	lifetime_val;
+			/* For future use */
+			__u64	__pad5[1];
+		};
 		__u64	optval;
 		/*
 		 * If the ring is initialized with IORING_SETUP_SQE128, then
@@ -107,6 +118,14 @@ struct io_uring_sqe {
 	};
 };
 
+enum io_uring_sqe_meta_type_bits {
+	META_TYPE_LIFETIME_HINT_BIT
+};
+
+/* this meta type covers write hint values supported by F_SET_RW_HINT fcntl */
+#define META_TYPE_LIFETIME_HINT	(1U << META_TYPE_LIFETIME_HINT_BIT)
+
+
 /*
  * If sqe->file_index is set to this for opcodes that instantiate a new
  * direct descriptor (like openat/openat2/accept), then io_uring will allocate
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 510123d3d837..bf45ee8904a4 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -269,6 +269,24 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		rw->kiocb.ki_ioprio = get_current_ioprio();
 	}
 	rw->kiocb.dio_complete = NULL;
+	if (ddir == ITER_SOURCE) {
+		u16 mtype = READ_ONCE(sqe->meta_type);
+
+		rw->kiocb.ki_write_hint = WRITE_LIFE_INVALID;
+		if (mtype) {
+			u64 lhint = READ_ONCE(sqe->lifetime_val);
+
+			if (READ_ONCE(sqe->__pad4[0]) ||
+			    READ_ONCE(sqe->__pad5[0]))
+				return -EINVAL;
+
+			if (mtype != META_TYPE_LIFETIME_HINT ||
+			    !rw_hint_valid(lhint))
+				return -EINVAL;
+
+			rw->kiocb.ki_write_hint = lhint;
+		}
+	}
 
 	rw->addr = READ_ONCE(sqe->addr);
 	rw->len = READ_ONCE(sqe->len);
@@ -1023,7 +1041,12 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	if (unlikely(ret))
 		return ret;
 	req->cqe.res = iov_iter_count(&io->iter);
-	rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
+	/*
+	 * Use per-file hint only if per-io hint is not set.
+	 * We need per-io hint to get precedence.
+	 */
+	if (rw->kiocb.ki_write_hint == WRITE_LIFE_INVALID)
+		rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
 
 	if (force_nonblock) {
 		/* If the file doesn't support async, just async punt */
-- 
2.25.1
* Re: [PATCH v7 3/3] io_uring: enable per-io hinting capability
  2024-09-30 18:13   ` [PATCH v7 3/3] io_uring: enable per-io hinting capability Kanchan Joshi
@ 2024-10-02 14:26     ` Pavel Begunkov
  2024-10-17 14:58       ` Kanchan Joshi
  2024-10-02 18:29     ` Bart Van Assche
  0 siblings, 2 replies; 86+ messages in thread
From: Pavel Begunkov @ 2024-10-02 14:26 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Nitesh Shetty

On 9/30/24 19:13, Kanchan Joshi wrote:
> With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and
> all the subsequent writes on the file pass that hint value down.
> This can be limiting for large files (and for block device) as all the
> writes can be tagged with only one lifetime hint value.
> Concurrent writes (with different hint values) are hard to manage.
> Per-IO hinting solves that problem.
>
> Allow userspace to pass additional metadata in the SQE.
> The type of passed metadata is expressed by a new field
>
> __u16 meta_type;

The new layout looks nicer, but let me elaborate on the previous
comment. I don't believe we should be restricting to only one
attribute per IO. What if someone wants to pass a lifetime hint
together with integrity information?

Instead, we might need something more extensible, like an ability
to pass a list / array of typed attributes / meta information / hints
etc. An example from networking I gave last time was control messages,
i.e. cmsg. In a basic oversimplified form, the API from the user
perspective could look like:

struct meta_attr {
	u16 type;
	u64 data;
};

struct meta_attr attr[] = {{HINT, hint_value}, {INTEGRITY, ptr}};
sqe->meta_attrs = attr;
sqe->meta_nr = 2;

I'm pretty sure there will be a bunch of considerations people will
name that the API should account for, like sizing the struct so it can
fit integrity bits, or even making it variable sized.

> At this point one type META_TYPE_LIFETIME_HINT is supported.
> With this type, user can pass lifetime hint values in the new field
>
> __u64 lifetime_val;
>
> This accepts all lifetime hint values that are possible with
> F_SET_RW_HINT fcntl.
>
> The write handlers (io_prep_rw, io_write) send the hint value to
> lower-layer using kiocb. This is good for upporting direct IO,
> but not when kiocb is not available (e.g., buffered IO).
>
> When per-io hints are not passed, the per-inode hint values are set in
> the kiocb (as before). Otherwise, these take the precedence on per-inode
> hints.

-- 
Pavel Begunkov
* Re: [PATCH v7 3/3] io_uring: enable per-io hinting capability
  2024-10-02 14:26     ` Pavel Begunkov
@ 2024-10-17 14:58       ` Kanchan Joshi
  0 siblings, 0 replies; 86+ messages in thread
From: Kanchan Joshi @ 2024-10-17 14:58 UTC (permalink / raw)
  To: Pavel Begunkov, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Nitesh Shetty

On 10/2/2024 7:56 PM, Pavel Begunkov wrote:
> On 9/30/24 19:13, Kanchan Joshi wrote:
>> With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and
>> all the subsequent writes on the file pass that hint value down.
>> This can be limiting for large files (and for block device) as all the
>> writes can be tagged with only one lifetime hint value.
>> Concurrent writes (with different hint values) are hard to manage.
>> Per-IO hinting solves that problem.
>>
>> Allow userspace to pass additional metadata in the SQE.
>> The type of passed metadata is expressed by a new field
>>
>> __u16 meta_type;
>
> The new layout looks nicer, but let me elaborate on the previous
> comment. I don't believe we should be restricting to only one
> attribute per IO. What if someone wants to pass a lifetime hint
> together with integrity information?

For that very reason I made meta_type accept multiple bit values.
META_TYPE_LIFETIME_HINT and a new META_TYPE_INTEGRITY can coexist.
Overall, 16 meta types can coexist.

> Instead, we might need something more extensible, like an ability
> to pass a list / array of typed attributes / meta information / hints
> etc. An example from networking I gave last time was control messages,
> i.e. cmsg. In a basic oversimplified form, the API from the user
> perspective could look like:
>
> struct meta_attr {
> 	u16 type;
> 	u64 data;
> };
>
> struct meta_attr attr[] = {{HINT, hint_value}, {INTEGRITY, ptr}};
> sqe->meta_attrs = attr;
> sqe->meta_nr = 2;

I did not feel like adding a pointer (and taking the copy_from_user
cost) for integrity. Currently integrity uses space in the second SQE,
which seems fine [*]. Down the line, if the meta types increase and we
are on the verge of running out of SQE space, we can resort to adding
an indirect reference.

[*] https://lore.kernel.org/linux-nvme/[email protected]/
* Re: [PATCH v7 3/3] io_uring: enable per-io hinting capability
  2024-09-30 18:13   ` [PATCH v7 3/3] io_uring: enable per-io hinting capability Kanchan Joshi
  2024-10-02 14:26     ` Pavel Begunkov
@ 2024-10-02 18:29     ` Bart Van Assche
  1 sibling, 0 replies; 86+ messages in thread
From: Bart Van Assche @ 2024-10-02 18:29 UTC (permalink / raw)
  To: Kanchan Joshi, axboe, kbusch, hch, hare, sagi, martin.petersen,
	brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence
  Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz, Nitesh Shetty

On 9/30/24 11:13 AM, Kanchan Joshi wrote:
> diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h
> index 309ca72f2dfb..f4373a71ffed 100644
> --- a/include/linux/rw_hint.h
> +++ b/include/linux/rw_hint.h
> @@ -21,4 +21,28 @@ enum rw_hint {
>  static_assert(sizeof(enum rw_hint) == 1);
>  #endif
>  
> +#define WRITE_LIFE_INVALID	(RWH_WRITE_LIFE_EXTREME + 1)
> +
> +static inline bool rw_hint_valid(u64 hint)
> +{
> +	BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET);
> +	BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE);
> +	BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT);
> +	BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM);
> +	BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG);
> +	BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME);
> +
> +	switch (hint) {
> +	case RWH_WRITE_LIFE_NOT_SET:
> +	case RWH_WRITE_LIFE_NONE:
> +	case RWH_WRITE_LIFE_SHORT:
> +	case RWH_WRITE_LIFE_MEDIUM:
> +	case RWH_WRITE_LIFE_LONG:
> +	case RWH_WRITE_LIFE_EXTREME:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}

Moving a function that is not in the hot path from a .c file to a .h
file is wrong. This increases the kernel size, slows down compilation
and makes implementation details visible to callers that should not
have access to these implementation details.

Bart.
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-09-30 18:13 ` [PATCH v7 0/3] FDP and per-io hints Kanchan Joshi
                     ` (2 preceding siblings ...)
       [not found] ` <CGME20240930182103epcas5p4c9e91ca3cdf20e900b1425ae45fef81d@epcas5p4.samsung.com>
@ 2024-10-01  9:20 ` Christoph Hellwig
  2024-10-01 15:58   ` James R. Bergsten
                       ` (2 more replies)
  2024-10-03  0:20 ` Bart Van Assche
  2024-10-15  5:50 ` Christoph Hellwig
  5 siblings, 3 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-01 9:20 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
	jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
	linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
	gost.dev, vishak.g, javier.gonz

Any reason you completely ignored my feedback on the last version
and did not even answer?  That's not a very productive way to work.
* RE: [PATCH v7 0/3] FDP and per-io hints 2024-10-01 9:20 ` [PATCH v7 0/3] FDP and per-io hints Christoph Hellwig @ 2024-10-01 15:58 ` James R. Bergsten 2024-10-01 16:18 ` Jens Axboe 2024-10-01 16:23 ` Keith Busch 2 siblings, 0 replies; 86+ messages in thread From: James R. Bergsten @ 2024-10-01 15:58 UTC (permalink / raw) To: 'Christoph Hellwig', 'Kanchan Joshi' Cc: axboe, kbusch, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz Now THIS Is the NVMe I came to know and love. 😊 -----Original Message----- From: Linux-nvme <[email protected]> On Behalf Of Christoph Hellwig Sent: Tuesday, October 1, 2024 2:21 AM To: Kanchan Joshi <[email protected]> Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected] Subject: Re: [PATCH v7 0/3] FDP and per-io hints Any reason you completely ignored my feedback on the last version and did not even answer? That's not a very productive way to work. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-01  9:20 ` [PATCH v7 0/3] FDP and per-io hints Christoph Hellwig
  2024-10-01 15:58   ` James R. Bergsten
@ 2024-10-01 16:18   ` Jens Axboe
  2024-10-02  7:51     ` Christoph Hellwig
  2024-10-01 16:23   ` Keith Busch
  2 siblings, 1 reply; 86+ messages in thread
From: Jens Axboe @ 2024-10-01 16:18 UTC (permalink / raw)
  To: Christoph Hellwig, Kanchan Joshi
  Cc: kbusch, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On 10/1/24 3:20 AM, Christoph Hellwig wrote:
> Any reason you completely ignored my feedback on the last version
> and did not even answer?

Probably because he got a little tired of dealing with the bullshit
related to this topic.

> That's not a very productive way to work.

Have to say, neither have your responses been. Can't really fault people
for just giving up at some point, when no productive end seems to be in
sight.

As far as I'm concerned, this looks fine to me. There are customers
wanting to use it, and folks making drives that support it. It's not
right to continually gatekeep this feature just because you don't like
it. I need to review the io_uring bits, but that should be the least of
it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-01 16:18 ` Jens Axboe
@ 2024-10-02  7:51 ` Christoph Hellwig
  2024-10-02 15:03   ` Jens Axboe
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-02 7:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Kanchan Joshi, kbusch, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Tue, Oct 01, 2024 at 10:18:41AM -0600, Jens Axboe wrote:
> Have to say, that neither have your responses been. Can't really fault
> people for just giving up at some point, when no productive end seems to
> be in sight.

I heavily disagree and take offence at that.

The previous stream separation approach made total sense, but just
needed a fair amount of work. But it closely matches how things work at
the hardware and file system level, so it was the right approach.

Suddenly dropping that and ignoring all comments really feels like
someone urgently needs to pull a marketing stunt here.

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-02 7:51 ` Christoph Hellwig @ 2024-10-02 15:03 ` Jens Axboe 2024-10-02 15:13 ` Christoph Hellwig 0 siblings, 1 reply; 86+ messages in thread From: Jens Axboe @ 2024-10-02 15:03 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, kbusch, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On 10/2/24 1:51 AM, Christoph Hellwig wrote: > On Tue, Oct 01, 2024 at 10:18:41AM -0600, Jens Axboe wrote: >> Have to say, that neither have your responses been. Can't really fault >> people for just giving up at some point, when no productive end seems to >> be in sight. > > I heavily disagree and take offence on that. > > The previous stream separation approach made total sense, but just > needed a fair amount of work. But it closely matches how things work > at the hardware and file system level, so it was the right approach. What am I missing that makes this effort that different from streams? Both are a lifetime hint. > Suddenly dropping that and ignoring all comments really feels like > someone urgently needs to pull a marketing stunt here. I think someone is just trying to finally get this feature in, so it can get used by customers, after many months of fairly unproductive discussion. -- Jens Axboe ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-02 15:03 ` Jens Axboe @ 2024-10-02 15:13 ` Christoph Hellwig 2024-10-02 15:17 ` Keith Busch 2024-10-02 15:22 ` Jens Axboe 0 siblings, 2 replies; 86+ messages in thread From: Christoph Hellwig @ 2024-10-02 15:13 UTC (permalink / raw) To: Jens Axboe Cc: Christoph Hellwig, Kanchan Joshi, kbusch, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On Wed, Oct 02, 2024 at 09:03:22AM -0600, Jens Axboe wrote: > > The previous stream separation approach made total sense, but just > > needed a fair amount of work. But it closely matches how things work > > at the hardware and file system level, so it was the right approach. > > What am I missing that makes this effort that different from streams? > Both are a lifetime hint. A stream has a lot less strings attached. But hey, don't make me argue for streams - you pushed the API in despite reservations back then, and we've learned a lot since. > > Suddenly dropping that and ignoring all comments really feels like > > someone urgently needs to pull a marketing stunt here. > > I think someone is just trying to finally get this feature in, so it > can get used by customers, after many months of fairly unproductive > discussion. Well, he finally started on the right approach and gave it up after the first round of feedback. That's just a bit weird. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-02 15:13 ` Christoph Hellwig @ 2024-10-02 15:17 ` Keith Busch 2024-10-02 15:19 ` Christoph Hellwig 2024-10-02 15:22 ` Jens Axboe 1 sibling, 1 reply; 86+ messages in thread From: Keith Busch @ 2024-10-02 15:17 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Kanchan Joshi, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On Wed, Oct 02, 2024 at 05:13:44PM +0200, Christoph Hellwig wrote: > > Well, he finally started on the right approach and gave it up after the > first round of feedback. That's just a bit weird. Nothing prevents future improvements in that direction. It just seems out of scope for what Kanchan is trying to enable for his customer use cases. This patch looks harmless. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:17 ` Keith Busch
@ 2024-10-02 15:19 ` Christoph Hellwig
  2024-10-02 15:33   ` Keith Busch
  2024-10-02 15:47   ` Martin K. Petersen
  0 siblings, 2 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-02 15:19 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Kanchan Joshi, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 09:17:35AM -0600, Keith Busch wrote:
> On Wed, Oct 02, 2024 at 05:13:44PM +0200, Christoph Hellwig wrote:
> >
> > Well, he finally started on the right approach and gave it up after the
> > first round of feedback. That's just a bit weird.
>
> Nothing prevents future improvements in that direction. It just seems
> out of scope for what Kanchan is trying to enable for his customer use
> cases. This patch looks harmless.

It's not really. Once we wire it up like this we mess up the ability to
use the feature in other ways. Additionally, the per-I/O hints are
simply broken if you want a file system in the way, as last time, and if
I don't misremember you acknowledged that in person last week.

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-02 15:19 ` Christoph Hellwig @ 2024-10-02 15:33 ` Keith Busch 2024-10-03 12:51 ` Christoph Hellwig 2024-10-02 15:47 ` Martin K. Petersen 1 sibling, 1 reply; 86+ messages in thread From: Keith Busch @ 2024-10-02 15:33 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Kanchan Joshi, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On Wed, Oct 02, 2024 at 05:19:49PM +0200, Christoph Hellwig wrote: > > Nothing prevents future improvements in that direction. It just seems > > out of scope for what Kanchan is trying to enable for his customer use > > cases. This patch looks harmless. > > It's not really. Once we wire it up like this we mess up the ability > to use the feature in other ways. Additionally the per-I/O hints are > simply broken if you want a file system in the way as last time, > and if I don't misremember you acknowledged in person last week. You remember right. I also explained the use cases for per-io hints are to replace current passthrough users with generic read/write paths that have all the memory guard rails, read/write stats, and other features that you don't get with passthrough. This just lets you accomplish the same placement hints with the nvme raw block device that you get with the nvme char device. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:33 ` Keith Busch
@ 2024-10-03 12:51 ` Christoph Hellwig
  0 siblings, 0 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:51 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Kanchan Joshi, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 09:33:16AM -0600, Keith Busch wrote:
> You remember right. I also explained the use cases for per-io hints are
> to replace current passthrough users with generic read/write paths that
> have all the memory guard rails, read/write stats, and other features
> that you don't get with passthrough. This just lets you accomplish the
> same placement hints with the nvme raw block device that you get with
> the nvme char device.

Yes, and as I said before we need an opt-in from the file_ops, because
file systems can't support this. And I'm speaking from a position of
working on a file system that made the existing non-ideal hints work
(in all fairness, Hans did the hard work, but I was part of it).

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-02 15:19 ` Christoph Hellwig 2024-10-02 15:33 ` Keith Busch @ 2024-10-02 15:47 ` Martin K. Petersen 2024-10-02 18:34 ` Bart Van Assche 2024-10-03 12:54 ` Christoph Hellwig 1 sibling, 2 replies; 86+ messages in thread From: Martin K. Petersen @ 2024-10-02 15:47 UTC (permalink / raw) To: Christoph Hellwig Cc: Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz Christoph, >> Nothing prevents future improvements in that direction. It just seems >> out of scope for what Kanchan is trying to enable for his customer use >> cases. This patch looks harmless. > > It's not really. Once we wire it up like this we mess up the ability > to use the feature in other ways. Additionally the per-I/O hints are > simply broken if you want a file system Here is my take: It is the kernel's job to manage the system's hardware resources and arbitrate and share these resources optimally and fairly between all running applications. What irks me is defining application interfaces which fundamentally tell the kernel that "these blocks are part of the same file". The kernel already knows this. It is the very entity which provides that abstraction. Why do we need an explicit interface to inform the kernel that concurrent writes to the same file should have the same "temperature" or need to go to the same "bin" on the storage device? Shouldn't that just happen automatically? Whether it's SCSI groups, streams, UFS hints, or NVMe FDP, it seems like we are consistently failing to deliver something that actually works for anything but a few specialized corner cases. I think that is a shame. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-02 15:47 ` Martin K. Petersen @ 2024-10-02 18:34 ` Bart Van Assche 2024-10-03 12:55 ` Christoph Hellwig 2024-10-03 12:54 ` Christoph Hellwig 1 sibling, 1 reply; 86+ messages in thread From: Bart Van Assche @ 2024-10-02 18:34 UTC (permalink / raw) To: Martin K. Petersen, Christoph Hellwig Cc: Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On 10/2/24 8:47 AM, Martin K. Petersen wrote: > What irks me is defining application interfaces which fundamentally tell > the kernel that "these blocks are part of the same file". Isn't FDP about communicating much more than only this information to the block device, e.g. information about reclaim units? Although I'm personally not interested in FDP, my colleagues were involved in the standardization of FDP. Thanks, Bart. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 18:34 ` Bart Van Assche
@ 2024-10-03 12:55 ` Christoph Hellwig
  2024-10-03 21:48   ` Keith Busch
  2024-10-04  6:21   ` Javier González
  0 siblings, 2 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-03 12:55 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Martin K. Petersen, Christoph Hellwig, Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Wed, Oct 02, 2024 at 11:34:47AM -0700, Bart Van Assche wrote:
> Isn't FDP about communicating much more than only this information to
> the block device, e.g. information about reclaim units? Although I'm
> personally not interested in FDP, my colleagues were involved in the
> standardization of FDP.

Yes, it is. And when I explained how properly exporting this kind of
information could be implemented on top of the version Kanchan sent,
everyone suddenly stopped discussing technical points and went either
silent or all political.

So I think some people's bonuses depend on not understanding the
problem, I fear :(

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-03 12:55 ` Christoph Hellwig @ 2024-10-03 21:48 ` Keith Busch 2024-10-03 22:00 ` Bart Van Assche 2024-10-04 6:21 ` Javier González 1 sibling, 1 reply; 86+ messages in thread From: Keith Busch @ 2024-10-03 21:48 UTC (permalink / raw) To: Christoph Hellwig Cc: Bart Van Assche, Martin K. Petersen, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On Thu, Oct 03, 2024 at 02:55:16PM +0200, Christoph Hellwig wrote: > On Wed, Oct 02, 2024 at 11:34:47AM -0700, Bart Van Assche wrote: > > Isn't FDP about communicating much more than only this information to > > the block device, e.g. information about reclaim units? Although I'm > > personally not interested in FDP, my colleagues were involved in the > > standardization of FDP. > > Yes, it is. And when I explained how to properly export this kind of > information can be implemented on top of the version Kanchan sent everyone > suddenly stopped diskussion technical points and went either silent or > all political. The nominals can mean whatever you want. If you want it to mean "temperature", then that's what it means. If you want it to mean something else, then don't use this. These are hints at the end of the day. Nothing locks the kernel into this if a better solution develops. As you know, it was ripped out before. > So I think some peoples bonuses depend on not understanding the problem > I fear :( The only "bonus" I have is not repeatedly explaining why people can't use h/w features the way they want. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 21:48 ` Keith Busch
@ 2024-10-03 22:00 ` Bart Van Assche
  2024-10-03 22:12   ` Jens Axboe
  2024-10-03 22:17   ` Keith Busch
  0 siblings, 2 replies; 86+ messages in thread
From: Bart Van Assche @ 2024-10-03 22:00 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: Martin K. Petersen, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On 10/3/24 2:48 PM, Keith Busch wrote:
> The only "bonus" I have is not repeatedly explaining why people can't
> use h/w features the way they want.

Hi Keith,

Although that's a fair argument, what are the use cases for this patch
series? Filesystems in the kernel? Filesystems implemented in user
space? Perhaps something else?

This patch series adds a new user space interface for passing hints to
storage devices (in io_uring). As we all know, such interfaces are hard
to remove once these have been added.

We don't need new user space interfaces to support FDP for filesystems
in the kernel.

For filesystems implemented in user space, would using NVMe pass-through
be a viable approach? With this approach, no new user space interfaces
have to be added.

I'm wondering how to unblock FDP users without adding a new
controversial mechanism in the kernel.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-03 22:00 ` Bart Van Assche @ 2024-10-03 22:12 ` Jens Axboe 2024-10-03 22:17 ` Keith Busch 1 sibling, 0 replies; 86+ messages in thread From: Jens Axboe @ 2024-10-03 22:12 UTC (permalink / raw) To: Bart Van Assche, Keith Busch, Christoph Hellwig Cc: Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On 10/3/24 4:00 PM, Bart Van Assche wrote: > On 10/3/24 2:48 PM, Keith Busch wrote: >> The only "bonus" I have is not repeatedly explaining why people can't >> use h/w features the way they want. > > Hi Keith, > > Although that's a fair argument, what are the use cases for this patch > series? Filesystems in the kernel? Filesystems implemented in user > space? Perhaps something else? > > This patch series adds new a new user space interface for passing hints > to storage devices (in io_uring). As we all know such interfaces are > hard to remove once these have been added. It's a _hint_, I'm not sure how many times that has to be stated. The kernel is free to ignore it, and in the future, it may very well do that. We already had fcntl interfaces for streams, in the same vein, and we yanked those in the sense that they ended up doing _nothing_. Did things break because of that? Of course not. This is no different. > We don't need new user space interfaces to support FDP for filesystems > in the kernel. > > For filesystems implemented in user space, would using NVMe pass-through > be a viable approach? With this approach, no new user space interfaces > have to be added. > > I'm wondering how to unblock FDP users without adding a new > controversial mechanism in the kernel. pass-through already works, obviously - this is more about adding a viable real interface for it. 
If it's a feature that is in devices AND customers want to use, then pointing them at pass-through is a cop-out. -- Jens Axboe ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 22:00 ` Bart Van Assche
  2024-10-03 22:12   ` Jens Axboe
@ 2024-10-03 22:17   ` Keith Busch
  1 sibling, 0 replies; 86+ messages in thread
From: Keith Busch @ 2024-10-03 22:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Martin K. Petersen, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Thu, Oct 03, 2024 at 03:00:21PM -0700, Bart Van Assche wrote:
> On 10/3/24 2:48 PM, Keith Busch wrote:
> > The only "bonus" I have is not repeatedly explaining why people can't
> > use h/w features the way they want.
>
> Hi Keith,
>
> Although that's a fair argument, what are the use cases for this patch
> series? Filesystems in the kernel? Filesystems implemented in user
> space? Perhaps something else?
>
> This patch series adds a new user space interface for passing hints
> to storage devices (in io_uring). As we all know such interfaces are
> hard to remove once these have been added.
>
> We don't need new user space interfaces to support FDP for filesystems
> in the kernel.
>
> For filesystems implemented in user space, would using NVMe pass-through
> be a viable approach? With this approach, no new user space interfaces
> have to be added.
>
> I'm wondering how to unblock FDP users without adding a new
> controversial mechanism in the kernel.

I really like the passthrough interface. This is flexible and enables
experimenting with all sorts of features. But it opens the door to
exploits and pitfalls. Just check the CVEs against it (ex:
CVE-2023-6238; note: the same problem exists without metadata). The
consequences of misuse include memory corruption and private memory
leakage. The barriers for exploits are too low, and the potential for
mistakes is too high.

A possible h/w solution requires an IOMMU, but that's a non-starter for
many users and only solves the private memory vulnerabilities.

Passthrough also skips all the block layer features. I just today added
statistics to it, staged for 6.13, but even that requires yet another
config step to enable it.

tl;dr: no one wants to use nvme passthrough in production long term.

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-03 12:55 ` Christoph Hellwig
  2024-10-03 21:48   ` Keith Busch
@ 2024-10-04  6:21   ` Javier González
  2024-10-04  6:24     ` Christoph Hellwig
  1 sibling, 1 reply; 86+ messages in thread
From: Javier González @ 2024-10-04 6:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Martin K. Petersen, Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g

On 03.10.2024 14:55, Christoph Hellwig wrote:
>On Wed, Oct 02, 2024 at 11:34:47AM -0700, Bart Van Assche wrote:
>> Isn't FDP about communicating much more than only this information to
>> the block device, e.g. information about reclaim units? Although I'm
>> personally not interested in FDP, my colleagues were involved in the
>> standardization of FDP.
>
>Yes, it is. And when I explained how to properly export this kind of
>information can be implemented on top of the version Kanchan sent everyone
>suddenly stopped diskussion technical points and went either silent or
>all political.
>
>So I think some peoples bonuses depend on not understanding the problem
>I fear :(

Please, don't.

Childish comments like this delegitimize the work that a lot of people
are doing in Linux. We all operate under the assumption that folks here
know how to wear two hats, and that there is no such thing as incentives
so specific that they push _or_ block upstream contributions without a
use-case and technical background.

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  6:21 ` Javier González
@ 2024-10-04  6:24 ` Christoph Hellwig
  2024-10-04  6:59   ` Javier González
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-04 6:24 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Bart Van Assche, Martin K. Petersen, Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Fri, Oct 04, 2024 at 08:21:29AM +0200, Javier González wrote:
>> So I think some peoples bonuses depend on not understanding the problem
>> I fear :(
>
> Please, don't.
>
> Childish comments like this delegitimize the work that a lot of people
> are doing in Linux.

It's the only very sarcastic explanation I can come up with to explain
this discussion, where people from exactly the two companies where this
might be bonus material love political discussion and drop dead when it
turns technical. I'm probably wrong, but I've run out of other
explanations.

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  6:24 ` Christoph Hellwig
@ 2024-10-04  6:59 ` Javier González
  2024-10-04 12:32   ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Javier González @ 2024-10-04 6:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Martin K. Petersen, Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g

On 04.10.2024 08:24, Christoph Hellwig wrote:
>On Fri, Oct 04, 2024 at 08:21:29AM +0200, Javier González wrote:
>>> So I think some peoples bonuses depend on not understanding the problem
>>> I fear :(
>>
>> Please, don't.
>>
>> Childish comments like this delegitimize the work that a lot of people
>> are doing in Linux.
>
>It's the only very sarcastic explanation I can come up with to explain
>this discussion, where people from the exactly two companies where this
>might be bonus material love political discussion and drop dead when it
>turns technical.

FDP has authors from Meta, Google, Kioxia, Micron, Hynix, Solidigm,
Microchip, Marvell, FADU, WDC, and Samsung.

The fact that 2 of these companies are the ones starting to build the
Linux ecosystem should not surprise you, as it is the way things work
normally.

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04  6:59 ` Javier González
@ 2024-10-04 12:32 ` Christoph Hellwig
  2024-10-07 11:29   ` Javier González
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-04 12:32 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Bart Van Assche, Martin K. Petersen, Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Fri, Oct 04, 2024 at 08:59:23AM +0200, Javier González wrote:
> FDP has authors from Meta, Google, Kioxia, Micron, Hynix, Solidigm,
> Microchip, Marvell, FADU, WDC, and Samsung.
>
> The fact that 2 of these companies are the ones starting to build the
> Linux ecosystem should not surprise you, as it is the way things work
> normally.

That's not the point. There is one company that drives entirely
pointless marketing BS, and that one is pretty central here. The same
company that said FDP has absolutely no intent to work on Linux and
fought my initial attempt to make the protocol not totally unusable on
any layered system. And no, that's not Samsung.

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-04 12:32 ` Christoph Hellwig
@ 2024-10-07 11:29 ` Javier González
  2024-10-08 12:27   ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Javier González @ 2024-10-07 11:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Martin K. Petersen, Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g

On 04.10.2024 14:32, Christoph Hellwig wrote:
>On Fri, Oct 04, 2024 at 08:59:23AM +0200, Javier González wrote:
>> FDP has authors from Meta, Google, Kioxia, Micron, Hynix, Solidigm,
>> Microchip, Marvell, FADU, WDC, and Samsung.
>>
>> The fact that 2 of these companies are the ones starting to build the
>> Linux ecosystem should not surprise you, as it is the way things work
>> normally.
>
>That's not the point. There is one company that drives entirely
>pointless marketing BS, and that one is pretty central here. The same
>company that said FDP has absolutely no intent to work on Linux and
>fought my initial attempt to make the protocol not totally unusable on
>any layered system. And no, that's not Samsung.

So you had an interaction in the working group, your feedback was not
taken into consideration by the authors, and the result is that FDP
cannot be supported in Linux as a consequence of that? Come on...

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-07 11:29 ` Javier González
@ 2024-10-08 12:27 ` Christoph Hellwig
  0 siblings, 0 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-08 12:27 UTC (permalink / raw)
  To: Javier González
  Cc: Christoph Hellwig, Bart Van Assche, Martin K. Petersen, Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Mon, Oct 07, 2024 at 01:29:31PM +0200, Javier González wrote:
>> That's not the point. There is one company that drives entirely
>> pointless marketing BS, and that one is pretty central here. The same
>> company that said FDP has absolutely no intent to work on Linux and
>> fought my initial attempt to make the protocol not totally unusable on
>> any layered system. And no, that's not Samsung.
>
> So you had an interaction in the working group, your feedback was not
> taken into consideration by the authors, and the result is that FDP
> cannot be supported in Linux as a consequence of that? Come on...

No, what I am saying is that the "small" FDP group that was doing the
development, while keeping the work away from the larger group, insisted
that FDP is only for use in userspace drivers. Even getting the basics
in to properly make it suitable for a semi-multi-tenant setup like Linux
io_uring passthrough was not welcome, and was actually fought tooth and
nail by one particular company. It took a long talk with the head of the
NVMe board to even get this sorted out.

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-02 15:47 ` Martin K. Petersen 2024-10-02 18:34 ` Bart Van Assche @ 2024-10-03 12:54 ` Christoph Hellwig 2024-10-03 22:14 ` Jens Axboe 1 sibling, 1 reply; 86+ messages in thread From: Christoph Hellwig @ 2024-10-03 12:54 UTC (permalink / raw) To: Martin K. Petersen Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On Wed, Oct 02, 2024 at 11:47:32AM -0400, Martin K. Petersen wrote: > It is the kernel's job to manage the system's hardware resources and > arbitrate and share these resources optimally and fairly between all > running applications. Exactly. > What irks me is defining application interfaces which fundamentally tell > the kernel that "these blocks are part of the same file". > > The kernel already knows this. It is the very entity which provides that > abstraction. Why do we need an explicit interface to inform the kernel > that concurrent writes to the same file should have the same > "temperature" or need to go to the same "bin" on the storage device? > Shouldn't that just happen automatically? For files: yes. The problem is when you have more files than buckets on the device or file systems. Typical enterprise SSDs support somewhere between 8 and 16 write streams, and there typically is more data than that. So trying to group it somehow is a good idea, as not all files can have their own bucket. Allowing this inside a file, as done in this patch set, on the other hand is pretty crazy. > Whether it's SCSI groups, streams, UFS hints, or NVMe FDP, it seems like > we are consistently failing to deliver something that actually works for > anything but a few specialized corner cases. I think that is a shame. Yes, it is. 
And as someone who has been sitting in this group, that is because it's always someone in a place of power forcing their version through because they got a promotion, a best-of-show award, or whatever. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-03 12:54 ` Christoph Hellwig @ 2024-10-03 22:14 ` Jens Axboe 2024-10-04 5:31 ` Christoph Hellwig 0 siblings, 1 reply; 86+ messages in thread From: Jens Axboe @ 2024-10-03 22:14 UTC (permalink / raw) To: Christoph Hellwig, Martin K. Petersen Cc: Keith Busch, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On 10/3/24 6:54 AM, Christoph Hellwig wrote: > For file: yes. The problem is when you have more files than buckets on > the device or file systems. Typical enterprise SSDs support somewhere > between 8 and 16 write streams, and there typically is more data than > that. So trying to group it somehow is good idea as not all files can > have their own bucket. > > Allowing this inside a file like done in this patch set on the other > hand is pretty crazy. I do agree that per-file hints are not ideal. In the spirit of making some progress, how about we just retain per-io hints initially? We can certainly make that work over dio. Yes buffered IO won't work initially, but at least we're getting somewhere. -- Jens Axboe ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-03 22:14 ` Jens Axboe @ 2024-10-04 5:31 ` Christoph Hellwig 2024-10-04 6:18 ` Javier González 0 siblings, 1 reply; 86+ messages in thread From: Christoph Hellwig @ 2024-10-04 5:31 UTC (permalink / raw) To: Jens Axboe Cc: Christoph Hellwig, Martin K. Petersen, Keith Busch, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On Thu, Oct 03, 2024 at 04:14:57PM -0600, Jens Axboe wrote: > On 10/3/24 6:54 AM, Christoph Hellwig wrote: > > For files: yes. The problem is when you have more files than buckets on > > the device or file systems. Typical enterprise SSDs support somewhere > > between 8 and 16 write streams, and there typically is more data than > > that. So trying to group it somehow is a good idea, as not all files can > > have their own bucket. > > > > Allowing this inside a file, as done in this patch set, on the other > > hand is pretty crazy. > > I do agree that per-file hints are not ideal. In the spirit of making > some progress, how about we just retain per-io hints initially? We can > certainly make that work over dio. Yes buffered IO won't work initially, > but at least we're getting somewhere. Huh? Per I/O hints at the syscall level are the problem (see also the reply from Martin). Per file makes total sense, but we need the file system in control. The real problem is further down the stack. For the SCSI temperature hints just passing them on makes sense. But when you map to some kind of stream separation in the device, no matter if that is streams, FDP, or various kinds of streams we don't even support in things like CF and SD cards, the driver is not the right place to map temperature hints to streams. That requires some kind of intelligence. 
It could be dirt simple and just do a best effort mapping of the temperature hints 1:1 to separate write streams, or do a little mapping if there are not enough of them, which should work fine for a raw block device. But once we have a file system, things get more complicated: - the file system will want its own streams for metadata and GC - even with that, on beefy enough hardware you can have more streams than temperature levels, and the file system can and should do intelligent placement (usually based on files) Or to summarize: the per-file temperature hints make sense as a user interface. Per-I/O hints tend to be really messy at least if a file system is involved. Placing the temperatures to separate write streams in the driver does not scale even to the most trivial write stream aware file system implementations. And for anyone who followed the previous discussions of the patches none of this should be new; each point has been made at least three times before. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-04 5:31 ` Christoph Hellwig @ 2024-10-04 6:18 ` Javier González 2024-10-04 6:27 ` Christoph Hellwig 0 siblings, 1 reply; 86+ messages in thread From: Javier González @ 2024-10-04 6:18 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g On 04.10.2024 07:31, Christoph Hellwig wrote: >On Thu, Oct 03, 2024 at 04:14:57PM -0600, Jens Axboe wrote: >> On 10/3/24 6:54 AM, Christoph Hellwig wrote: >> > For file: yes. The problem is when you have more files than buckets on >> > the device or file systems. Typical enterprise SSDs support somewhere >> > between 8 and 16 write streams, and there typically is more data than >> > that. So trying to group it somehow is good idea as not all files can >> > have their own bucket. >> > >> > Allowing this inside a file like done in this patch set on the other >> > hand is pretty crazy. >> >> I do agree that per-file hints are not ideal. In the spirit of making >> some progress, how about we just retain per-io hints initially? We can >> certainly make that work over dio. Yes buffered IO won't work initially, >> but at least we're getting somewhere. > >Huh? Per I/O hints at the syscall level are the problem (see also the >reply from Martin). Per file make total sense, but we need the file >system in control. > >The real problem is further down the stack. For the SCSI temperature >hints just passing them on make sense. But when you map to some kind >of stream separation in the device, no matter if that is streams, FDP, >or various kinds of streams we don't even support in thing like CF >and SDcard, the driver is not the right place to map temperature hint >to streams. The requires some kind of intelligence. 
>It could be dirt simple and just do a best effort mapping of the temperature >hints 1:1 to separate write streams, or do a little mapping if there >are not enough of them, which should work fine for a raw block device. > >But once we have a file system, things get more complicated: > > - the file system will want its own streams for metadata and GC > - even with that on beefy enough hardware you can have more streams > than temperature levels, and the file system can and should > do intelligent placement (usually based on files) > >Or to summarize: the per-file temperature hints make sense as a user >interface. Per-I/O hints tend to be really messy at least if a file >system is involved. Placing the temperatures to separate write streams >in the driver does not scale even to the most trivial write stream aware >file system implementations. > >And for anyone who followed the previous discussions of the patches >none of this should be new; each point has been made at least three >times before. Looking at the work you and Hans have been doing on XFS, it seems you have been successful at mapping the semantics of the temperature to zones (which have no semantics, just like FDP). What is the difference between the mapping in zones and for FDP? The whole point is using an existing interface to cover the use-case of people wanting hints in block. If you have a use-case for enabling hints on file, let's work on this afterwards. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-04 6:18 ` Javier González @ 2024-10-04 6:27 ` Christoph Hellwig 2024-10-04 6:52 ` Javier González 0 siblings, 1 reply; 86+ messages in thread From: Christoph Hellwig @ 2024-10-04 6:27 UTC (permalink / raw) To: Javier González Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g On Fri, Oct 04, 2024 at 08:18:11AM +0200, Javier González wrote: >> And for anyone who followed the previous discussions of the patches >> none of this should be new; each point has been made at least three >> times before. > > Looking at the work you and Hans have been doing on XFS, it seems you > have been successful at mapping the semantics of the temperature to > zones (which have no semantics, just like FDP). > > What is the difference between the mapping in zones and for FDP? Probably not much, except for all the pitfalls in the FDP not-quite-hint, not-mandatory design. > The whole point is using an existing interface to cover the use-case of > people wanting hints in block. And that's fine. And the point, made all the way back for months, is that doing a completely dumb mapping in the driver for that is fundamentally wrong. Kanchan's previous series did about 50% of the work to get rid of it, but then everyone dropped dead and played politics. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-04 6:27 ` Christoph Hellwig @ 2024-10-04 6:52 ` Javier González 2024-10-04 12:30 ` Christoph Hellwig 0 siblings, 1 reply; 86+ messages in thread From: Javier González @ 2024-10-04 6:52 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g On 04.10.2024 08:27, Christoph Hellwig wrote: >On Fri, Oct 04, 2024 at 08:18:11AM +0200, Javier González wrote: >>> And for anyone who followed the previous discussions of the patches >>> none of this should be new; each point has been made at least three >>> times before. >> >> Looking at the work you and Hans have been doing on XFS, it seems you >> have been successful at mapping the semantics of the temperature to >> zones (which have no semantics, just like FDP). >> >> What is the difference between the mapping in zones and for FDP? > >Probably not much, except for all the pitfalls in the FDP not-quite-hint, >not-mandatory design. It is too late to change the first version of FDP products, and the solution of "go back to NVMe and make it match SCSI" is not realistic. We have drives from several companies out there supporting this version of FDP. And we have customers that are actively using them. It is our collective responsibility to support them. So, considering that file systems _are_ able to use temperature hints and actually make them work, why don't we support FDP the same way we are supporting zones so that people can use it in production? I agree that down the road, an interface that allows hints (many more than 5!) is needed. And in my opinion, this interface should not have semantics attached to it, just a hint ID, #hints, and enough space to put 100s of them to support storage node deployments. 
But this needs to come from the users of the hints / zones / streams / etc, not from us vendors. We have neither the details on how they deploy these features at scale nor the workloads to validate the results. Anything else will probably just continue polluting the storage stack with more interfaces that are not used and add to the problem of data placement fragmentation. > >> The whole point is using an existing interface to cover the use-case of >> people wanting hints in block. > >And that's fine. And the point, made all the way back for months, is that >doing a completely dumb mapping in the driver for that is fundamentally >wrong. Kanchan's previous series did about 50% of the work to get >rid of it, but then everyone dropped dead and played politics. The issue is that the first series of this patch, which is as simple as it gets, hit the list in May. Since then we have been down paths that lead nowhere. So the line between real technical feedback that leads to a feature being merged and technical misdirection that just keeps people busy becomes very thin. In the whole data placement effort, we have been down this path many times, unfortunately... At LPC, we discussed the work you did in XFS and it became clear that coming back to the first version of the patches was the easiest way to support the use-case that current customers have. It is a pity you did not attend the conference and could not make a point against this line of thought there. So, summarizing, all that folks in this thread are asking for is a path forward for solving the block use-case, which is the only one that I am aware of requiring kernel support. This has been validated with real workloads, so it is very specific and tangible. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-04 6:52 ` Javier González @ 2024-10-04 12:30 ` Christoph Hellwig 2024-10-07 10:10 ` Javier González 0 siblings, 1 reply; 86+ messages in thread From: Christoph Hellwig @ 2024-10-04 12:30 UTC (permalink / raw) To: Javier González Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote: > So, considering that file systems _are_ able to use temperature hints and > actually make them work, why don't we support FDP the same way we are > supporting zones so that people can use it in production? Because apparently no one has tried it. It should be possible in theory, but for example unless you have a power-of-2 reclaim unit size it won't work very well with XFS, where the AGs/RTGs must be power-of-two aligned in the LBA space, except by overprovisioning the LBA space vs the capacity actually used. > I agree that down the road, an interface that allows hints (many more > than 5!) is needed. And in my opinion, this interface should not have > semantics attached to it, just a hint ID, #hints, and enough space to > put 100s of them to support storage node deployments. But this needs to > come from the users of the hints / zones / streams / etc, not from > us vendors. We have neither the details on how they deploy these > features at scale nor the workloads to validate the results. Anything > else will probably just continue polluting the storage stack with more > interfaces that are not used and add to the problem of data placement > fragmentation. Please always mention what layer you are talking about. At the syscall level the temperature hints are doing quite ok. 
A full stream separation would obviously be a lot better, as would be communicating the zone / reclaim unit / etc size. As an interface to a driver that doesn't natively speak temperature hints, on the other hand, it doesn't work at all. > The issue is that the first series of this patch, which is as simple as > it gets, hit the list in May. Since then we have been down paths that lead > nowhere. So the line between real technical feedback that leads to > a feature being merged and technical misdirection that just keeps people > busy becomes very thin. In the whole data placement effort, we have > been down this path many times, unfortunately... Well, the previous round was the first one actually trying to address the fundamental issue after 4 months. And then after a first round of feedback it gets shut down somehow out of nowhere. As a maintainer and reviewer, that is the kind of contributor I have a hard time taking seriously. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-04 12:30 ` Christoph Hellwig @ 2024-10-07 10:10 ` Javier González 2024-10-08 10:06 ` Hans Holmberg 2024-10-08 12:25 ` Christoph Hellwig 0 siblings, 2 replies; 86+ messages in thread From: Javier González @ 2024-10-07 10:10 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g On 04.10.2024 14:30, Christoph Hellwig wrote: >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote: >> So, considerign that file system _are_ able to use temperature hints and >> actually make them work, why don't we support FDP the same way we are >> supporting zones so that people can use it in production? > >Because apparently no one has tried it. It should be possible in theory, >but for example unless you have power of 2 reclaim unit size size it >won't work very well with XFS where the AGs/RTGs must be power of two >aligned in the LBA space, except by overprovisioning the LBA space vs >the capacity actually used. This is good. I think we should have at least a FS POC with data placement support to be able to drive conclusions on how the interface and requirements should be. Until we have that, we can support the use-cases that we know customers are asking for, i.e., block-level hints through the existing temperature API. > >> I agree that down the road, an interface that allows hints (many more >> than 5!) is needed. And in my opinion, this interface should not have >> semintics attached to it, just a hint ID, #hints, and enough space to >> put 100s of them to support storage node deployments. But this needs to >> come from the users of the hints / zones / streams / etc, not from >> us vendors. We do not have neither details on how they deploy these >> features at scale, nor the workloads to validate the results. 
Anything >> else will probably just continue polluting the storage stack with more >> interfaces that are not used and add to the problem of data placement >> fragmentation. > >Please always mention what layer you are talking about. At the syscall >level the temperature hints are doing quite ok. A full stream separation >would obviously be a lot better, as would be communicating the zone / >reclaim unit / etc size. I mean at the syscall level. But as mentioned above, we need to be very sure that we have a clear use-case for that. If we continue seeing hints being used in NVMe or other protocols, and the number increases significantly, we can deal with it later on. > >As an interface to a driver that doesn't natively speak temperature >hints, on the other hand, it doesn't work at all. > >> The issue is that the first series of this patch, which is as simple as >> it gets, hit the list in May. Since then we have been down paths that lead >> nowhere. So the line between real technical feedback that leads to >> a feature being merged and technical misdirection that just keeps people >> busy becomes very thin. In the whole data placement effort, we have >> been down this path many times, unfortunately... > >Well, the previous round was the first one actually trying to address the >fundamental issue after 4 months. And then after a first round of feedback >it gets shut down somehow out of nowhere. As a maintainer and reviewer, >that is the kind of contributor I have a hard time taking seriously. I am not sure I understand what you mean in the last sentence, so I will not respond filling blanks with a bad interpretation. 
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-07 10:10 ` Javier González @ 2024-10-08 10:06 ` Hans Holmberg 2024-10-09 14:36 ` Javier Gonzalez 2024-10-08 12:25 ` Christoph Hellwig 1 sibling, 1 reply; 86+ messages in thread From: Hans Holmberg @ 2024-10-08 10:06 UTC (permalink / raw) To: Javier González Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g On Mon, Oct 7, 2024 at 12:10 PM Javier González <[email protected]> wrote: > > On 04.10.2024 14:30, Christoph Hellwig wrote: > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote: > >> So, considerign that file system _are_ able to use temperature hints and > >> actually make them work, why don't we support FDP the same way we are > >> supporting zones so that people can use it in production? > > > >Because apparently no one has tried it. It should be possible in theory, > >but for example unless you have power of 2 reclaim unit size size it > >won't work very well with XFS where the AGs/RTGs must be power of two > >aligned in the LBA space, except by overprovisioning the LBA space vs > >the capacity actually used. > > This is good. I think we should have at least a FS POC with data > placement support to be able to drive conclusions on how the interface > and requirements should be. Until we have that, we can support the > use-cases that we know customers are asking for, i.e., block-level hints > through the existing temperature API. > > > > >> I agree that down the road, an interface that allows hints (many more > >> than 5!) is needed. And in my opinion, this interface should not have > >> semintics attached to it, just a hint ID, #hints, and enough space to > >> put 100s of them to support storage node deployments. 
But this needs to > >> come from the users of the hints / zones / streams / etc, not from > >> us vendors. We do not have neither details on how they deploy these > >> features at scale, nor the workloads to validate the results. Anything > >> else will probably just continue polluting the storage stack with more > >> interfaces that are not used and add to the problem of data placement > >> fragmentation. > > > >Please always mentioned what layer you are talking about. At the syscall > >level the temperatur hints are doing quite ok. A full stream separation > >would obviously be a lot better, as would be communicating the zone / > >reclaim unit / etc size. > > I mean at the syscall level. But as mentioned above, we need to be very > sure that we have a clear use-case for that. If we continue seeing hints > being use in NVMe or other protocols, and the number increase > significantly, we can deal with it later on. > > > > >As an interface to a driver that doesn't natively speak temperature > >hint on the other hand it doesn't work at all. > > > >> The issue is that the first series of this patch, which is as simple as > >> it gets, hit the list in May. Since then we are down paths that lead > >> nowhere. So the line between real technical feedback that leads to > >> a feature being merged, and technical misleading to make people be a > >> busy bee becomes very thin. In the whole data placement effort, we have > >> been down this path many times, unfortunately... > > > >Well, the previous round was the first one actually trying to address the > >fundamental issue after 4 month. And then after a first round of feedback > >it gets shutdown somehow out of nowhere. As a maintainer and review that > >is the kinda of contributors I have a hard time taking serious. > > I am not sure I understand what you mean in the last sentece, so I will > not respond filling blanks with a bad interpretation. 
> > In summary, what we are asking for is to take the patches that cover the > > current use-case, and work together on what might be needed for better > > FS support. For this, it seems you and Hans have a good idea of what you > > want to have based on XFS. We can help review or do part of the work, > > but trying to guess our way will only delay existing customers using > > existing HW. After reading the whole thread, I end up wondering why we need to rush the support for a single use case through instead of putting the architecture in place for properly supporting this new type of hardware from the start throughout the stack. Even for user space consumers of raw block devices, is the last version of the patch set good enough? * It severely cripples the data separation capabilities, as only a handful of data placement buckets are supported * It just won't work if there is more than one user application per storage device, as different applications' data streams will be mixed at the nvme driver level. While Christoph has already outlined what would be desirable from a file system point of view, I don't have the answer to what would be the overall best design for FDP. I would like to say that it looks to me like we need to consider more than the early adoption use cases and make sure we make the most of the hardware capabilities via logical abstractions that would be compatible with a wider range of storage devices. Figuring out the right way forward is tricky, but why not just let it take the time that is needed to sort this out while early users explore how to use FDP drives and share the results? ^ permalink raw reply [flat|nested] 86+ messages in thread
* RE: [PATCH v7 0/3] FDP and per-io hints 2024-10-08 10:06 ` Hans Holmberg @ 2024-10-09 14:36 ` Javier Gonzalez 2024-10-10 6:40 ` Hans Holmberg 0 siblings, 1 reply; 86+ messages in thread From: Javier Gonzalez @ 2024-10-09 14:36 UTC (permalink / raw) To: Hans Holmberg Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] > -----Original Message----- > From: Hans Holmberg <[email protected]> > Sent: Tuesday, October 8, 2024 12:07 PM > To: Javier Gonzalez <[email protected]> > Cc: Christoph Hellwig <[email protected]>; Jens Axboe <[email protected]>; Martin K. > Petersen <[email protected]>; Keith Busch <[email protected]>; > Kanchan Joshi <[email protected]>; [email protected]; [email protected]; > [email protected]; [email protected]; [email protected]; [email protected]; > [email protected]; [email protected]; [email protected]; > [email protected]; [email protected]; linux- > [email protected]; [email protected]; [email protected]; > [email protected]; [email protected]; [email protected] > Subject: Re: [PATCH v7 0/3] FDP and per-io hints > > On Mon, Oct 7, 2024 at 12:10 PM Javier González <[email protected]> > wrote: > > > > On 04.10.2024 14:30, Christoph Hellwig wrote: > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote: > > >> So, considerign that file system _are_ able to use temperature hints and > > >> actually make them work, why don't we support FDP the same way we are > > >> supporting zones so that people can use it in production? > > > > > >Because apparently no one has tried it. 
It should be possible in theory, > > >but for example unless you have power of 2 reclaim unit size size it > > >won't work very well with XFS where the AGs/RTGs must be power of two > > >aligned in the LBA space, except by overprovisioning the LBA space vs > > >the capacity actually used. > > > > This is good. I think we should have at least a FS POC with data > > placement support to be able to drive conclusions on how the interface > > and requirements should be. Until we have that, we can support the > > use-cases that we know customers are asking for, i.e., block-level hints > > through the existing temperature API. > > > > > > > >> I agree that down the road, an interface that allows hints (many more > > >> than 5!) is needed. And in my opinion, this interface should not have > > >> semintics attached to it, just a hint ID, #hints, and enough space to > > >> put 100s of them to support storage node deployments. But this needs to > > >> come from the users of the hints / zones / streams / etc, not from > > >> us vendors. We do not have neither details on how they deploy these > > >> features at scale, nor the workloads to validate the results. Anything > > >> else will probably just continue polluting the storage stack with more > > >> interfaces that are not used and add to the problem of data placement > > >> fragmentation. > > > > > >Please always mentioned what layer you are talking about. At the syscall > > >level the temperatur hints are doing quite ok. A full stream separation > > >would obviously be a lot better, as would be communicating the zone / > > >reclaim unit / etc size. > > > > I mean at the syscall level. But as mentioned above, we need to be very > > sure that we have a clear use-case for that. If we continue seeing hints > > being use in NVMe or other protocols, and the number increase > > significantly, we can deal with it later on. 
> > > > > >As an interface to a driver that doesn't natively speak temperature > > >hints, on the other hand, it doesn't work at all. > > > > > >> The issue is that the first series of this patch, which is as simple as > > >> it gets, hit the list in May. Since then we have been down paths that lead > > >> nowhere. So the line between real technical feedback that leads to > > >> a feature being merged and technical misdirection that just keeps people > > >> busy becomes very thin. In the whole data placement effort, we have > > >> been down this path many times, unfortunately... > > > > > >Well, the previous round was the first one actually trying to address the > > >fundamental issue after 4 months. And then after a first round of feedback > > >it gets shut down somehow out of nowhere. As a maintainer and reviewer, > > >that is the kind of contributor I have a hard time taking seriously. > > > > I am not sure I understand what you mean in the last sentence, so I will > > not respond filling blanks with a bad interpretation. > > > > In summary, what we are asking for is to take the patches that cover the > > current use-case, and work together on what might be needed for better > > FS support. For this, it seems you and Hans have a good idea of what you > > want to have based on XFS. We can help review or do part of the work, > > but trying to guess our way will only delay existing customers using > > existing HW. > > After reading the whole thread, I end up wondering why we need to rush the > support for a single use case through instead of putting the architecture > in place for properly supporting this new type of hardware from the start > throughout the stack. This is not a rush. We have been supporting this use case through passthru for over half a year, with code already upstream in CacheLib. This is mature enough to move into the block layer, which is what the end user wants to do either way. This is, though, a very good point. 
This is why we upstreamed passthru at the time; so people can experiment, validate, and upstream only when there is a clear path. > > Even for user space consumers of raw block devices, is the last version > of the patch set good enough? > > * It severely cripples the data separation capabilities as only a handful of > data placement buckets are supported I could understand from your presentation at LPC, and later from looking at the code that is available, that you have been successful at getting good results with the existing interface in XFS. The mapping from the temperature semantics to zones (no semantics) is exactly the same as we are doing with FDP. Not having to change either in-kernel or user-space structures is great. > > * It just won't work if there is more than one user application per storage > device as different applications' data streams will be mixed at the nvme > driver level. For now this use-case is not clear. Folks working on it are using passthru. When we have a clearer understanding of what is needed, we might need changes in the kernel. If you see a need for this on the work that you are doing, by all means, please send patches. As I said at LPC, we can work together on this. > > While Christoph has already outlined what would be desirable from a > file system point of view, I don't have the answer to what would be the overall > best design for FDP. I would like to say that it looks to me like we need to > consider more than the early adoption use cases and make sure we > make the most of the hardware capabilities via logical abstractions that > would be compatible with a wider range of storage devices. > > Figuring the right way forward is tricky, but why not just let it take the time > that is needed to sort this out while early users explore how to use FDP > drives and share the results? I agree that we might need a new interface to support more hints, beyond the temperatures. Or maybe not.
We will not know until someone comes up with a use case. We have made the mistake in the past of treating internal research as upstreamable work. I can now see that this simply complicates the in-kernel and user-space APIs. The existing API is usable and requires no changes. There is hardware. There are customers. There are applications with upstream support which have been tested with passthru (the early results you mention). And the wiring to NVMe is _very_ simple. There is no reason not to take this in, and then we will see what new interfaces we might need in the future. I would much rather spend time discussing ideas with you and others on a potential future API than arguing about the validity of an _existing_ one. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-09 14:36 ` Javier Gonzalez @ 2024-10-10 6:40 ` Hans Holmberg 2024-10-10 7:13 ` Javier Gonzalez 0 siblings, 1 reply; 86+ messages in thread From: Hans Holmberg @ 2024-10-10 6:40 UTC (permalink / raw) To: Javier Gonzalez Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <[email protected]> wrote: > > > > > -----Original Message----- > > From: Hans Holmberg <[email protected]> > > Sent: Tuesday, October 8, 2024 12:07 PM > > To: Javier Gonzalez <[email protected]> > > Cc: Christoph Hellwig <[email protected]>; Jens Axboe <[email protected]>; Martin K. > > Petersen <[email protected]>; Keith Busch <[email protected]>; > > Kanchan Joshi <[email protected]>; [email protected]; [email protected]; > > [email protected]; [email protected]; [email protected]; [email protected]; > > [email protected]; [email protected]; [email protected]; > > [email protected]; [email protected]; linux- > > [email protected]; [email protected]; [email protected]; > > [email protected]; [email protected]; [email protected] > > Subject: Re: [PATCH v7 0/3] FDP and per-io hints > > > > On Mon, Oct 7, 2024 at 12:10 PM Javier González <[email protected]> > > wrote: > > > > > > On 04.10.2024 14:30, Christoph Hellwig wrote: > > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote: > > > >> So, considerign that file system _are_ able to use temperature hints and > > > >> actually make them work, why don't we support FDP the same way we are > > > >> supporting zones so that people can use it in production? 
> > > > > > > >Because apparently no one has tried it. It should be possible in theory, > > > >but for example unless you have power of 2 reclaim unit size size it > > > >won't work very well with XFS where the AGs/RTGs must be power of two > > > >aligned in the LBA space, except by overprovisioning the LBA space vs > > > >the capacity actually used. > > > > > > This is good. I think we should have at least a FS POC with data > > > placement support to be able to drive conclusions on how the interface > > > and requirements should be. Until we have that, we can support the > > > use-cases that we know customers are asking for, i.e., block-level hints > > > through the existing temperature API. > > > > > > > > > > >> I agree that down the road, an interface that allows hints (many more > > > >> than 5!) is needed. And in my opinion, this interface should not have > > > >> semintics attached to it, just a hint ID, #hints, and enough space to > > > >> put 100s of them to support storage node deployments. But this needs to > > > >> come from the users of the hints / zones / streams / etc, not from > > > >> us vendors. We do not have neither details on how they deploy these > > > >> features at scale, nor the workloads to validate the results. Anything > > > >> else will probably just continue polluting the storage stack with more > > > >> interfaces that are not used and add to the problem of data placement > > > >> fragmentation. > > > > > > > >Please always mentioned what layer you are talking about. At the syscall > > > >level the temperatur hints are doing quite ok. A full stream separation > > > >would obviously be a lot better, as would be communicating the zone / > > > >reclaim unit / etc size. > > > > > > I mean at the syscall level. But as mentioned above, we need to be very > > > sure that we have a clear use-case for that. 
If we continue seeing hints > > > being use in NVMe or other protocols, and the number increase > > > significantly, we can deal with it later on. > > > > > > > > > > >As an interface to a driver that doesn't natively speak temperature > > > >hint on the other hand it doesn't work at all. > > > > > > > >> The issue is that the first series of this patch, which is as simple as > > > >> it gets, hit the list in May. Since then we are down paths that lead > > > >> nowhere. So the line between real technical feedback that leads to > > > >> a feature being merged, and technical misleading to make people be a > > > >> busy bee becomes very thin. In the whole data placement effort, we have > > > >> been down this path many times, unfortunately... > > > > > > > >Well, the previous round was the first one actually trying to address the > > > >fundamental issue after 4 month. And then after a first round of feedback > > > >it gets shutdown somehow out of nowhere. As a maintainer and review that > > > >is the kinda of contributors I have a hard time taking serious. > > > > > > I am not sure I understand what you mean in the last sentece, so I will > > > not respond filling blanks with a bad interpretation. > > > > > > In summary, what we are asking for is to take the patches that cover the > > > current use-case, and work together on what might be needed for better > > > FS support. For this, it seems you and Hans have a good idea of what you > > > want to have based on XFS. We can help review or do part of the work, > > > but trying to guess our way will only delay existing customers using > > > existing HW. > > > > After reading the whole thread, I end up wondering why we need to rush the > > support for a single use case through instead of putting the architecture > > in place for properly supporting this new type of hardware from the start > > throughout the stack. > > This is not a rush. 
We have been supporting this use case through passthru for > over 1/2 year with code already upstream in Cachelib. This is mature enough > to move into the block layer, which is what the end user wants to do either way. > > This is, though, a very good point. This is why we upstreamed passthru at the > time; so people can experiment, validate, and upstream only when there is a > clear path. > > > > > > Even for user space consumers of raw block devices, is the last version > > of the patch set good enough? > > > > * It severely cripples the data separation capabilities as only a handful of > > data placement buckets are supported > > I could understand from your presentation at LPC, and later from looking at the code that > is available, that you have been successful at getting good results with the existing > interface in XFS. The mapping from the temperature semantics to zones (no semantics) > is exactly the same as we are doing with FDP. Not having to change either in-kernel or user-space > structures is great. No, we don't map data directly to zones using lifetime hints. In fact, lifetime hints contribute only a relatively small part (~10% extra write amp reduction, see the rocksdb benchmark results). Segregating data by file is the most important part of the data placement heuristic, at least for this type of workload. > > > > > > * It just won't work if there is more than one user application per storage > > device as different applications' data streams will be mixed at the nvme > > driver level. > > For now this use-case is not clear. Folks working on it are using passthru. When we > have a clearer understanding of what is needed, we might need changes in the kernel. > > If you see a need for this on the work that you are doing, by all means, please send patches. > As I said at LPC, we can work together on this.
> > > > > While Christoph has already outlined what would be desirable from a > > file system point of view, I don't have the answer to what would be the overall > > best design for FDP. I would like to say that it looks to me like we need to > > consider more than the early adoption use cases and make sure we > > make the most of the hardware capabilities via logical abstractions that > > would be compatible with a wider range of storage devices. > > > > Figuring the right way forward is tricky, but why not just let it take the time > > that is needed to sort this out while early users explore how to use FDP > > drives and share the results? > > I agree that we might need a new interface to support more hints, beyond the temperatures. > Or maybe not. We will not know until someone comes up with a use case. We have made the > mistake in the past of treating internal research as upstreamable work. I can now see that > this simply complicates the in-kernel and user-space APIs. > > The existing API is usable and requires no changes. There is hardware. There are customers. > There are applications with upstream support which have been tested with passthru (the > early results you mention). And the wiring to NVMe is _very_ simple. There is no reason > not to take this in, and then we will see what new interfaces we might need in the future. > > I would much rather spend time discussing ideas with you and others on a potential > future API than arguing about the validity of an _existing_ one. > Yes, but while FDP support could be improved later on (happy to help if that'll be the case), I'm just afraid that less work now defining the way data placement is exposed is going to result in a bigger mess later when more use cases are considered. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-10 6:40 ` Hans Holmberg @ 2024-10-10 7:13 ` Javier Gonzalez 2024-10-10 9:20 ` Christoph Hellwig 2024-10-10 10:46 ` Hans Holmberg 0 siblings, 2 replies; 86+ messages in thread From: Javier Gonzalez @ 2024-10-10 7:13 UTC (permalink / raw) To: Hans Holmberg Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] On 10.10.2024 08:40, Hans Holmberg wrote: >On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <[email protected]> wrote: >> >> >> >> > -----Original Message----- >> > From: Hans Holmberg <[email protected]> >> > Sent: Tuesday, October 8, 2024 12:07 PM >> > To: Javier Gonzalez <[email protected]> >> > Cc: Christoph Hellwig <[email protected]>; Jens Axboe <[email protected]>; Martin K. 
>> > Petersen <[email protected]>; Keith Busch <[email protected]>; >> > Kanchan Joshi <[email protected]>; [email protected]; [email protected]; >> > [email protected]; [email protected]; [email protected]; [email protected]; >> > [email protected]; [email protected]; [email protected]; >> > [email protected]; [email protected]; linux- >> > [email protected]; [email protected]; [email protected]; >> > [email protected]; [email protected]; [email protected] >> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints >> > >> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <[email protected]> >> > wrote: >> > > >> > > On 04.10.2024 14:30, Christoph Hellwig wrote: >> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote: >> > > >> So, considerign that file system _are_ able to use temperature hints and >> > > >> actually make them work, why don't we support FDP the same way we are >> > > >> supporting zones so that people can use it in production? >> > > > >> > > >Because apparently no one has tried it. It should be possible in theory, >> > > >but for example unless you have power of 2 reclaim unit size size it >> > > >won't work very well with XFS where the AGs/RTGs must be power of two >> > > >aligned in the LBA space, except by overprovisioning the LBA space vs >> > > >the capacity actually used. >> > > >> > > This is good. I think we should have at least a FS POC with data >> > > placement support to be able to drive conclusions on how the interface >> > > and requirements should be. Until we have that, we can support the >> > > use-cases that we know customers are asking for, i.e., block-level hints >> > > through the existing temperature API. >> > > >> > > > >> > > >> I agree that down the road, an interface that allows hints (many more >> > > >> than 5!) is needed. 
And in my opinion, this interface should not have >> > > >> semintics attached to it, just a hint ID, #hints, and enough space to >> > > >> put 100s of them to support storage node deployments. But this needs to >> > > >> come from the users of the hints / zones / streams / etc, not from >> > > >> us vendors. We do not have neither details on how they deploy these >> > > >> features at scale, nor the workloads to validate the results. Anything >> > > >> else will probably just continue polluting the storage stack with more >> > > >> interfaces that are not used and add to the problem of data placement >> > > >> fragmentation. >> > > > >> > > >Please always mentioned what layer you are talking about. At the syscall >> > > >level the temperatur hints are doing quite ok. A full stream separation >> > > >would obviously be a lot better, as would be communicating the zone / >> > > >reclaim unit / etc size. >> > > >> > > I mean at the syscall level. But as mentioned above, we need to be very >> > > sure that we have a clear use-case for that. If we continue seeing hints >> > > being use in NVMe or other protocols, and the number increase >> > > significantly, we can deal with it later on. >> > > >> > > > >> > > >As an interface to a driver that doesn't natively speak temperature >> > > >hint on the other hand it doesn't work at all. >> > > > >> > > >> The issue is that the first series of this patch, which is as simple as >> > > >> it gets, hit the list in May. Since then we are down paths that lead >> > > >> nowhere. So the line between real technical feedback that leads to >> > > >> a feature being merged, and technical misleading to make people be a >> > > >> busy bee becomes very thin. In the whole data placement effort, we have >> > > >> been down this path many times, unfortunately... >> > > > >> > > >Well, the previous round was the first one actually trying to address the >> > > >fundamental issue after 4 month. 
And then after a first round of feedback >> > > >it gets shutdown somehow out of nowhere. As a maintainer and review that >> > > >is the kinda of contributors I have a hard time taking serious. >> > > >> > > I am not sure I understand what you mean in the last sentece, so I will >> > > not respond filling blanks with a bad interpretation. >> > > >> > > In summary, what we are asking for is to take the patches that cover the >> > > current use-case, and work together on what might be needed for better >> > > FS support. For this, it seems you and Hans have a good idea of what you >> > > want to have based on XFS. We can help review or do part of the work, >> > > but trying to guess our way will only delay existing customers using >> > > existing HW. >> > >> > After reading the whole thread, I end up wondering why we need to rush the >> > support for a single use case through instead of putting the architecture >> > in place for properly supporting this new type of hardware from the start >> > throughout the stack. >> >> This is not a rush. We have been supporting this use case through passthru for >> over 1/2 year with code already upstream in Cachelib. This is mature enough as >> to move into the block layer, which is what the end user wants to do either way. >> >> This is though a very good point. This is why we upstreamed passthru at the >> time; so people can experiment, validate, and upstream only when there is a >> clear path. >> >> > >> > Even for user space consumers of raw block devices, is the last version >> > of the patch set good enough? >> > >> > * It severely cripples the data separation capabilities as only a handful of >> > data placement buckets are supported >> >> I could understand from your presentation at LPC, and late looking at the code that >> is available that you have been successful at getting good results with the existing >> interface in XFS. 
The mapping from the temperature semantics to zones (no semantics) >> is exactly the same as we are doing with FDP. Not having to change either in-kernel or user-space >> structures is great. > >No, we don't map data directly to zones using lifetime hints. In fact, >lifetime hints contribute only a >relatively small part (~10% extra write amp reduction, see the >rocksdb benchmark results). >Segregating data by file is the most important part of the data >placement heuristic, at least >for this type of workload. Is this because RocksDB already does segregation per file itself? Are you doing something specific on XFS or using your knowledge of RocksDB to map files with an "unwritten" protocol in the middle? In this context, we have collected data both using FDP natively in RocksDB and using the temperatures. Both look very good, because both are initiated by RocksDB, and the FS just passes the hints directly to the driver. I ask this to understand if this is the FS responsibility or the application's one. Our work points more to letting applications use the hints (as the use-cases are power users, like RocksDB). I agree with you that an FS could potentially make an improvement for legacy applications - we have not focused much on these though, so I trust your insights on it. >> >> > >> > * It just won't work if there is more than one user application per storage >> > device as different applications' data streams will be mixed at the nvme >> > driver level. >> >> For now this use-case is not clear. Folks working on it are using passthru. When we >> have a clearer understanding of what is needed, we might need changes in the kernel. >> >> If you see a need for this on the work that you are doing, by all means, please send patches. >> As I said at LPC, we can work together on this.
>> >> > >> > While Christoph has already outlined what would be desirable from a >> > file system point of view, I don't have the answer to what would be the overall >> > best design for FDP. I would like to say that it looks to me like we need to >> > consider more than the early adoption use cases and make sure we >> > make the most of the hardware capabilities via logical abstractions that >> > would be compatible with a wider range of storage devices. >> > >> > Figuring the right way forward is tricky, but why not just let it take the time >> > that is needed to sort this out while early users explore how to use FDP >> > drives and share the results? >> >> I agree that we might need a new interface to support more hints, beyond the temperatures. >> Or maybe not. We will not know until someone comes up with a use case. We have made the >> mistake in the past of treating internal research as upstreamable work. I can now see that >> this simply complicates the in-kernel and user-space APIs. >> >> The existing API is usable and requires no changes. There is hardware. There are customers. >> There are applications with upstream support which have been tested with passthru (the >> early results you mention). And the wiring to NVMe is _very_ simple. There is no reason >> not to take this in, and then we will see what new interfaces we might need in the future. >> >> I would much rather spend time discussing ideas with you and others on a potential >> future API than arguing about the validity of an _existing_ one. >> > >Yes, but while FDP support could be improved later on (happy to help if >that'll be the case), >I'm just afraid that less work now defining the way data placement is >exposed is going to >result in a bigger mess later when more use cases are considered. Please see the message I wrote in response on the other thread. I hope it is a way to move forward and actually work together on this. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-10 7:13 ` Javier Gonzalez @ 2024-10-10 9:20 ` Christoph Hellwig 2024-10-10 12:22 ` Javier Gonzalez 2024-10-10 10:46 ` Hans Holmberg 1 sibling, 1 reply; 86+ messages in thread From: Christoph Hellwig @ 2024-10-10 9:20 UTC (permalink / raw) To: Javier Gonzalez Cc: Hans Holmberg, Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] On Thu, Oct 10, 2024 at 09:13:27AM +0200, Javier Gonzalez wrote: > Is this because RocksDB already does seggregation per file itself? Are > you doing something specific on XFS or using your knoledge on RocksDB to > map files with an "unwritten" protocol in the midde? XFS doesn't really do anything smart at all except for grouping files with similar temperatures, but Hans can probably explain it in more detail. So yes, this relies on the application doing the data separation and using the most logical vehicle for it: files. > > In this context, we have collected data both using FDP natively in > RocksDB and using the temperatures. Both look very good, because both > are initiated by RocksDB, and the FS just passes the hints directly > to the driver. > > I ask this to understand if this is the FS responsibility or the > application's one. Our work points more to letting applications use the > hints (as the use-cases are power users, like RocksDB). I agree with you > that a FS could potentially make an improvement for legacy applications > - we have not focused much on these though, so I trust you insights on > it. As mentioned multiple times before in this thread this absolutely depends on the abstraction level of the application. 
If the application works on a raw device without a file system it obviously needs very low-level control. And in my opinion passthrough is by far the best interface for that level of control. If the application is using a file system there is no better basic-level abstraction than a file, which can then be enhanced with a relatively small amount of additional information going both ways: the file system telling the application what good file sizes and write patterns are, and the application telling the file system what files are good candidates to merge into the same write stream if the file system has to merge multiple actively written-to files into a write stream. Trying to do low-level per-I/O hints on top of a file system is a recipe for trouble because you now have two entities fighting over placement control. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-10 9:20 ` Christoph Hellwig @ 2024-10-10 12:22 ` Javier Gonzalez 2024-10-11 8:56 ` Christoph Hellwig 0 siblings, 1 reply; 86+ messages in thread From: Javier Gonzalez @ 2024-10-10 12:22 UTC (permalink / raw) To: Christoph Hellwig Cc: Hans Holmberg, Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] On 10.10.2024 11:20, Christoph Hellwig wrote: >On Thu, Oct 10, 2024 at 09:13:27AM +0200, Javier Gonzalez wrote: >> Is this because RocksDB already does seggregation per file itself? Are >> you doing something specific on XFS or using your knoledge on RocksDB to >> map files with an "unwritten" protocol in the midde? > >XFS doesn't really do anything smart at all except for grouping files >with similar temperatures, but Hans can probably explain it in more >detail. So yes, this relies on the application doing the data separation >and using the most logical vehicle for it: files. This makes sense. Agree. > >> >> In this context, we have collected data both using FDP natively in >> RocksDB and using the temperatures. Both look very good, because both >> are initiated by RocksDB, and the FS just passes the hints directly >> to the driver. >> >> I ask this to understand if this is the FS responsibility or the >> application's one. Our work points more to letting applications use the >> hints (as the use-cases are power users, like RocksDB). I agree with you >> that a FS could potentially make an improvement for legacy applications >> - we have not focused much on these though, so I trust you insights on >> it. 
> >As mentioned multiple times before in this thread this absolutely >depends on the abstraction level of the application. If the application >works on a raw device without a file system it obviously needs very >low-level control. And in my opinion passthrough is by far the best >interface for that level of control. Passthru is great for prototyping and getting insights on end-to-end applicability. We see though that it is difficult to get a full solution based on it, unless people implement a user-space layer tailored to their use-case (e.g., a version of SPDK's bdev). After the POC phase, most folks that can use passthru prefer to move to block - with a validated use-case it should be easier to get things upstream. This is exactly where we are now. >If the application is using a >file system there is no better basic-level abstraction than a file, >which can then be enhanced with a relatively small amount of additional >information going both ways: the file system telling the application >what good file sizes and write patterns are, and the application telling >the file system what files are good candidates to merge into the same >write stream if the file system has to merge multiple actively written >to files into a write stream. Trying to do low-level per-I/O hints >on top of a file system is a recipe for trouble because you now have >two entities fighting over placement control. For files, I agree with you. If you saw the comments from Christian on the inode space, there are a few plumbing challenges. Do you have any patches we could look at? ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-10 12:22 ` Javier Gonzalez @ 2024-10-11 8:56 ` Christoph Hellwig 2024-10-11 12:21 ` Javier Gonzalez 0 siblings, 1 reply; 86+ messages in thread From: Christoph Hellwig @ 2024-10-11 8:56 UTC (permalink / raw) To: Javier Gonzalez Cc: Christoph Hellwig, Hans Holmberg, Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] On Thu, Oct 10, 2024 at 02:22:32PM +0200, Javier Gonzalez wrote: > Passthru is great for prototyping and getting insights on end-to-end > applicability. We see though that it is difficult to get a full solution > based on it, unless people implement a user-space layer tailored to their > use-case (e.g., a version of SPDK's bdev). After the POC phase, most folks > that can use passthru prefer to move to block - with a validated > use-case it should be easier to get things upstream. > > This is exactly where we are now. That's a lot of marketing babble :) What exact thing is missing from the passthrough interface when using, say, SPDK over io_uring? > If you saw the comments from Christian on the inode space, there are a > few plumbing challenges. Do you have any patches we could look at? I'm not sure what you refer to here. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-11 8:56 ` Christoph Hellwig @ 2024-10-11 12:21 ` Javier Gonzalez 2024-10-11 16:59 ` Keith Busch 0 siblings, 1 reply; 86+ messages in thread From: Javier Gonzalez @ 2024-10-11 12:21 UTC (permalink / raw) To: Christoph Hellwig Cc: Hans Holmberg, Jens Axboe, Christian Brauner, Martin K. Petersen, Keith Busch, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] On 11.10.2024 10:56, Christoph Hellwig wrote: >On Thu, Oct 10, 2024 at 02:22:32PM +0200, Javier Gonzalez wrote: >> Passthru is great for prototyping and getting insights on end-to-end >> applicability. We see though that it is difficult to get a full solution >> based on it, unless people implement a user-space layer tailored to their >> use-case (e.g., a version of SPDK's bdev). After the POC phase, most folks >> that can use passthru prefer to move to block - with a validated >> use-case it should be easier to get things upstream. >> >> This is exactly where we are now. > >That's a lot of marketing babble :) What exact thing is missing >from the passthrough interface when using, say, SPDK over io_uring? The block layer provides a lot of functionality that passthru cannot provide. A simple example would be splits. You know this :) I am sure Jens and Keith can give you more specifics on their particular reasons. > >> If you saw the comments from Christian on the inode space, there are a >> few plumbing challenges. Do you have any patches we could look at? > >I'm not sure what you refer to here. This from Christian: https://lore.kernel.org/all/20240903-erfassen-bandmitglieder-32dfaeee66b2@brauner/ ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-11 12:21 ` Javier Gonzalez @ 2024-10-11 16:59 ` Keith Busch 0 siblings, 0 replies; 86+ messages in thread From: Keith Busch @ 2024-10-11 16:59 UTC (permalink / raw) To: Javier Gonzalez Cc: Christoph Hellwig, Hans Holmberg, Jens Axboe, Christian Brauner, Martin K. Petersen, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] On Fri, Oct 11, 2024 at 02:21:02PM +0200, Javier Gonzalez wrote: > > That's a lot of marketing babble :) What exact thing is missing > > from the passthrough interface when using say SPDK over io_uring? > > The block layer provides a lot of functionality that passthru cannot > provide. A simple example would be splits. You know this :) > > I am sure Jens and Keith can give you more specifics on their particular > reasons. Splitting, merging, cgroups, iostats, error retries, failover. We have a recent change staged in 6.13 to try to count iostats on passthrough, but it doesn't do discards. Passthrough to partitions requires root access and the slow CAP_SYS_ADMIN check on every IO. Page cache. Though that may not readily work with this io_uring proposal as-is either. The generic IO path is safe against changing formats; passthrough isn't. And it's just generally more work to port existing applications to use the passthrough interface. It causes duplicate solutions to scenarios the kernel already handles. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-10 7:13 ` Javier Gonzalez 2024-10-10 9:20 ` Christoph Hellwig @ 2024-10-10 10:46 ` Hans Holmberg [not found] ` <CGME20241010122734eucas1p1e20a5263a4d69db81b50b8b03608fad1@eucas1p1.samsung.com> 1 sibling, 1 reply; 86+ messages in thread From: Hans Holmberg @ 2024-10-10 10:46 UTC (permalink / raw) To: Javier Gonzalez Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] On Thu, Oct 10, 2024 at 9:13 AM Javier Gonzalez <[email protected]> wrote: > > On 10.10.2024 08:40, Hans Holmberg wrote: > >On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <[email protected]> wrote: > >> > >> > >> > >> > -----Original Message----- > >> > From: Hans Holmberg <[email protected]> > >> > Sent: Tuesday, October 8, 2024 12:07 PM > >> > To: Javier Gonzalez <[email protected]> > >> > Cc: Christoph Hellwig <[email protected]>; Jens Axboe <[email protected]>; Martin K. 
> >> > Petersen <[email protected]>; Keith Busch <[email protected]>; > >> > Kanchan Joshi <[email protected]>; [email protected]; [email protected]; > >> > [email protected]; [email protected]; [email protected]; [email protected]; > >> > [email protected]; [email protected]; [email protected]; > >> > [email protected]; [email protected]; linux- > >> > [email protected]; [email protected]; [email protected]; > >> > [email protected]; [email protected]; [email protected] > >> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints > >> > > >> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <[email protected]> > >> > wrote: > >> > > > >> > > On 04.10.2024 14:30, Christoph Hellwig wrote: > >> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote: > >> > > >> So, considering that file systems _are_ able to use temperature hints and > >> > > >> actually make them work, why don't we support FDP the same way we are > >> > > >> supporting zones so that people can use it in production? > >> > > > > >> > > >Because apparently no one has tried it. It should be possible in theory, > >> > > >but for example unless you have a power of 2 reclaim unit size it > >> > > >won't work very well with XFS where the AGs/RTGs must be power of two > >> > > >aligned in the LBA space, except by overprovisioning the LBA space vs > >> > > >the capacity actually used. > >> > > > >> > > This is good. I think we should have at least a FS POC with data > >> > > placement support to be able to drive conclusions on how the interface > >> > > and requirements should be. Until we have that, we can support the > >> > > use-cases that we know customers are asking for, i.e., block-level hints > >> > > through the existing temperature API. > >> > > > >> > > > > >> > > >> I agree that down the road, an interface that allows hints (many more > >> > > >> than 5!) is needed. 
And in my opinion, this interface should not have > >> > > >> semantics attached to it, just a hint ID, #hints, and enough space to > >> > > >> put 100s of them to support storage node deployments. But this needs to > >> > > >> come from the users of the hints / zones / streams / etc, not from > >> > > >> us vendors. We have neither details on how they deploy these > >> > > >> features at scale, nor the workloads to validate the results. Anything > >> > > >> else will probably just continue polluting the storage stack with more > >> > > >> interfaces that are not used and add to the problem of data placement > >> > > >> fragmentation. > >> > > > > >> > > >Please always mention what layer you are talking about. At the syscall > >> > > >level the temperature hints are doing quite ok. A full stream separation > >> > > >would obviously be a lot better, as would be communicating the zone / > >> > > >reclaim unit / etc size. > >> > > > >> > > I mean at the syscall level. But as mentioned above, we need to be very > >> > > sure that we have a clear use-case for that. If we continue seeing hints > >> > > being used in NVMe or other protocols, and the number increases > >> > > significantly, we can deal with it later on. > >> > > > >> > > > > >> > > >As an interface to a driver that doesn't natively speak temperature > >> > > >hints on the other hand it doesn't work at all. > >> > > > > >> > > >> The issue is that the first series of this patch, which is as simple as > >> > > >> it gets, hit the list in May. Since then we are down paths that lead > >> > > >> nowhere. So the line between real technical feedback that leads to > >> > > >> a feature being merged, and technical misdirection that keeps people > >> > > >> busy becomes very thin. In the whole data placement effort, we have > >> > > >> been down this path many times, unfortunately... 
> >> > > > > >> > > >Well, the previous round was the first one actually trying to address the > >> > > >fundamental issue after 4 months. And then after a first round of feedback > >> > > >it gets shut down somehow out of nowhere. As a maintainer and reviewer, > >> > > >that is the kind of contributor I have a hard time taking seriously. > >> > > > >> > > I am not sure I understand what you mean in the last sentence, so I will > >> > > not respond filling blanks with a bad interpretation. > >> > > > >> > > In summary, what we are asking for is to take the patches that cover the > >> > > current use-case, and work together on what might be needed for better > >> > > FS support. For this, it seems you and Hans have a good idea of what you > >> > > want to have based on XFS. We can help review or do part of the work, > >> > > but trying to guess our way will only delay existing customers using > >> > > existing HW. > >> > > >> > After reading the whole thread, I end up wondering why we need to rush the > >> > support for a single use case through instead of putting the architecture > >> > in place for properly supporting this new type of hardware from the start > >> > throughout the stack. > >> > >> This is not a rush. We have been supporting this use case through passthru for > >> over 1/2 year with code already upstream in Cachelib. This is mature enough > >> to move into the block layer, which is what the end user wants to do either way. > >> > >> This is though a very good point. This is why we upstreamed passthru at the > >> time; so people can experiment, validate, and upstream only when there is a > >> clear path. > >> > >> > > >> > Even for user space consumers of raw block devices, is the last version > >> > of the patch set good enough? 
> >> > > >> > * It severely cripples the data separation capabilities as only a handful of > >> > data placement buckets are supported > >> > >> I could understand from your presentation at LPC, and later looking at the code that > >> is available, that you have been successful at getting good results with the existing > >> interface in XFS. The mapping from the temperature semantics to zones (no semantics) > >> is the exact same as we are doing with FDP. Not having to change either in-kernel or user-space > >> structures is great. > > > >No, we don't map data directly to zones using lifetime hints. In fact, > >lifetime hints contribute only a > >relatively small part (~10% extra write amp reduction, see the > >rocksdb benchmark results). > >Segregating data by file is the most important part of the data > >placement heuristic, at least > >for this type of workload. > > Is this because RocksDB already does segregation per file itself? Are > you doing something specific on XFS or using your knowledge of RocksDB to > map files with an "unwritten" protocol in the middle? Data placement by-file is based on the fact that the lifetimes of a file's data blocks are strongly correlated. When a file is deleted, all its blocks will be reclaimable at that point. This requires knowledge about the data placement buckets and works really well without any hints provided. The life-time hint heuristic I added on top is based on rocksdb statistics, but designed to be generic enough to work for a wider range of workloads (still need to validate this though - more work to be done). > > In this context, we have collected data both using FDP natively in > RocksDB and using the temperatures. Both look very good, because both > are initiated by RocksDB, and the FS just passes the hints directly > to the driver. > > I ask this to understand if this is the FS responsibility or the > application's one. 
Our work points more to letting applications use the > hints (as the use-cases are power users, like RocksDB). I agree with you > that a FS could potentially make an improvement for legacy applications > - we have not focused much on these though, so I trust your insights on > it. The big problem as I see it is that if applications are going to work well together on the same media we need a common placement implementation somewhere, and it seems pretty natural to make it part of filesystems to me. > > >> > >> > > >> > * It just won't work if there is more than one user application per storage > >> > device as different applications' data streams will be mixed at the nvme > >> > driver level.. > >> > >> For now this use-case is not clear. Folks working on it are using passthru. When we > >> have a more clear understanding of what is needed, we might need changes in the kernel. > >> > >> If you see a need for this on the work that you are doing, by all means, please send patches. > >> As I said at LPC, we can work together on this. > >> > >> > > >> > While Christoph has already outlined what would be desirable from a > >> > file system point of view, I don't have the answer to what would be the overall > >> > best design for FDP. I would like to say that it looks to me like we need to > >> > consider more than the early adoption use cases and make sure we > >> > make the most of the hardware capabilities via logical abstractions that > >> > would be compatible with a wider range of storage devices. > >> > > >> > Figuring out the right way forward is tricky, but why not just let it take the time > >> > that is needed to sort this out while early users explore how to use FDP > >> > drives and share the results? > >> > >> I agree that we might need a new interface to support more hints, beyond the temperatures. > >> Or maybe not. We would not know until someone comes with a use case. 
We have made the > >> mistake in the past of treating internal research as upstreamable work. I now can see that > >> this simply complicates the in-kernel and user-space APIs. > >> > >> The existing API is usable and requires no changes. There is hardware. There are customers. > >> There are applications with upstream support which have been tested with passthru (the > >> early results you mention). And the wiring to NVMe is _very_ simple. There is no reason > >> not to take this in, and then we will see what new interfaces we might need in the future. > >> > >> I would much rather spend time in discussing ideas with you and others on a potential > >> future API than arguing about the validity of an _existing_ one. > >> > > > >Yes, but while FDP support could be improved later on (happy to help if > >that'll be the case), > >I'm just afraid that less work now defining the way data placement is > >exposed is going to > >result in a bigger mess later when more use cases are considered. > > Please, see the message I responded on the other thread. I hope it is a > way to move forward and actually work together on this. ^ permalink raw reply [flat|nested] 86+ messages in thread
[parent not found: <CGME20241010122734eucas1p1e20a5263a4d69db81b50b8b03608fad1@eucas1p1.samsung.com>]
* Re: [PATCH v7 0/3] FDP and per-io hints [not found] ` <CGME20241010122734eucas1p1e20a5263a4d69db81b50b8b03608fad1@eucas1p1.samsung.com> @ 2024-10-10 12:27 ` Javier Gonzalez 2024-10-11 8:59 ` Christoph Hellwig 0 siblings, 1 reply; 86+ messages in thread From: Javier Gonzalez @ 2024-10-10 12:27 UTC (permalink / raw) To: Hans Holmberg Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] On 10.10.2024 12:46, Hans Holmberg wrote: >On Thu, Oct 10, 2024 at 9:13 AM Javier Gonzalez <[email protected]> wrote: >> >> On 10.10.2024 08:40, Hans Holmberg wrote: >> >On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <[email protected]> wrote: >> >> >> >> >> >> >> >> > -----Original Message----- >> >> > From: Hans Holmberg <[email protected]> >> >> > Sent: Tuesday, October 8, 2024 12:07 PM >> >> > To: Javier Gonzalez <[email protected]> >> >> > Cc: Christoph Hellwig <[email protected]>; Jens Axboe <[email protected]>; Martin K. 
>> >> > Petersen <[email protected]>; Keith Busch <[email protected]>; [...]
>> >> [...]
>> >> > >> >> > * It severely cripples the data separation capabilities as only a handful of >> >> > data placement buckets are supported >> >> >> >> I could understand from your presentation at LPC, and later looking at the code that >> >> is available, that you have been successful at getting good results with the existing >> >> interface in XFS. The mapping from the temperature semantics to zones (no semantics) >> >> is the exact same as we are doing with FDP. Not having to change either in-kernel or user-space >> >> structures is great. >> > >> >No, we don't map data directly to zones using lifetime hints. In fact, >> >lifetime hints contribute only a >> >relatively small part (~10% extra write amp reduction, see the >> >rocksdb benchmark results). >> >Segregating data by file is the most important part of the data >> >placement heuristic, at least >> >for this type of workload. >> >> Is this because RocksDB already does segregation per file itself? Are >> you doing something specific on XFS or using your knowledge of RocksDB to >> map files with an "unwritten" protocol in the middle? > >Data placement by-file is based on the fact that the lifetimes of a file's data >blocks are strongly correlated. When a file is deleted, all its blocks >will be reclaimable at that point. This requires knowledge about the >data placement buckets and works really well without any hints >provided. But we need hints to put files together. I believe you do this already, as no placement protocol gives you unlimited separation. >The life-time hint heuristic I added on top is based on rocksdb >statistics, but designed to be generic enough to work for a wider >range of workloads (still need to validate this though - more work to >be done). Maybe you can post some patches on the parts dedicated to the VFS level and user-space API (syscall or uring)? Following on the comment to Christoph, it would be good to have something tangible to work together on for the next stage of this support. 
> >> >> In this context, we have collected data both using FDP natively in >> RocksDB and using the temperatures. Both look very good, because both >> are initiated by RocksDB, and the FS just passes the hints directly >> to the driver. >> >> I ask this to understand if this is the FS responsibility or the >> application's one. Our work points more to letting applications use the >> hints (as the use-cases are power users, like RocksDB). I agree with you >> that a FS could potentially make an improvement for legacy applications >> - we have not focused much on these though, so I trust your insights on >> it. > >The big problem as I see it is that if applications are going to work >well together on the same media we need a common placement >implementation somewhere, and it seems pretty natural to make it part >of filesystems to me. For FS users, this makes a lot of sense. But we still need to cover applications using raw block. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-10 12:27 ` Javier Gonzalez @ 2024-10-11 8:59 ` Christoph Hellwig 0 siblings, 0 replies; 86+ messages in thread From: Christoph Hellwig @ 2024-10-11 8:59 UTC (permalink / raw) To: Javier Gonzalez Cc: Hans Holmberg, Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] On Thu, Oct 10, 2024 at 02:27:33PM +0200, Javier Gonzalez wrote: [full quote snipped, it would be really helpful to only quote what you actually reference per the usual email rules] >> Data placement by-file is based on that the lifetime of a file's data >> blocks are strongly correlated. When a file is deleted, all its blocks >> will be reclaimable at that point. This requires knowledge about the >> data placement buckets and works really well without any hints >> provided. > > But we need hints to put files together. I believe you do this already, > as no placement protocol gives you unlimited separation. The per-file temperature hints do a reasonable job for that. A fully developed version of the separate write streams submitted by Kanchan would probably do even better, but for now the per-file hints seem to be enough. > Maybe you can post some patches on the parts dedicated to the VFS level > and user-space API (syscall or uring)? It's just using the existing temperature hints. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-07 10:10 ` Javier González 2024-10-08 10:06 ` Hans Holmberg @ 2024-10-08 12:25 ` Christoph Hellwig 2024-10-08 14:44 ` Keith Busch 1 sibling, 1 reply; 86+ messages in thread From: Christoph Hellwig @ 2024-10-08 12:25 UTC (permalink / raw) To: Javier González Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Keith Busch, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g On Mon, Oct 07, 2024 at 12:10:11PM +0200, Javier González wrote: > In summary, what we are asking for is to take the patches that cover the > current use-case, and work together on what might be needed for better > FS support. And I really do not think it is a good idea. For one, it actually works against the stated intent of the FDP spec. Second, extending the hints to per-I/O in the io_uring patch is actively breaking the nice per-file I/O hint abstraction we have right now, and is really unsuitable when actually used on a file and not just a block device. And if you are only on a block device I think passthrough of some form is still the far better option, despite the problems with it mentioned by Keith. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-08 12:25 ` Christoph Hellwig @ 2024-10-08 14:44 ` Keith Busch 2024-10-09 9:28 ` Christoph Hellwig 0 siblings, 1 reply; 86+ messages in thread From: Keith Busch @ 2024-10-08 14:44 UTC (permalink / raw) To: Christoph Hellwig Cc: Javier González, Jens Axboe, Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g On Tue, Oct 08, 2024 at 02:25:35PM +0200, Christoph Hellwig wrote: > On Mon, Oct 07, 2024 at 12:10:11PM +0200, Javier González wrote: > > In summary, what we are asking for is to take the patches that cover the > > current use-case, and work together on what might be needed for better > > FS support. > > And I really do not think it is a good idea. For one it actually > works against the stated intent of the FDP spec. Second extending > the hints to per per-I/O in the io_uring patch is actively breaking > the nice per-file I/O hint abstraction we have right now, and is > really unsuitable when actually used on a file and not just a block > device. And if you are only on a block device I think passthrough > of some form is still the far better option, despite the problems > with it mentioned by Keith. Then let's just continue with patches 1 and 2. They introduce no new user or kernel APIs, and people have already reported improvements using it. Further, it is just a hint, it doesn't lock the kernel into anything that may hinder future innovations and enhancements. So let's unblock users and refocus *our* time to something more productive, please? And to be honest, the per-io hints for generic read/write use are only valuable for my users if metadata is also exposed to userspace. I know Javier's team is working on that in parallel, so per-io hints are a lower priority for me until that part settles. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-08 14:44 ` Keith Busch @ 2024-10-09 9:28 ` Christoph Hellwig 2024-10-09 15:06 ` Keith Busch 2024-10-09 16:28 ` Nitesh Shetty 0 siblings, 2 replies; 86+ messages in thread From: Christoph Hellwig @ 2024-10-09 9:28 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Javier González, Jens Axboe, Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote: > Then let's just continue with patches 1 and 2. They introduce no new > user or kernel APIs, and people have already reported improvements using > it. They are still not in any way actually exposing the FDP functionality in the standard though. How is your application going to align anything to the reclaim unit? Or is this another of the cases where as a hyperscaler you just "know" from the data sheet? But also given that the submitter completely disappeared and refuses to even discuss his patches, I think they are simply abandonware at this point anyway. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-09 9:28 ` Christoph Hellwig @ 2024-10-09 15:06 ` Keith Busch [not found] ` <CGME20241010070738eucas1p2057209e5f669f37ca586ad4a619289ed@eucas1p2.samsung.com> 2024-10-10 9:10 ` Christoph Hellwig 1 sibling, 2 replies; 86+ messages in thread From: Keith Busch @ 2024-10-09 15:06 UTC (permalink / raw) To: Christoph Hellwig Cc: Javier González, Jens Axboe, Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g On Wed, Oct 09, 2024 at 11:28:28AM +0200, Christoph Hellwig wrote: > On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote: > > Then let's just continue with patches 1 and 2. They introduce no new > > user or kernel APIs, and people have already reported improvements using > > it. > > They are still not in any way actually exposing the FDP functionality > in the standard though. How is your application going to align > anything to the reclaim unit? Or is this another of the cases where > as a hyperscaler you just "know" from the data sheet? As far as I know, this is an inconsequential spec detail that is not being considered by any applications testing this. And yet, the expected improvements are still there, so I don't see a point holding this up for that reason. ^ permalink raw reply [flat|nested] 86+ messages in thread
[parent not found: <CGME20241010070738eucas1p2057209e5f669f37ca586ad4a619289ed@eucas1p2.samsung.com>]
* Re: [PATCH v7 0/3] FDP and per-io hints [not found] ` <CGME20241010070738eucas1p2057209e5f669f37ca586ad4a619289ed@eucas1p2.samsung.com> @ 2024-10-10 7:07 ` Javier González 2024-10-10 9:13 ` Christoph Hellwig 0 siblings, 1 reply; 86+ messages in thread From: Javier González @ 2024-10-10 7:07 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Jens Axboe, Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g On 09.10.2024 09:06, Keith Busch wrote: >On Wed, Oct 09, 2024 at 11:28:28AM +0200, Christoph Hellwig wrote: >> On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote: >> > Then let's just continue with patches 1 and 2. They introduce no new >> > user or kernel APIs, and people have already reported improvements using >> > it. >> >> They are still not in any way actually exposing the FDP functionality >> in the standard though. How is your application going to align >> anything to the reclaim unit? Or is this another of the cases where >> as a hyperscaler you just "know" from the data sheet? > >As far as I know, this is an inconsequential spec detail that is not >being considered by any applications testing this. And yet, the expected >improvements are still there, so I don't see a point holding this up for >that reason. I am re-reading the thread to find common ground. I realize that my last email came sharper than I intended. I apologize for that, Christoph. Let me summarize and propose something constructive so that we can move forward. It is clear that you have a lot of insight into what an FS API should look like to unify the different data placement alternatives across protocols. I have not seen code for it, but based on the technical feedback you are providing, it seems to be fairly clear in your mind. I think we should attempt to pursue that with an example in mind. Seems XFS is the clear candidate. 
You have done work already on enabling SMR HDDs; it seems we can get FDP under that umbrella. This will, however, take time to get right. We can help with development, testing, and experimental evaluation of the WAF benefits of such an interface.

However, this work should not block existing hardware enabling an existing use-case. The current patches are not intrusive. They do not make changes to the API and merely wire up what is there to the driver. Anyone using temperatures will be able to use FDP - this is a win without a maintenance burden attached to it. The change to the FS / application API will not require major changes either; I believe we all agree that we cannot remove the temperatures, so all existing temperature users will be able to continue using them; we will just add an alternative for power users on the side.

So the proposal is to merge the patches as they are and commit to working together on a new, better API for in-kernel users (FS) and for applications (syscall, uring).

Christoph, would this be a viable way to move forward for you?

Thanks,
Javier

^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-10 7:07 ` Javier González
@ 2024-10-10 9:13 ` Christoph Hellwig
  2024-10-10 11:59 ` Javier González
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-10 9:13 UTC (permalink / raw)
To: Javier González
Cc: Keith Busch, Christoph Hellwig, Jens Axboe, Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Thu, Oct 10, 2024 at 09:07:36AM +0200, Javier González wrote:
> I think we should attempt to pursue that with an example in mind. Seems
> XFS is the clear candidate. You have done work already on enabling SMR
> HDDs; it seems we can get FDP under that umbrella. This will however
> take time to get right. We can help with development, testing, and
> experimental evaluation on the WAF benefits for such an interface.

Or ZNS SSDs for that matter.

> However, this work should not block existing hardware enabling an
> existing use-case. The current patches are not intrusive. They do not
> make changes to the API and merely wire up what is there to the driver.
> Anyone using temperatures will be able to use FDP - this is a win without
> a maintenance burden attached to it. The change to the FS / application
> API will not require major changes either; I believe we all agree that
> we cannot remove the temperatures, so all existing temperature users
> will be able to continue using them; we will just add an alternative for
> power users on the side.

As mentioned probably close to a dozen times over this thread and its
predecessors: keeping the per-file I/O hint API and mapping that to
FDP is fine. Exposing the overly simplistic hints to the NVMe driver
through the entire I/O stack and locking us into that is not.
> So the proposal is to merge the patches as they are and commit to work
> together on a new, better API for in-kernel users (FS), and for
> applications (syscall, uring).

And because of that, the whole "merge it now and fix it later" approach
unfortunately doesn't work, as a proper implementation will regress
behavior for at least some users that only have the I/O hints API but
try to emulate real stream separation with it. With the per-I/O hints
in io_uring that is in fact almost guaranteed.

^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-10 9:13 ` Christoph Hellwig
@ 2024-10-10 11:59 ` Javier González
  2024-10-11 9:02 ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Javier González @ 2024-10-10 11:59 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Jens Axboe, Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g

On 10.10.2024 11:13, Christoph Hellwig wrote:
>On Thu, Oct 10, 2024 at 09:07:36AM +0200, Javier González wrote:
>> I think we should attempt to pursue that with an example in mind. Seems
>> XFS is the clear candidate. You have done work already on enabling SMR
>> HDDs; it seems we can get FDP under that umbrella. This will however
>> take time to get right. We can help with development, testing, and
>> experimental evaluation on the WAF benefits for such an interface.
>
>Or ZNS SSDs for that matter.

Maybe. I do not see much movement on this.

>> However, this work should not block existing hardware enabling an
>> existing use-case. The current patches are not intrusive. They do not
>> make changes to the API and merely wire up what is there to the driver.
>> Anyone using temperatures will be able to use FDP - this is a win without
>> a maintenance burden attached to it. The change to the FS / application
>> API will not require major changes either; I believe we all agree that
>> we cannot remove the temperatures, so all existing temperature users
>> will be able to continue using them; we will just add an alternative for
>> power users on the side.
>
>As mentioned probably close to a dozen times over this thread and its
>predecessors: keeping the per-file I/O hint API and mapping that to
>FDP is fine. Exposing the overly simplistic hints to the NVMe driver
>through the entire I/O stack and locking us into that is not.
I don't understand the "locking us into that" part.

>> So the proposal is to merge the patches as they are and commit to work
>> together on a new, better API for in-kernel users (FS), and for
>> applications (syscall, uring).
>
>And because of that, the whole "merge it now and fix it later" approach
>unfortunately doesn't work, as a proper implementation will regress
>behavior for at least some users that only have the I/O hints API but
>try to emulate real stream separation with it. With the per-I/O hints
>in io_uring that is in fact almost guaranteed.

I do not think this argument is valid. These patches are not a "merge now, fix later". It is: use the current interface, and work together on a new one if needed. The new interface you are talking about is not clear to me at all. If you could share some patches showing how it would look, that would give a much clearer idea of what you have in mind. The whole idea of not covering an existing use-case because we might be able to do something in the future is not very tangible.

^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-10 11:59 ` Javier González
@ 2024-10-11 9:02 ` Christoph Hellwig
  2024-10-11 17:08 ` Jens Axboe
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-11 9:02 UTC (permalink / raw)
To: Javier González
Cc: Christoph Hellwig, Keith Busch, Jens Axboe, Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Thu, Oct 10, 2024 at 01:59:14PM +0200, Javier González wrote:
>> As mentioned probably close to a dozen times over this thread and its
>> predecessors: keeping the per-file I/O hint API and mapping that to
>> FDP is fine. Exposing the overly simplistic hints to the NVMe driver
>> through the entire I/O stack and locking us into that is not.
>
> I don't understand the "locking us into that" part.

The patches as submitted do the following two things:

 1) interpret the simple temperature hints to map to FDP reclaim handles
 2) add a new interface to set the temperature hints per I/O

and also rely on an annoying existing implementation detail where the I/O hints set on random files get automatically propagated to the block device without file system involvement.

This means we can't easily make the nvme driver actually use smarter hints that expose the actual FDP capabilities without breaking users that relied on the existing behavior, especially the per-I/O hints that counteract any kind of file system based data placement.

^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-11 9:02 ` Christoph Hellwig
@ 2024-10-11 17:08 ` Jens Axboe
  2024-10-14 6:21 ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Jens Axboe @ 2024-10-11 17:08 UTC (permalink / raw)
To: Christoph Hellwig, Javier González
Cc: Keith Busch, Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g

On 10/11/24 3:02 AM, Christoph Hellwig wrote:
>>> As mentioned probably close to a dozen times over this thread and its
>>> predecessors: keeping the per-file I/O hint API and mapping that to
>>> FDP is fine. Exposing the overly simplistic hints to the NVMe driver
>>> through the entire I/O stack and locking us into that is not.
>>
>> I don't understand the "locking us into that" part.
>
> The patches as submitted do the following two things:
>
> 1) interpret the simple temperature hints to map to FDP reclaim handles
> 2) add a new interface to set the temperature hints per I/O
>
> and also rely on an annoying existing implementation detail where the I/O
> hints set on random files get automatically propagated to the block
> device without file system involvement.
>
> This means we can't easily make the nvme driver actually use smarter
> hints that expose the actual FDP capabilities without breaking users
> that relied on the existing behavior, especially the per-I/O hints that
> counteract any kind of file system based data placement.

I think that last argument is a straw man - for any kind of interface like this, we've ALWAYS just had the rule that any per-whatever overrides the generic setting. Per-IO hints would do the same thing. Is this a mess if a user has assigned a file hint and uses other per-IO hints on that very same file and they differ? Certainly, but that's really just the user/app being dumb.
Or maybe being smart, as that one particular block is always hot in the file (yes, a contrived case). The point is, if you mix and match per-io hints and per-file hints, then you're probably being pretty silly, and nobody should do that. I don't think this is a practical concern at all.

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-11 17:08 ` Jens Axboe @ 2024-10-14 6:21 ` Christoph Hellwig 2024-10-14 7:02 ` Javier Gonzalez 0 siblings, 1 reply; 86+ messages in thread From: Christoph Hellwig @ 2024-10-14 6:21 UTC (permalink / raw) To: Jens Axboe Cc: Christoph Hellwig, Javier González, Keith Busch, Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g On Fri, Oct 11, 2024 at 11:08:26AM -0600, Jens Axboe wrote: > > I think that last argument is a straw man - for any kind of interface > like this, we've ALWAYS just had the rule that any per-whatever > overrides the generic setting. And exactly that is the problem. For file systems we can't support that sanely. So IFF you absolutely want the per-I/O hints we need an opt in by the file operations. I've said that at least twice in this discussion before, but as everyone likes to have political discussions instead of technical ones no one replied to that. ^ permalink raw reply [flat|nested] 86+ messages in thread
* RE: [PATCH v7 0/3] FDP and per-io hints 2024-10-14 6:21 ` Christoph Hellwig @ 2024-10-14 7:02 ` Javier Gonzalez 2024-10-14 7:47 ` Christoph Hellwig 0 siblings, 1 reply; 86+ messages in thread From: Javier Gonzalez @ 2024-10-14 7:02 UTC (permalink / raw) To: Christoph Hellwig, Jens Axboe Cc: Keith Busch, Martin K. Petersen, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] > -----Original Message----- > From: Christoph Hellwig <[email protected]> > Sent: Monday, October 14, 2024 8:21 AM > To: Jens Axboe <[email protected]> > Cc: Christoph Hellwig <[email protected]>; Javier Gonzalez <[email protected]>; > Keith Busch <[email protected]>; Martin K. Petersen > <[email protected]>; Kanchan Joshi <[email protected]>; > [email protected]; [email protected]; [email protected]; [email protected]; > [email protected]; [email protected]; [email protected]; [email protected]; > [email protected]; [email protected]; [email protected]; > [email protected]; [email protected]; linux- > [email protected]; [email protected]; [email protected]; > [email protected] > Subject: Re: [PATCH v7 0/3] FDP and per-io hints > > On Fri, Oct 11, 2024 at 11:08:26AM -0600, Jens Axboe wrote: > > > > I think that last argument is a straw man - for any kind of interface > > like this, we've ALWAYS just had the rule that any per-whatever > > overrides the generic setting. > > And exactly that is the problem. For file systems we can't support > that sanely. So IFF you absolutely want the per-I/O hints we need > an opt in by the file operations. I've said that at least twice > in this discussion before, but as everyone likes to have political > discussions instead of technical ones no one replied to that. 
Is it a way forward to add this in a new spin of the series - keeping the temperature mapping on the NVMe side? If not, what would be acceptable for a first version, before getting into adding a new interface to expose agnostic hints? ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-14 7:02 ` Javier Gonzalez
@ 2024-10-14 7:47 ` Christoph Hellwig
  2024-10-14 9:08 ` Javier Gonzalez
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-14 7:47 UTC (permalink / raw)
To: Javier Gonzalez
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Martin K. Petersen, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

On Mon, Oct 14, 2024 at 07:02:11AM +0000, Javier Gonzalez wrote:
> > And exactly that is the problem. For file systems we can't support
> > that sanely. So IFF you absolutely want the per-I/O hints we need
> > an opt-in by the file operations. I've said that at least twice
> > in this discussion before, but as everyone likes to have political
> > discussions instead of technical ones no one replied to that.
>
> Is it a way forward to add this in a new spin of the series - keeping the
> temperature mapping on the NVMe side?

What do you gain from that? NVMe does not understand data temperatures, so why make up that claim? Especially as it directly undermines any file system work to actually make use of it.

> If not, what would be acceptable for a first version, before getting into adding
> a new interface to expose agnostic hints?

Just iterate on Kanchan's series for a block layer (and possibly user-level, but that's less urgent) interface for stream separation?

^ permalink raw reply [flat|nested] 86+ messages in thread
* RE: [PATCH v7 0/3] FDP and per-io hints 2024-10-14 7:47 ` Christoph Hellwig @ 2024-10-14 9:08 ` Javier Gonzalez 2024-10-14 11:50 ` Christoph Hellwig 0 siblings, 1 reply; 86+ messages in thread From: Javier Gonzalez @ 2024-10-14 9:08 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Keith Busch, Martin K. Petersen, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] > -----Original Message----- > From: Christoph Hellwig <[email protected]> > Sent: Monday, October 14, 2024 9:47 AM > To: Javier Gonzalez <[email protected]> > Cc: Christoph Hellwig <[email protected]>; Jens Axboe <[email protected]>; Keith Busch > <[email protected]>; Martin K. Petersen <[email protected]>; Kanchan > Joshi <[email protected]>; [email protected]; [email protected]; > [email protected]; [email protected]; [email protected]; [email protected]; > [email protected]; [email protected]; [email protected]; > [email protected]; [email protected]; linux- > [email protected]; [email protected]; [email protected]; > [email protected]; [email protected]; [email protected] > Subject: Re: [PATCH v7 0/3] FDP and per-io hints > > On Mon, Oct 14, 2024 at 07:02:11AM +0000, Javier Gonzalez wrote: > > > And exactly that is the problem. For file systems we can't support > > > that sanely. So IFF you absolutely want the per-I/O hints we need > > > an opt in by the file operations. I've said that at least twice > > > in this discussion before, but as everyone likes to have political > > > discussions instead of technical ones no one replied to that. > > > > Is it a way forward to add this in a new spin of the series - keeping the > > temperature mapping on the NVMe side? > > What do you gain from that? 
> NVMe does not understand data temperatures,
> so why make up that claim?

I expressed this a couple of times in this thread. It is no problem to map temperatures to a protocol that does not understand the semantics. It is not perfect, and in time I agree that we would benefit from exposing the raw hints, without semantics, from the block layer upwards. But the temperatures are already there. We are not adding anything new. And thanks to this, we can enable existing users with _minimal_ changes to user-space. As pointed out in the XFS zoned discussions, this is very similar to what you guys are doing (exactly to re-use what is already there), and it works. On top of this, applications that already understand temperatures (e.g., RocksDB) will be able to leverage FDP SSDs without changes.

> Especially as it directly undermines any file system work to actually make use of it.

I do not think it does. If an FS wants to use the temperatures, it will be able to leverage FDP besides SCSI. And if we come up with a better interface later on, we can make the changes then. I really do not see the issue. If we were adding a temperature abstraction now, I would agree with you that we would need to cover the use-case you mention for FSs from the beginning, but this is already here. Seems like a fair compromise to support current users.

> > If not, what would be acceptable for a first version, before getting into adding
> > a new interface to expose agnostic hints?
>
> Just iterate on Kanchan's series for a block layer (and possibly user-level,
> but that's less urgent) interface for stream separation?

We can work on this later, but it is not that easy. Look at the mail I forwarded from Christian. We can now store hints in the same structure as the temperatures. I believe that we would be able to support 128 hints. If we are serious about supporting hints as a general interface, 128 hints is not nearly enough.
Moreover:

- How do we convince VFS folks to give us more space for hints at this point?
- How do we make agnostic hints and temperatures co-exist? Do we use a bit to select and create a protocol? This seems over-complicated.
- When we are exposing general hints to the block layer, how do we support N:N FS:block-device relationships? And DMs? This is not obvious, and it goes well beyond what we need today, but we probably need to think about it.

And these are just the questions that we know of. You see that this is a big lift (which we can work together on!), but it will delay current users by months. And honestly, I do not feel comfortable doing this alone; we need application and FS folks like you to help guide this so that it is really usable.

I say it again: your idea about properly supporting hints in the FS is good, and I think we should work on it. But I do not believe that the current patches (with added things like opt-in file ops) get in the way of this. So if the disagreement is about whether these patches are harmful, let's talk about that explicitly and try to agree; we do not need to go over the need for a new hint interface. We already agree on this.

^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-14 9:08 ` Javier Gonzalez
@ 2024-10-14 11:50 ` Christoph Hellwig
  2024-10-15 3:07 ` Javier Gonzalez
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-14 11:50 UTC (permalink / raw)
To: Javier Gonzalez
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Martin K. Petersen, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

On Mon, Oct 14, 2024 at 09:08:24AM +0000, Javier Gonzalez wrote:

[can you fix your mailer please, no full quotes, and especially not quotes of the mail headers]

> > What do you gain from that? NVMe does not understand data temperatures,
> > so why make up that claim?
>
> I expressed this a couple of times in this thread. It is no problem to
> map temperatures to a protocol that does not understand the semantics.

And I've agreed every time with you. But the important point is that we must not do it in the driver, where all context is lost.

> > Especially as it directly undermines any file system work to actually make use of it.
>
> I do not think it does. If an FS wants to use the temperatures, then they
> would be able to leverage FDP besides SCSI.

What do you mean by that? This is a bit too much whitepaper vocabulary.

We have code in XFS that can make use of the temperature hint. But to make the hints work, it actually needs to do real stream separation on the device. I.e., the file system consumes the temperature hints.

> And if we come up with a better interface later on, we can make the changes then.
> I really do not see the issue.
> If we were adding a temperature abstraction now, I would agree with
> you that we would need to cover the use-case you mention for FSs from
> the beginning, but this is already here. Seems like a fair compromise
> to support current users.

Again, I think the temperature hints at the syscall level aren't all bad. There are definitely a few things I'd like to do better in hindsight, but that's not the point. The problem is trying to turn them into stream separation all the way down in the driver, which is fundamentally broken.

> - How do we convince VFS folks to give us more space for hints at this point?

What space from VFS folks do you need for hints? And why does it matter?

^ permalink raw reply [flat|nested] 86+ messages in thread
* RE: [PATCH v7 0/3] FDP and per-io hints
  2024-10-14 11:50 ` Christoph Hellwig
@ 2024-10-15 3:07 ` Javier Gonzalez
  2024-10-15 5:30 ` Christoph Hellwig
  0 siblings, 1 reply; 86+ messages in thread
From: Javier Gonzalez @ 2024-10-15 3:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Martin K. Petersen, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

> On Mon, Oct 14, 2024 at 09:08:24AM +0000, Javier Gonzalez wrote:
> > > Especially as it directly undermines any file system work to actually make use
> > > of it.
> >
> > I do not think it does. If an FS wants to use the temperatures, then they
> > would be able to leverage FDP besides SCSI.
>
> What do you mean by that? This is a bit too much whitepaper vocabulary.
>
> We have code in XFS that can make use of the temperature hint. But to
> make them work it actually needs to do real stream separation on the
> device. I.e. the file system consumes the temperature hints.

The device can guarantee the stream separation without knowing the temperature.

> > And if we come up with a better interface later on, we can make the changes
> > then. I really do not see the issue. If we were adding a temperature
> > abstraction now, I would agree with you that we would need to cover the
> > use-case you mention for FSs from the beginning, but this is already here.
> > Seems like a fair compromise to support current users.
>
> Again, I think the temperature hints at the syscall level aren't all
> bad. There are definitely a few things I'd like to do better in hindsight,
> but that's not the point. The problem is trying to turn them into
> stream separation all the way down in the driver, which is fundamentally
> broken.
> > > - How do we convince VFS folks to give us more space for hints at this point?
> >
> > What space from VFS folks do you need for hints? And why does it
> > matter?

We need space in the inode to store the hint ID.

Look, this feels like going in circles. All this gaslighting is what makes it difficult to push patches when you just do not like the feature. It is the 3rd time I propose a way forward and you simply cannot provide any specific technical feedback - in the past email I posted several questions about the interface you seem to be talking about, and you explicitly omitted them.

^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-15 3:07 ` Javier Gonzalez
@ 2024-10-15 5:30 ` Christoph Hellwig
  0 siblings, 0 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-15 5:30 UTC (permalink / raw)
To: Javier Gonzalez
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Martin K. Petersen, Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

On Tue, Oct 15, 2024 at 03:07:57AM +0000, Javier Gonzalez wrote:
> > On Mon, Oct 14, 2024 at 09:08:24AM +0000, Javier Gonzalez wrote:
> > > > Especially as it directly undermines any file system work to actually make use
> > > > of it.
> > >
> > > I do not think it does. If an FS wants to use the temperatures, then they
> > > would be able to leverage FDP besides SCSI.
> >
> > What do you mean by that? This is a bit too much whitepaper vocabulary.
> >
> > We have code in XFS that can make use of the temperature hint. But to
> > make them work it actually needs to do real stream separation on the
> > device. I.e. the file system consumes the temperature hints.
>
> The device can guarantee the stream separation without knowing the temperature.

Of course. But that's entirely not the point.

> > What space from VFS folks do you need for hints? And why does it
> > matter?
>
> We need space in the inode to store the hint ID.
>
> Look, this feels like going in circles. All this gaslighting is what makes it difficult to

I'm so fucking tired of this. You are not even listening to my arguments at all, even when explaining every detail to you again and again. And then you accuse me of "gaslighting".

> push patches when you just do not like the feature.
> It is the 3rd time I propose a way forward and you simply cannot provide
> any specific technical feedback - in the past email I posted several
> questions about the interface you seem to be talking about and you
> explicitly omit that.

??? You are throwing random buzzwords around and not even replying to all the low-level technical points. If you think I missed something, please just ask again instead of this crap.

^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-09 15:06 ` Keith Busch
  [not found] ` <CGME20241010070738eucas1p2057209e5f669f37ca586ad4a619289ed@eucas1p2.samsung.com>
@ 2024-10-10 9:10 ` Christoph Hellwig
  1 sibling, 0 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-10 9:10 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Javier González, Jens Axboe, Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g

On Wed, Oct 09, 2024 at 09:06:25AM -0600, Keith Busch wrote:
> > anything to the reclaim unit? Or is this another of the cases where
> > as a hyperscaler you just "know" from the data sheet?
>
> As far as I know, this is an inconsequential spec detail that is not
> being considered by any applications testing this. And yet, the expected
> improvements are still there, so I don't see a point holding this up for
> that reason.

It was the whole point of the thing, and a major source of complexity. Although not quite for this use case, Hans has numbers showing that aligning application data tables / objects / etc. to the underlying "erase unit" absolutely matters.

And just to make it clear: the objection here is not that we have an interface that doesn't take that into account at the syscall level. We already have that interface and might as well make use of it. The problem is that the series tries to directly expose that to the driver, freezing us into this incomplete interface forever (assuming we have users actually picking it up). That's why I keep insisting like a broken record that we need to get this lower interface right first.

^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-09 9:28 ` Christoph Hellwig
  2024-10-09 15:06 ` Keith Busch
@ 2024-10-09 16:28 ` Nitesh Shetty
  1 sibling, 0 replies; 86+ messages in thread
From: Nitesh Shetty @ 2024-10-09 16:28 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Javier González, Jens Axboe, Martin K. Petersen, Kanchan Joshi, hare, sagi, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g

On 09/10/24 11:28AM, Christoph Hellwig wrote:
>On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote:
>> Then let's just continue with patches 1 and 2. They introduce no new
>> user or kernel APIs, and people have already reported improvements using
>> it.
>
>They are still not in any way actually exposing the FDP functionality
>in the standard though. How is your application going to align
>anything to the reclaim unit? Or is this another of the cases where
>as a hyperscaler you just "know" from the data sheet?

I think Keith already[1] mentioned a couple of times in previous replies that this is an optional feature; if experiments don't give better results, FDP hints can be disabled.

>But also given that the submitter completely disappeared and refuses
>to even discuss his patches I think they are simply abandonware at
>this point anyway.

Kanchan is away due to personal reasons; he is expected to be back in a week or so.

Regards,
Nitesh Shetty

[1] https://lore.kernel.org/all/[email protected]/

^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-02 15:13 ` Christoph Hellwig
  2024-10-02 15:17   ` Keith Busch
@ 2024-10-02 15:22   ` Jens Axboe
  1 sibling, 0 replies; 86+ messages in thread
From: Jens Axboe @ 2024-10-02 15:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, kbusch, hare, sagi, martin.petersen, brauner,
      viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
      linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
      gost.dev, vishak.g, javier.gonz

On 10/2/24 9:13 AM, Christoph Hellwig wrote:
> On Wed, Oct 02, 2024 at 09:03:22AM -0600, Jens Axboe wrote:
>>> The previous stream separation approach made total sense, but just
>>> needed a fair amount of work. But it closely matches how things work
>>> at the hardware and file system level, so it was the right approach.
>>
>> What am I missing that makes this effort that different from streams?
>> Both are a lifetime hint.
>
> A stream has a lot less strings attached. But hey, don't make me
> argue for streams - you pushed the API in despite reservations back
> then, and we've learned a lot since.

Hah, I remember arguing for it back then! It was just a pain to use,
but thankfully we could kill it when devices didn't materialize with a
quality of implementation that made them useful. And then we just
yanked it. Nothing is forever, and hints in particular are something
that can always be ignored. Same for this.

>>> Suddenly dropping that and ignoring all comments really feels like
>>> someone urgently needs to pull a marketing stunt here.
>>
>> I think someone is just trying to finally get this feature in, so it
>> can get used by customers, after many months of fairly unproductive
>> discussion.
>
> Well, he finally started on the right approach and gave it up after the
> first round of feedback. That's just a bit weird.

I'm a big fan of keep-it-simple-stupid, and then you can improve it
down the line. I think the simple approach stands just fine alone.

--
Jens Axboe

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-01  9:20 ` [PATCH v7 0/3] FDP and per-io hints Christoph Hellwig
  2024-10-01 15:58   ` James R. Bergsten
  2024-10-01 16:18   ` Jens Axboe
@ 2024-10-01 16:23   ` Keith Busch
  2024-10-02  7:49     ` Christoph Hellwig
  2 siblings, 1 reply; 86+ messages in thread
From: Keith Busch @ 2024-10-01 16:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner,
      viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
      linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
      gost.dev, vishak.g, javier.gonz

On Tue, Oct 01, 2024 at 11:20:48AM +0200, Christoph Hellwig wrote:
> Any reason you completely ignored my feedback on the last version
> and did not even answer?
>
> That's not a very productive way to work.

I think it's because he's getting conflicting feedback. The argument
against it is that it's not a good match; however, it's the same match
created for streams, no one complained then, and it's an existing user
ABI. FDP is the new "streams", so I really don't see a big deal here.

I understand you have use cases where filesystems intercept the hints
for different types of devices, but I don't see these use cases as
mutually exclusive. Supporting this one doesn't prevent other uses.

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-01 16:23 ` Keith Busch @ 2024-10-02 7:49 ` Christoph Hellwig 2024-10-02 14:56 ` Keith Busch 0 siblings, 1 reply; 86+ messages in thread From: Christoph Hellwig @ 2024-10-02 7:49 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On Tue, Oct 01, 2024 at 10:23:43AM -0600, Keith Busch wrote: > I think because he's getting conflicting feedback. The arguments against > it being that it's not a good match, however it's the same match created > for streams, and no one complained then, and it's an existing user ABI. People complained, and the streams code got removed pretty quickly because it did not work as expected. I don't think that counts as a big success. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-02 7:49 ` Christoph Hellwig @ 2024-10-02 14:56 ` Keith Busch 2024-10-02 15:00 ` Jens Axboe 0 siblings, 1 reply; 86+ messages in thread From: Keith Busch @ 2024-10-02 14:56 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On Wed, Oct 02, 2024 at 09:49:26AM +0200, Christoph Hellwig wrote: > On Tue, Oct 01, 2024 at 10:23:43AM -0600, Keith Busch wrote: > > I think because he's getting conflicting feedback. The arguments against > > it being that it's not a good match, however it's the same match created > > for streams, and no one complained then, and it's an existing user ABI. > > People complained, and the streams code got removed pretty quickly > because it did not work as expected. I don't think that counts as > a big success. I don't think the kernel API was the problem. Capable devices never materialized, so the code wasn't doing anything useful. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-02 14:56 ` Keith Busch @ 2024-10-02 15:00 ` Jens Axboe 0 siblings, 0 replies; 86+ messages in thread From: Jens Axboe @ 2024-10-02 15:00 UTC (permalink / raw) To: Keith Busch, Christoph Hellwig Cc: Kanchan Joshi, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On 10/2/24 8:56 AM, Keith Busch wrote: > On Wed, Oct 02, 2024 at 09:49:26AM +0200, Christoph Hellwig wrote: >> On Tue, Oct 01, 2024 at 10:23:43AM -0600, Keith Busch wrote: >>> I think because he's getting conflicting feedback. The arguments against >>> it being that it's not a good match, however it's the same match created >>> for streams, and no one complained then, and it's an existing user ABI. >> >> People complained, and the streams code got removed pretty quickly >> because it did not work as expected. I don't think that counts as >> a big success. > > I don't think the kernel API was the problem. Capable devices never > materialized, so the code wasn't doing anything useful. Exactly. I never saw ones that kept the stream persistent across GC, and hence they ended up being pretty useless in practice. -- Jens Axboe ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-09-30 18:13 ` [PATCH v7 0/3] FDP and per-io hints Kanchan Joshi ` (3 preceding siblings ...) 2024-10-01 9:20 ` [PATCH v7 0/3] FDP and per-io hints Christoph Hellwig @ 2024-10-03 0:20 ` Bart Van Assche 2024-10-15 5:50 ` Christoph Hellwig 5 siblings, 0 replies; 86+ messages in thread From: Bart Van Assche @ 2024-10-03 0:20 UTC (permalink / raw) To: Kanchan Joshi, axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence Cc: linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On 9/30/24 11:13 AM, Kanchan Joshi wrote: > Another spin to incorporate the feedback from LPC and previous > iterations. The series adds two capabilities: > - FDP support at NVMe level (patch #1) > - Per-io hinting via io_uring (patch #3) > Patch #2 is needed to do per-io hints. > > The motivation and interface details are present in the commit > descriptions. Something is missing from the patch descriptions. What is the goal of this patch series? Is the goal to support FDP in filesystems in the kernel, is the goal to support filesystems implemented in user space or perhaps something else? I think this should be mentioned in the cover letter. Thanks, Bart. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-09-30 18:13 ` [PATCH v7 0/3] FDP and per-io hints Kanchan Joshi
                     ` (4 preceding siblings ...)
  2024-10-03  0:20 ` Bart Van Assche
@ 2024-10-15  5:50 ` Christoph Hellwig
  2024-10-15 15:09   ` Keith Busch
  2024-10-17 14:35   ` Kanchan Joshi
  5 siblings, 2 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-15 5:50 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: axboe, kbusch, hch, hare, sagi, martin.petersen, brauner, viro,
      jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence,
      linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio,
      gost.dev, vishak.g, javier.gonz

As the current discussion is probably too tiring for anyone who is not
a professional mud fighter, I'll summarize my objections and comments
here once again in a way that even aspiring middle managers should be
able to understand.

1) While the current per-file temperature hints interface is not
perfect, it is okay and makes sense to reuse until we need something
more fancy. We make good use of it in f2fs and the upcoming zoned xfs
code to help with data placement and have numbers to show that it
helps.

2) A per-I/O interface to set these temperature hints conflicts badly
with how placement works in file systems. If we have an urgent need
for it on the block device, it needs to be opt-in at the file
operations level, so it can be enabled on block devices but not on
file systems by default. If a given file system finds a way to
implement it, it can still opt into implementing it, of course.

3) Mapping from temperature hints to separate write streams needs to
happen above the block layer, because file systems need to be in
control of it to do intelligent placement. That means if you want to
map from temperature hints to stream separation, it needs to be
implemented at the file operation layer, not in the device driver.
The mapping implemented in this series is probably only useful for
block devices. Maybe if dumb file systems want to adopt it, it could
be split into library code for reuse, but as usual that's probably
best done only when actually needed.

4) To support this, the block layer, that is bios and requests, needs
to support a notion of stream separation. Kanchan's previous series
had most of the bits for that; it just needs to be iterated on.

All of this could probably have been done easily in the time spent on
this discussion.

^ permalink raw reply	[flat|nested] 86+ messages in thread
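Point 2 above, the per-file_operations opt-in, can be sketched in plain C. This is a toy userspace model, not the kernel API: the `FOP_PER_IO_HINT` bit and the struct fields are hypothetical stand-ins for whatever opt-in flag the series would actually add.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical opt-in bit: only file_operations instances that can
 * honor per-io placement (e.g. the block device fops) would set it. */
#define FOP_PER_IO_HINT (1u << 0)

struct fops {
	uint32_t fop_flags;
};

struct kiocb {
	const struct fops *f_op;
	uint16_t ki_write_hint;	/* 0 == not set */
};

/* Accept a per-io hint only if this fops instance opted in, so file
 * systems do not silently receive hints they cannot place sensibly. */
int kiocb_set_write_hint(struct kiocb *iocb, uint16_t hint)
{
	if (!(iocb->f_op->fop_flags & FOP_PER_IO_HINT))
		return -EOPNOTSUPP;
	iocb->ki_write_hint = hint;
	return 0;
}
```

In this model, block/fops.c sets the bit while file systems leave it clear until their allocators can really honor per-io placement, which is exactly the split argued for above.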
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-15 5:50 ` Christoph Hellwig @ 2024-10-15 15:09 ` Keith Busch 2024-10-15 15:22 ` Christoph Hellwig 2024-10-17 14:35 ` Kanchan Joshi 1 sibling, 1 reply; 86+ messages in thread From: Keith Busch @ 2024-10-15 15:09 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On Tue, Oct 15, 2024 at 07:50:06AM +0200, Christoph Hellwig wrote: > 1) While the current per-file temperature hints interface is not perfect > it is okay and make sense to reuse until we need something more fancy. > We make good use of it in f2fs and the upcoming zoned xfs code to help > with data placement and have numbers to show that it helps. So we're okay to proceed with patch 1? > 2) A per-I/O interface to set these temperature hint conflicts badly > with how placement works in file systems. If we have an urgent need > for it on the block device it needs to be opt-in by the file operations > so it can be enabled on block device, but not on file systems by > default. This way you can implement it for block device, but not > provide it on file systems by default. If a given file system finds > a way to implement it it can still opt into implementing it of course. If we add a new fop_flag that only block fops enables, then it's okay? > 3) Mapping from temperature hints to separate write streams needs to > happen above the block layer, because file systems need to be in > control of it to do intelligent placement. That means if you want to > map from temperature hints to stream separation it needs to be > implemented at the file operation layer, not in the device driver. > The mapping implemented in this series is probably only useful for > block devices. 
Maybe if dumb file systems want to adopt it, it could > be split into library code for reuse, but as usual that's probably > best done only when actually needed. IMO, I don't even think the io_uring per-io hint needs to be limited to the fcntl lifetime values. It could just be a u16 value opaque to the block layer that just gets forwarded to the device. > 4) To support this the block layer, that is bios and requests need > to support a notion of stream separation. Kanchan's previous series > had most of the bits for that, it just needs to be iterated on. > > All of this could have probably be easily done in the time spent on > this discussion. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-15 15:09 ` Keith Busch
@ 2024-10-15 15:22   ` Christoph Hellwig
  0 siblings, 0 replies; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-15 15:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Kanchan Joshi, axboe, hare, sagi,
      martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
      bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
      linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Tue, Oct 15, 2024 at 09:09:20AM -0600, Keith Busch wrote:
> On Tue, Oct 15, 2024 at 07:50:06AM +0200, Christoph Hellwig wrote:
> > 1) While the current per-file temperature hints interface is not perfect
> > it is okay and make sense to reuse until we need something more fancy.
> > We make good use of it in f2fs and the upcoming zoned xfs code to help
> > with data placement and have numbers to show that it helps.
>
> So we're okay to proceed with patch 1?

No, see points 3 and 4 below for why. We'll need something like the
interface suggested by me in point 4 and by you in reply to point 3 in
the block layer, and then block/fops.c can implement the mapping on
top of that for drivers supporting it.

> > 2) A per-I/O interface to set these temperature hint conflicts badly
> > with how placement works in file systems. If we have an urgent need
> > for it on the block device it needs to be opt-in by the file operations
> > so it can be enabled on block device, but not on file systems by
> > default. This way you can implement it for block device, but not
> > provide it on file systems by default. If a given file system finds
> > a way to implement it it can still opt into implementing it of course.
>
> If we add a new fop_flag that only block fops enables, then it's okay?

The flag is just one part of it. Of course it needs to be discoverable
from userspace in one way or another, and the marshalling of the flag
needs to be controlled by the file system / fops instance.

> > 3) Mapping from temperature hints to separate write streams needs to
> > happen above the block layer, because file systems need to be in
> > control of it to do intelligent placement. That means if you want to
> > map from temperature hints to stream separation it needs to be
> > implemented at the file operation layer, not in the device driver.
> > The mapping implemented in this series is probably only useful for
> > block devices. Maybe if dumb file systems want to adopt it, it could
> > be split into library code for reuse, but as usual that's probably
> > best done only when actually needed.
>
> IMO, I don't even think the io_uring per-io hint needs to be limited to
> the fcntl lifetime values. It could just be a u16 value opaque to the
> block layer that just gets forwarded to the device.

Well, that's what I've been arguing for all the time, and what
Kanchan's previous series was working towards. It's not quite that
trivial, as we need a bit more than just the stream, e.g. a way to
discover how many of them exist.

> > 4) To support this the block layer, that is bios and requests need
> > to support a notion of stream separation. Kanchan's previous series
> > had most of the bits for that, it just needs to be iterated on.
> >
> > All of this could have probably be easily done in the time spent on
> > this discussion.
---end quoted text---

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-15  5:50 ` Christoph Hellwig
@ 2024-10-17 14:35   ` Kanchan Joshi
  2024-10-17 15:23     ` Christoph Hellwig
  1 sibling, 1 reply; 86+ messages in thread
From: Kanchan Joshi @ 2024-10-17 14:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, hare, sagi, martin.petersen, brauner, viro, jack,
      jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme,
      linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev,
      vishak.g, javier.gonz

Seems per-I/O hints are not getting the love they deserve.
Apart from the block device, the use case is when all I/Os of a VM (or
container) are to be grouped together or placed differently. Per-IO
hints are fine-grained (enough for userspace to build
per-process/vm/file/whatever policies) and add the flexibility we have
lacked so far.

As for the conflict, I doubt that it exists. Please see below:

> 2) A per-I/O interface to set these temperature hint conflicts badly
> with how placement works in file systems. If we have an urgent need
> for it on the block device it needs to be opt-in by the file operations
> so it can be enabled on block device, but not on file systems by
> default. This way you can implement it for block device, but not
> provide it on file systems by default. If a given file system finds
> a way to implement it it can still opt into implementing it of course.

Why do you see this as something that is so different across
filesystems that they would need to "find a way to implement"?

Both per-file and per-io hints are supplied by userspace. The inode
and kiocb only happen to be the means to receive the hint information.
The FS is free to use this information (iff it wants) or simply
forward it down.

Per-file hints just get stored (within the inode) without individual
FS involvement. Per-io hints follow the same model (i.e., set by an
upper layer like io_uring/aio) and use the kiocb to store the hint.
This does not alter the stored inode hint value!

After patch #3, filesystems have both per-file and per-io information
available. And as before, they can use that hint info (from the kiocb
or inode) and/or simply forward it.

The generic code (like fs/direct-io.c, fs/iomap/direct-io.c, etc.)
already forwards the incoming hints, without any intelligence. And we
need just that, because with user-passed hints the onus of
intelligence is on the userspace. This is how hints work on
xfs/ext4/btrfs files at this point.

The owner of the file (filesystem, block, whatever) can still use the
incoming hints to do any custom/extra feature, either from the inode
or from the kiocb, depending on which information (per-file or per-io
hint) it finds more useful for that custom feature. For example, F2FS
is using per-inode information to do something custom, and that part
has not been changed by these patches.

Overall, I do not see the conflict. It's all user-driven. No?

For the currently uncommon case where the FS decides to generate its
own hints (as opposed to being a userspace hint consumer/forwarder),
it can directly put the hint value in the bio.

^ permalink raw reply	[flat|nested] 86+ messages in thread
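The precedence described above (a per-io hint carried in the kiocb wins, the per-inode fcntl hint is the fallback, and the stored inode value is never modified) can be sketched in a few lines of plain C. The struct layouts are simplified stand-ins, though the hint values mirror the kernel's existing `enum rw_hint` lifetime hints.

```c
#include <assert.h>

/* Lifetime hint values as used by the fcntl F_SET_RW_HINT interface. */
enum rw_hint {
	WRITE_LIFE_NOT_SET = 0,
	WRITE_LIFE_NONE,
	WRITE_LIFE_SHORT,
	WRITE_LIFE_MEDIUM,
	WRITE_LIFE_LONG,
	WRITE_LIFE_EXTREME,
};

struct inode {
	enum rw_hint i_write_hint;	/* per-file hint, set via fcntl */
};

struct kiocb {
	const struct inode *inode;
	enum rw_hint ki_write_hint;	/* per-io hint, set by io_uring/aio */
};

/* Effective hint for one write: the per-io value when set, otherwise
 * the per-file value. The inode's stored hint is left untouched. */
enum rw_hint effective_write_hint(const struct kiocb *iocb)
{
	if (iocb->ki_write_hint != WRITE_LIFE_NOT_SET)
		return iocb->ki_write_hint;
	return iocb->inode->i_write_hint;
}
```

An I/O submitted without a hint falls through to whatever was set on the file, which is the "no regression for per-inode hints" behavior the cover letter claims to have tested.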
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-17 14:35 ` Kanchan Joshi
@ 2024-10-17 15:23   ` Christoph Hellwig
  2024-10-17 15:44     ` Keith Busch
  0 siblings, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-17 15:23 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Christoph Hellwig, axboe, kbusch, hare, sagi, martin.petersen,
      brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche,
      asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block,
      linux-aio, gost.dev, vishak.g, javier.gonz

On Thu, Oct 17, 2024 at 08:05:38PM +0530, Kanchan Joshi wrote:
> Seems per-I/O hints are not getting the love they deserve.
> Apart from the block device, the usecase is when all I/Os of VM (or
> container) are to be grouped together or placed differently.

But that assumes the file system could actually support it, which is
hard unless you treat the file system as simply a passthrough entity,
and that will not give you great results.

> > 2) A per-I/O interface to set these temperature hint conflicts badly
> > with how placement works in file systems. If we have an urgent need
> > for it on the block device it needs to be opt-in by the file operations
> > so it can be enabled on block device, but not on file systems by
> > default. This way you can implement it for block device, but not
> > provide it on file systems by default. If a given file system finds
> > a way to implement it it can still opt into implementing it of course.
>
> Why do you see this as something that is so different across filesystems
> that they would need to "find a way to implement"?

If you want to do useful stream separation you need to write data
sequentially into the stream. Now with streams or FDP that does not
actually imply sequentially in LBA space, but if you want the file
system to not actually deal with fragmentation from hell, and to
easily track what is grouped together, you really want it sequential
in the LBA space as well. In other words, any kind of write placement
needs to be intimately tied to the file system block allocator.

> Both per-file and per-io hints are supplied by userspace. Inode and
> kiocb only happen to be the mean to receive the hint information.
> FS is free to use this information (iff it wants) or simply forward this
> down.

As mentioned above, just passing it down is not actually very useful.
It might give you nice benchmark numbers when you basically
reimplement space management in userspace on a fully preallocated
file, but for that you're better off just using the block device. If
you actually want to treat the files as files you need full file
system involvement.

> Per-file hint just gets stored (within inode) without individual FS
> involvement. Per-io hint follows the same model (i.e., it is set by
> upper layer like io_uring/aio) and uses kiocb to store the hint. It does
> not alter the stored inode hint value!

Yes, and now you'll get complaints that the file system ignores it
when it can't properly support it. This is why we need a per-fop
opt-in.

> The generic code (like fs/direct-io.c, fs/iomap/direct-io.c etc.,)
> already forwards the incoming hints, without any intelligence.

Yes, and that is a problem. We stopped doing that, but Samsung sneaked
some of this back in recently, as I noticed.

> Overall, I do not see the conflict. It's all user-driven. No?

I have the gut feeling that you've just run benchmarks on image files
emulating block devices and not actually tried real file system
workloads based on this, unfortunately.

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-17 15:23 ` Christoph Hellwig @ 2024-10-17 15:44 ` Keith Busch 2024-10-17 15:46 ` Christoph Hellwig 2024-10-17 16:15 ` Bart Van Assche 0 siblings, 2 replies; 86+ messages in thread From: Keith Busch @ 2024-10-17 15:44 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On Thu, Oct 17, 2024 at 05:23:37PM +0200, Christoph Hellwig wrote: > If you want to do useful stream separation you need to write data > sequentially into the stream. Now with streams or FDP that does not > actually imply sequentially in LBA space, but if you want the file > system to not actually deal with fragmentation from hell, and be > easily track what is grouped together you really want it sequentially > in the LBA space as well. In other words, any kind of write placement > needs to be intimately tied to the file system block allocator. I'm replying just to make sure I understand what you're saying: If we send per IO hints on a file, we could have interleaved hot and cold pages at various offsets of that file, so the filesystem needs an efficient way to allocate extents and track these so that it doesn't interleave these in LBA space. I think that makes sense. We can add a fop_flags and block/fops.c can be the first one to turn it on since that LBA access is entirely user driven. ^ permalink raw reply [flat|nested] 86+ messages in thread
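The fragmentation concern in this subthread can be illustrated with a toy allocator: each hint maps to a stream region that only ever appends, so hot and cold writes never interleave in LBA space. Everything here (region size, the trivial hint-to-stream mapping) is invented for illustration, not taken from the series.

```c
#include <assert.h>
#include <stdint.h>

#define NR_STREAMS  4
#define STREAM_BLKS 1024	/* blocks per stream region */

struct stream_alloc {
	uint64_t wp[NR_STREAMS];	/* per-stream write pointer, in blocks */
};

/* Allocate nblocks for a write carrying `hint`. Returns the starting
 * LBA, or UINT64_MAX when the chosen stream's region is exhausted.
 * Each stream only ever appends, so its LBAs stay sequential. */
uint64_t stream_alloc_blocks(struct stream_alloc *a, unsigned hint,
			     uint64_t nblocks)
{
	unsigned s = hint % NR_STREAMS;	/* toy hint-to-stream mapping */
	uint64_t lba;

	if (a->wp[s] + nblocks > STREAM_BLKS)
		return UINT64_MAX;
	lba = (uint64_t)s * STREAM_BLKS + a->wp[s];
	a->wp[s] += nblocks;
	return lba;
}
```

Interleaving a hot and a cold write still yields contiguous LBAs within each stream, which is what lets the device (or a zoned file system) reclaim a whole region at once instead of garbage-collecting mixed data.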
* Re: [PATCH v7 0/3] FDP and per-io hints
  2024-10-17 15:44 ` Keith Busch
@ 2024-10-17 15:46   ` Christoph Hellwig
  2024-10-17 16:06     ` Keith Busch
  1 sibling, 1 reply; 86+ messages in thread
From: Christoph Hellwig @ 2024-10-17 15:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Kanchan Joshi, axboe, hare, sagi,
      martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells,
      bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring,
      linux-block, linux-aio, gost.dev, vishak.g, javier.gonz

On Thu, Oct 17, 2024 at 09:44:39AM -0600, Keith Busch wrote:
> I'm replying just to make sure I understand what you're saying:
>
> If we send per IO hints on a file, we could have interleaved hot and
> cold pages at various offsets of that file, so the filesystem needs an
> efficient way to allocate extents and track these so that it doesn't
> interleave these in LBA space. I think that makes sense.

That's a little simplified, but yes.

> We can add a fop_flags and block/fops.c can be the first one to turn it
> on since that LBA access is entirely user driven.

Yes, that's my main request on the per-I/O hint interface.

Now we just need to not dumb down the bio-level interface to five
temperature levels and instead expose the different write streams, and
we can all have a happy Kumbaya.

^ permalink raw reply	[flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-17 15:46 ` Christoph Hellwig @ 2024-10-17 16:06 ` Keith Busch 0 siblings, 0 replies; 86+ messages in thread From: Keith Busch @ 2024-10-17 16:06 UTC (permalink / raw) To: Christoph Hellwig Cc: Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, bvanassche, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On Thu, Oct 17, 2024 at 05:46:49PM +0200, Christoph Hellwig wrote: > On Thu, Oct 17, 2024 at 09:44:39AM -0600, Keith Busch wrote: > > > We can add a fop_flags and block/fops.c can be the first one to turn it > > on since that LBA access is entirely user driven. > > Yes, that's my main request on the per-I/O hint interface. > > Now we just need to not dumb down the bio level interface to five > temeperature level and just expose the different write streams and we > can all have a happy Kumbaya. I'm sending a revised version of this patch set that attempts to address these issues. The io_uring hint is weird to me, so that part is simplified just to get the point across. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-17 15:44 ` Keith Busch 2024-10-17 15:46 ` Christoph Hellwig @ 2024-10-17 16:15 ` Bart Van Assche 2024-10-17 16:23 ` Keith Busch 1 sibling, 1 reply; 86+ messages in thread From: Bart Van Assche @ 2024-10-17 16:15 UTC (permalink / raw) To: Keith Busch, Christoph Hellwig Cc: Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On 10/17/24 8:44 AM, Keith Busch wrote: > On Thu, Oct 17, 2024 at 05:23:37PM +0200, Christoph Hellwig wrote: >> If you want to do useful stream separation you need to write data >> sequentially into the stream. Now with streams or FDP that does not >> actually imply sequentially in LBA space, but if you want the file >> system to not actually deal with fragmentation from hell, and be >> easily track what is grouped together you really want it sequentially >> in the LBA space as well. In other words, any kind of write placement >> needs to be intimately tied to the file system block allocator. > > I'm replying just to make sure I understand what you're saying: > > If we send per IO hints on a file, we could have interleaved hot and > cold pages at various offsets of that file, so the filesystem needs an > efficient way to allocate extents and track these so that it doesn't > interleave these in LBA space. I think that makes sense. > > We can add a fop_flags and block/fops.c can be the first one to turn it > on since that LBA access is entirely user driven. Does anyone care about buffered I/O to block devices? When using buffered I/O, the write_hint information from the inode is used and the per I/O write_hint information is ignored. Thanks, Bart. ^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [PATCH v7 0/3] FDP and per-io hints 2024-10-17 16:15 ` Bart Van Assche @ 2024-10-17 16:23 ` Keith Busch 0 siblings, 0 replies; 86+ messages in thread From: Keith Busch @ 2024-10-17 16:23 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, Kanchan Joshi, axboe, hare, sagi, martin.petersen, brauner, viro, jack, jaegeuk, bcrl, dhowells, asml.silence, linux-nvme, linux-fsdevel, io-uring, linux-block, linux-aio, gost.dev, vishak.g, javier.gonz On Thu, Oct 17, 2024 at 09:15:21AM -0700, Bart Van Assche wrote: > On 10/17/24 8:44 AM, Keith Busch wrote: > > On Thu, Oct 17, 2024 at 05:23:37PM +0200, Christoph Hellwig wrote: > > > If you want to do useful stream separation you need to write data > > > sequentially into the stream. Now with streams or FDP that does not > > > actually imply sequentially in LBA space, but if you want the file > > > system to not actually deal with fragmentation from hell, and be > > > easily track what is grouped together you really want it sequentially > > > in the LBA space as well. In other words, any kind of write placement > > > needs to be intimately tied to the file system block allocator. > > > > I'm replying just to make sure I understand what you're saying: > > > > If we send per IO hints on a file, we could have interleaved hot and > > cold pages at various offsets of that file, so the filesystem needs an > > efficient way to allocate extents and track these so that it doesn't > > interleave these in LBA space. I think that makes sense. > > > > We can add a fop_flags and block/fops.c can be the first one to turn it > > on since that LBA access is entirely user driven. > > Does anyone care about buffered I/O to block devices? When using > buffered I/O, the write_hint information from the inode is used and the per > I/O write_hint information is ignored. 
I'm pretty sure there are applications that use buffered IO on raw block (ex: postgresql), but it's a moot point: the block file_operations that provide the fops_flags also provide the callbacks for O_DIRECT, which is where this matters. We can't really use per-io write_hints on buffered-io. At least not yet, and maybe never. I'm not sure if it makes sense for raw block because the page writes won't necessarily match writes to storage. ^ permalink raw reply [flat|nested] 86+ messages in thread