* [PATCH 0/3] zone-append support in aio and io-uring
[not found] <CGME20200617172653epcas5p488de50090415eb802e62acc0e23d8812@epcas5p4.samsung.com>
@ 2020-06-17 17:23 ` Kanchan Joshi
[not found] ` <CGME20200617172702epcas5p4dbf4729d31d9a85ab1d261d04f238e61@epcas5p4.samsung.com>
` (5 more replies)
0 siblings, 6 replies; 40+ messages in thread
From: Kanchan Joshi @ 2020-06-17 17:23 UTC (permalink / raw)
To: axboe, viro, bcrl
Cc: linux-fsdevel, linux-kernel, linux-aio, io-uring, linux-block,
selvakuma.s1, nj.shetty, javier.gonz, Kanchan Joshi
This patchset enables issuing zone-append using aio and io-uring direct-io interface.
For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
of the zone to issue append. On completion 'res2' field is used to return
zone-relative offset.
For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset
Kanchan Joshi (1):
aio: add support for zone-append
Selvakumar S (2):
fs,block: Introduce IOCB_ZONE_APPEND and direct-io handling
io_uring: add support for zone-append
fs/aio.c | 8 +++++
fs/block_dev.c | 19 +++++++++++-
fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
include/linux/fs.h | 1 +
include/uapi/linux/aio_abi.h | 1 +
include/uapi/linux/io_uring.h | 8 ++++-
6 files changed, 105 insertions(+), 4 deletions(-)
--
2.7.4
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 1/3] fs,block: Introduce IOCB_ZONE_APPEND and direct-io handling
[not found] ` <CGME20200617172702epcas5p4dbf4729d31d9a85ab1d261d04f238e61@epcas5p4.samsung.com>
@ 2020-06-17 17:23 ` Kanchan Joshi
2020-06-17 19:02 ` Pavel Begunkov
2020-06-18 7:16 ` Damien Le Moal
0 siblings, 2 replies; 40+ messages in thread
From: Kanchan Joshi @ 2020-06-17 17:23 UTC (permalink / raw)
To: axboe, viro, bcrl
Cc: linux-fsdevel, linux-kernel, linux-aio, io-uring, linux-block,
selvakuma.s1, nj.shetty, javier.gonz, Kanchan Joshi
From: Selvakumar S <[email protected]>
Introduce IOCB_ZONE_APPEND flag, which is set in kiocb->ki_flags for
zone-append. Direct I/O submission path uses this flag to send bio with
append op. And completion path uses the same to return zone-relative
offset to upper layer.
Signed-off-by: SelvaKumar S <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Javier Gonzalez <[email protected]>
---
fs/block_dev.c | 19 ++++++++++++++++++-
include/linux/fs.h | 1 +
2 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 47860e5..4c84b4d0 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -185,6 +185,10 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb)
/* avoid the need for a I/O completion work item */
if (iocb->ki_flags & IOCB_DSYNC)
op |= REQ_FUA;
+#ifdef CONFIG_BLK_DEV_ZONED
+ if (iocb->ki_flags & IOCB_ZONE_APPEND)
+ op |= REQ_OP_ZONE_APPEND | REQ_NOMERGE;
+#endif
return op;
}
@@ -295,6 +299,14 @@ static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
}
+#ifdef CONFIG_BLK_DEV_ZONED
+static inline long blkdev_bio_end_io_append(struct bio *bio)
+{
+ return (bio->bi_iter.bi_sector %
+ blk_queue_zone_sectors(bio->bi_disk->queue)) << SECTOR_SHIFT;
+}
+#endif
+
static void blkdev_bio_end_io(struct bio *bio)
{
struct blkdev_dio *dio = bio->bi_private;
@@ -307,15 +319,20 @@ static void blkdev_bio_end_io(struct bio *bio)
if (!dio->is_sync) {
struct kiocb *iocb = dio->iocb;
ssize_t ret;
+ long res = 0;
if (likely(!dio->bio.bi_status)) {
ret = dio->size;
iocb->ki_pos += ret;
+#ifdef CONFIG_BLK_DEV_ZONED
+ if (iocb->ki_flags & IOCB_ZONE_APPEND)
+ res = blkdev_bio_end_io_append(bio);
+#endif
} else {
ret = blk_status_to_errno(dio->bio.bi_status);
}
- dio->iocb->ki_complete(iocb, ret, 0);
+ dio->iocb->ki_complete(iocb, ret, res);
if (dio->multi_bio)
bio_put(&dio->bio);
} else {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6c4ab4d..dc547b9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -315,6 +315,7 @@ enum rw_hint {
#define IOCB_SYNC (1 << 5)
#define IOCB_WRITE (1 << 6)
#define IOCB_NOWAIT (1 << 7)
+#define IOCB_ZONE_APPEND (1 << 8)
struct kiocb {
struct file *ki_filp;
--
2.7.4
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 2/3] aio: add support for zone-append
[not found] ` <CGME20200617172706epcas5p4dcbc164063f58bad95b211b9d6dfbfa9@epcas5p4.samsung.com>
@ 2020-06-17 17:23 ` Kanchan Joshi
2020-06-18 7:33 ` Damien Le Moal
0 siblings, 1 reply; 40+ messages in thread
From: Kanchan Joshi @ 2020-06-17 17:23 UTC (permalink / raw)
To: axboe, viro, bcrl
Cc: linux-fsdevel, linux-kernel, linux-aio, io-uring, linux-block,
selvakuma.s1, nj.shetty, javier.gonz, Kanchan Joshi, Arnav Dawn
Introduce IOCB_CMD_ZONE_APPEND opcode for zone-append. On append
completion zone-relative offset is returned using io_event->res2.
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Arnav Dawn <[email protected]>
Signed-off-by: SelvaKumar S <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Javier Gonzalez <[email protected]>
---
fs/aio.c | 8 ++++++++
include/uapi/linux/aio_abi.h | 1 +
2 files changed, 9 insertions(+)
diff --git a/fs/aio.c b/fs/aio.c
index 7ecddc2..8b10a55d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1579,6 +1579,10 @@ static int aio_write(struct kiocb *req, const struct iocb *iocb,
__sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true);
__sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE);
}
+#ifdef CONFIG_BLK_DEV_ZONED
+ if (iocb->aio_lio_opcode == IOCB_CMD_ZONE_APPEND)
+ req->ki_flags |= IOCB_ZONE_APPEND;
+#endif
req->ki_flags |= IOCB_WRITE;
aio_rw_done(req, call_write_iter(file, req, &iter));
}
@@ -1846,6 +1850,10 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
return aio_fsync(&req->fsync, iocb, true);
case IOCB_CMD_POLL:
return aio_poll(req, iocb);
+#ifdef CONFIG_BLK_DEV_ZONED
+ case IOCB_CMD_ZONE_APPEND:
+ return aio_write(&req->rw, iocb, false, compat);
+#endif
default:
pr_debug("invalid aio operation %d\n", iocb->aio_lio_opcode);
return -EINVAL;
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 8387e0a..541d96a 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -43,6 +43,7 @@ enum {
IOCB_CMD_NOOP = 6,
IOCB_CMD_PREADV = 7,
IOCB_CMD_PWRITEV = 8,
+ IOCB_CMD_ZONE_APPEND = 9,
};
/*
--
2.7.4
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 3/3] io_uring: add support for zone-append
[not found] ` <CGME20200617172713epcas5p352f2907a12bd4ee3c97be1c7d8e1569e@epcas5p3.samsung.com>
@ 2020-06-17 17:23 ` Kanchan Joshi
2020-06-17 18:55 ` Pavel Begunkov
2020-06-18 7:39 ` Damien Le Moal
0 siblings, 2 replies; 40+ messages in thread
From: Kanchan Joshi @ 2020-06-17 17:23 UTC (permalink / raw)
To: axboe, viro, bcrl
Cc: linux-fsdevel, linux-kernel, linux-aio, io-uring, linux-block,
selvakuma.s1, nj.shetty, javier.gonz, Kanchan Joshi
From: Selvakumar S <[email protected]>
Introduce three new opcodes for zone-append -
IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE
IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
Repurpose cqe->flags to return zone-relative offset.
Signed-off-by: SelvaKumar S <[email protected]>
Signed-off-by: Kanchan Joshi <[email protected]>
Signed-off-by: Nitesh Shetty <[email protected]>
Signed-off-by: Javier Gonzalez <[email protected]>
---
fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
include/uapi/linux/io_uring.h | 8 ++++-
2 files changed, 77 insertions(+), 3 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 155f3d8..c14c873 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -649,6 +649,10 @@ struct io_kiocb {
unsigned long fsize;
u64 user_data;
u32 result;
+#ifdef CONFIG_BLK_DEV_ZONED
+ /* zone-relative offset for append, in bytes */
+ u32 append_offset;
+#endif
u32 sequence;
struct list_head link_list;
@@ -875,6 +879,26 @@ static const struct io_op_def io_op_defs[] = {
.hash_reg_file = 1,
.unbound_nonreg_file = 1,
},
+ [IORING_OP_ZONE_APPEND] = {
+ .needs_mm = 1,
+ .needs_file = 1,
+ .unbound_nonreg_file = 1,
+ .pollout = 1,
+ },
+ [IORING_OP_ZONE_APPENDV] = {
+ .async_ctx = 1,
+ .needs_mm = 1,
+ .needs_file = 1,
+ .hash_reg_file = 1,
+ .unbound_nonreg_file = 1,
+ .pollout = 1,
+ },
+ [IORING_OP_ZONE_APPEND_FIXED] = {
+ .needs_file = 1,
+ .hash_reg_file = 1,
+ .unbound_nonreg_file = 1,
+ .pollout = 1,
+ },
};
static void io_wq_submit_work(struct io_wq_work **workptr);
@@ -1285,7 +1309,16 @@ static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags)
if (likely(cqe)) {
WRITE_ONCE(cqe->user_data, req->user_data);
WRITE_ONCE(cqe->res, res);
+#ifdef CONFIG_BLK_DEV_ZONED
+ if (req->opcode == IORING_OP_ZONE_APPEND ||
+ req->opcode == IORING_OP_ZONE_APPENDV ||
+ req->opcode == IORING_OP_ZONE_APPEND_FIXED)
+ WRITE_ONCE(cqe->res2, req->append_offset);
+ else
+ WRITE_ONCE(cqe->flags, cflags);
+#else
WRITE_ONCE(cqe->flags, cflags);
+#endif
} else if (ctx->cq_overflow_flushed) {
WRITE_ONCE(ctx->rings->cq_overflow,
atomic_inc_return(&ctx->cached_cq_overflow));
@@ -1961,6 +1994,9 @@ static void io_complete_rw_common(struct kiocb *kiocb, long res)
static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
{
struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
+#ifdef CONFIG_BLK_DEV_ZONED
+ req->append_offset = (u32)res2;
+#endif
io_complete_rw_common(kiocb, res);
io_put_req(req);
@@ -1976,6 +2012,9 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
if (res != req->result)
req_set_fail_links(req);
req->result = res;
+#ifdef CONFIG_BLK_DEV_ZONED
+ req->append_offset = (u32)res2;
+#endif
if (res != -EAGAIN)
WRITE_ONCE(req->iopoll_completed, 1);
}
@@ -2408,7 +2447,8 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
u8 opcode;
opcode = req->opcode;
- if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
+ if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED ||
+ opcode == IORING_OP_ZONE_APPEND_FIXED) {
*iovec = NULL;
return io_import_fixed(req, rw, iter);
}
@@ -2417,7 +2457,8 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
if (req->buf_index && !(req->flags & REQ_F_BUFFER_SELECT))
return -EINVAL;
- if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) {
+ if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE ||
+ opcode == IORING_OP_ZONE_APPEND) {
if (req->flags & REQ_F_BUFFER_SELECT) {
buf = io_rw_buffer_select(req, &sqe_len, needs_lock);
if (IS_ERR(buf)) {
@@ -2704,6 +2745,9 @@ static int io_write(struct io_kiocb *req, bool force_nonblock)
req->rw.kiocb.ki_flags &= ~IOCB_NOWAIT;
req->result = 0;
+#ifdef CONFIG_BLK_DEV_ZONED
+ req->append_offset = 0;
+#endif
io_size = ret;
if (req->flags & REQ_F_LINK_HEAD)
req->result = io_size;
@@ -2738,6 +2782,13 @@ static int io_write(struct io_kiocb *req, bool force_nonblock)
__sb_writers_release(file_inode(req->file)->i_sb,
SB_FREEZE_WRITE);
}
+#ifdef CONFIG_BLK_DEV_ZONED
+ if (req->opcode == IORING_OP_ZONE_APPEND ||
+ req->opcode == IORING_OP_ZONE_APPENDV ||
+ req->opcode == IORING_OP_ZONE_APPEND_FIXED)
+ kiocb->ki_flags |= IOCB_ZONE_APPEND;
+#endif
+
kiocb->ki_flags |= IOCB_WRITE;
if (!force_nonblock)
@@ -4906,6 +4957,12 @@ static int io_req_defer_prep(struct io_kiocb *req,
case IORING_OP_WRITEV:
case IORING_OP_WRITE_FIXED:
case IORING_OP_WRITE:
+#ifdef CONFIG_BLK_DEV_ZONED
+ fallthrough;
+ case IORING_OP_ZONE_APPEND:
+ case IORING_OP_ZONE_APPENDV:
+ case IORING_OP_ZONE_APPEND_FIXED:
+#endif
ret = io_write_prep(req, sqe, true);
break;
case IORING_OP_POLL_ADD:
@@ -5038,6 +5095,12 @@ static void io_cleanup_req(struct io_kiocb *req)
case IORING_OP_WRITEV:
case IORING_OP_WRITE_FIXED:
case IORING_OP_WRITE:
+#ifdef CONFIG_BLK_DEV_ZONED
+ fallthrough;
+ case IORING_OP_ZONE_APPEND:
+ case IORING_OP_ZONE_APPENDV:
+ case IORING_OP_ZONE_APPEND_FIXED:
+#endif
if (io->rw.iov != io->rw.fast_iov)
kfree(io->rw.iov);
break;
@@ -5086,6 +5149,11 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
}
ret = io_read(req, force_nonblock);
break;
+#ifdef CONFIG_BLK_DEV_ZONED
+ case IORING_OP_ZONE_APPEND:
+ case IORING_OP_ZONE_APPENDV:
+ case IORING_OP_ZONE_APPEND_FIXED:
+#endif
case IORING_OP_WRITEV:
case IORING_OP_WRITE_FIXED:
case IORING_OP_WRITE:
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 92c2269..6c8e932 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -130,6 +130,9 @@ enum {
IORING_OP_PROVIDE_BUFFERS,
IORING_OP_REMOVE_BUFFERS,
IORING_OP_TEE,
+ IORING_OP_ZONE_APPEND,
+ IORING_OP_ZONE_APPENDV,
+ IORING_OP_ZONE_APPEND_FIXED,
/* this goes last, obviously */
IORING_OP_LAST,
@@ -157,7 +160,10 @@ enum {
struct io_uring_cqe {
__u64 user_data; /* sqe->data submission passed back */
__s32 res; /* result code for this event */
- __u32 flags;
+ union {
+ __u32 res2; /* res2 like aio, currently used for zone-append */
+ __u32 flags;
+ };
};
/*
--
2.7.4
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 0/3] zone-append support in aio and io-uring
2020-06-17 17:23 ` [PATCH 0/3] zone-append support in aio and io-uring Kanchan Joshi
` (2 preceding siblings ...)
[not found] ` <CGME20200617172713epcas5p352f2907a12bd4ee3c97be1c7d8e1569e@epcas5p3.samsung.com>
@ 2020-06-17 17:42 ` Matthew Wilcox
2020-06-18 6:56 ` Christoph Hellwig
2020-06-18 8:04 ` Matias Bjørling
5 siblings, 0 replies; 40+ messages in thread
From: Matthew Wilcox @ 2020-06-17 17:42 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, viro, bcrl, linux-fsdevel, linux-kernel, linux-aio,
io-uring, linux-block, selvakuma.s1, nj.shetty, javier.gonz
On Wed, Jun 17, 2020 at 10:53:36PM +0530, Kanchan Joshi wrote:
> This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>
> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
> of the zone to issue append. On completion 'res2' field is used to return
> zone-relative offset.
Maybe it's obvious to everyone working with zoned drives on a daily
basis, but please explain in the commit message why you need to return
the zone-relative offset to the application.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-17 17:23 ` [PATCH 3/3] io_uring: " Kanchan Joshi
@ 2020-06-17 18:55 ` Pavel Begunkov
2020-06-18 7:39 ` Damien Le Moal
1 sibling, 0 replies; 40+ messages in thread
From: Pavel Begunkov @ 2020-06-17 18:55 UTC (permalink / raw)
To: Kanchan Joshi, axboe, viro, bcrl
Cc: linux-fsdevel, linux-kernel, linux-aio, io-uring, linux-block,
selvakuma.s1, nj.shetty, javier.gonz
On 17/06/2020 20:23, Kanchan Joshi wrote:
> From: Selvakumar S <[email protected]>
>
> Introduce three new opcodes for zone-append -
>
> IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE
> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
I don't like the idea of creating a new set of opcodes for each new flavour.
Did you try to extend write*?
>
> Repurpose cqe->flags to return zone-relative offset.
>
> Signed-off-by: SelvaKumar S <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
> Signed-off-by: Javier Gonzalez <[email protected]>
> ---
> fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
> include/uapi/linux/io_uring.h | 8 ++++-
> 2 files changed, 77 insertions(+), 3 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 155f3d8..c14c873 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -649,6 +649,10 @@ struct io_kiocb {
> unsigned long fsize;
> u64 user_data;
> u32 result;
> +#ifdef CONFIG_BLK_DEV_ZONED
> + /* zone-relative offset for append, in bytes */
> + u32 append_offset;
> +#endif
What is this doing here? It's related to append only.
Should be in struct io_rw or whatever.
> u32 sequence;
>
> struct list_head link_list;
> @@ -875,6 +879,26 @@ static const struct io_op_def io_op_defs[] = {
> .hash_reg_file = 1,
> .unbound_nonreg_file = 1,
> },
> + [IORING_OP_ZONE_APPEND] = {
> + .needs_mm = 1,
> + .needs_file = 1,
> + .unbound_nonreg_file = 1,
> + .pollout = 1,
> + },
> + [IORING_OP_ZONE_APPENDV] = {
> + .async_ctx = 1,
> + .needs_mm = 1,
> + .needs_file = 1,
> + .hash_reg_file = 1,
> + .unbound_nonreg_file = 1,
> + .pollout = 1,
> + },
> + [IORING_OP_ZONE_APPEND_FIXED] = {
> + .needs_file = 1,
> + .hash_reg_file = 1,
> + .unbound_nonreg_file = 1,
> + .pollout = 1,
> + },
> };
>
> static void io_wq_submit_work(struct io_wq_work **workptr);
> @@ -1285,7 +1309,16 @@ static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags)
> if (likely(cqe)) {
> WRITE_ONCE(cqe->user_data, req->user_data);
> WRITE_ONCE(cqe->res, res);
> +#ifdef CONFIG_BLK_DEV_ZONED
> + if (req->opcode == IORING_OP_ZONE_APPEND ||
> + req->opcode == IORING_OP_ZONE_APPENDV ||
Nack, don't create special cases in common path.
> + req->opcode == IORING_OP_ZONE_APPEND_FIXED)
> + WRITE_ONCE(cqe->res2, req->append_offset);
Great, a function was asked to return @cflags arg, but it decides to
overwrite the field with its own stuff...
Why not pass it in @cflags?
> + else
> + WRITE_ONCE(cqe->flags, cflags);
> +#else
> WRITE_ONCE(cqe->flags, cflags);
> +#endif
> } else if (ctx->cq_overflow_flushed) {
> WRITE_ONCE(ctx->rings->cq_overflow,
> atomic_inc_return(&ctx->cached_cq_overflow));
> @@ -1961,6 +1994,9 @@ static void io_complete_rw_common(struct kiocb *kiocb, long res)
> static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
> {
> struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
> +#ifdef CONFIG_BLK_DEV_ZONED
> + req->append_offset = (u32)res2;
> +#endif
Don't create #ifdef hell all over the code. Try to use a function
>
> io_complete_rw_common(kiocb, res);
> io_put_req(req);
> @@ -1976,6 +2012,9 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
> if (res != req->result)
> req_set_fail_links(req);
> req->result = res;
> +#ifdef CONFIG_BLK_DEV_ZONED
> + req->append_offset = (u32)res2;
> +#endif
> if (res != -EAGAIN)
> WRITE_ONCE(req->iopoll_completed, 1);
> }
> @@ -2408,7 +2447,8 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
> u8 opcode;
>
> opcode = req->opcode;
> - if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
> + if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED ||
> + opcode == IORING_OP_ZONE_APPEND_FIXED) {
> *iovec = NULL;
> return io_import_fixed(req, rw, iter);
> }
> @@ -2417,7 +2457,8 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
> if (req->buf_index && !(req->flags & REQ_F_BUFFER_SELECT))
> return -EINVAL;
>
> - if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) {
> + if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE ||
> + opcode == IORING_OP_ZONE_APPEND) {
> if (req->flags & REQ_F_BUFFER_SELECT) {
> buf = io_rw_buffer_select(req, &sqe_len, needs_lock);
> if (IS_ERR(buf)) {
> @@ -2704,6 +2745,9 @@ static int io_write(struct io_kiocb *req, bool force_nonblock)
> req->rw.kiocb.ki_flags &= ~IOCB_NOWAIT;
>
> req->result = 0;
> +#ifdef CONFIG_BLK_DEV_ZONED
> + req->append_offset = 0;
> +#endif
> io_size = ret;
> if (req->flags & REQ_F_LINK_HEAD)
> req->result = io_size;
> @@ -2738,6 +2782,13 @@ static int io_write(struct io_kiocb *req, bool force_nonblock)
> __sb_writers_release(file_inode(req->file)->i_sb,
> SB_FREEZE_WRITE);
> }
> +#ifdef CONFIG_BLK_DEV_ZONED
> + if (req->opcode == IORING_OP_ZONE_APPEND ||
> + req->opcode == IORING_OP_ZONE_APPENDV ||
> + req->opcode == IORING_OP_ZONE_APPEND_FIXED)
> + kiocb->ki_flags |= IOCB_ZONE_APPEND;
> +#endif
> +
> kiocb->ki_flags |= IOCB_WRITE;
>
> if (!force_nonblock)
> @@ -4906,6 +4957,12 @@ static int io_req_defer_prep(struct io_kiocb *req,
> case IORING_OP_WRITEV:
> case IORING_OP_WRITE_FIXED:
> case IORING_OP_WRITE:
> +#ifdef CONFIG_BLK_DEV_ZONED
> + fallthrough;
> + case IORING_OP_ZONE_APPEND:
> + case IORING_OP_ZONE_APPENDV:
> + case IORING_OP_ZONE_APPEND_FIXED:
> +#endif
> ret = io_write_prep(req, sqe, true);
> break;
> case IORING_OP_POLL_ADD:
> @@ -5038,6 +5095,12 @@ static void io_cleanup_req(struct io_kiocb *req)
> case IORING_OP_WRITEV:
> case IORING_OP_WRITE_FIXED:
> case IORING_OP_WRITE:
> +#ifdef CONFIG_BLK_DEV_ZONED
> + fallthrough;
> + case IORING_OP_ZONE_APPEND:
> + case IORING_OP_ZONE_APPENDV:
> + case IORING_OP_ZONE_APPEND_FIXED:
> +#endif
> if (io->rw.iov != io->rw.fast_iov)
> kfree(io->rw.iov);
> break;
> @@ -5086,6 +5149,11 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> }
> ret = io_read(req, force_nonblock);
> break;
> +#ifdef CONFIG_BLK_DEV_ZONED
> + case IORING_OP_ZONE_APPEND:
> + case IORING_OP_ZONE_APPENDV:
> + case IORING_OP_ZONE_APPEND_FIXED:
> +#endif
> case IORING_OP_WRITEV:
> case IORING_OP_WRITE_FIXED:
> case IORING_OP_WRITE:
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 92c2269..6c8e932 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -130,6 +130,9 @@ enum {
> IORING_OP_PROVIDE_BUFFERS,
> IORING_OP_REMOVE_BUFFERS,
> IORING_OP_TEE,
> + IORING_OP_ZONE_APPEND,
> + IORING_OP_ZONE_APPENDV,
> + IORING_OP_ZONE_APPEND_FIXED,
>
> /* this goes last, obviously */
> IORING_OP_LAST,
> @@ -157,7 +160,10 @@ enum {
> struct io_uring_cqe {
> __u64 user_data; /* sqe->data submission passed back */
> __s32 res; /* result code for this event */
> - __u32 flags;
> + union {
> + __u32 res2; /* res2 like aio, currently used for zone-append */
> + __u32 flags;
> + };
> };
>
> /*
>
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/3] fs,block: Introduce IOCB_ZONE_APPEND and direct-io handling
2020-06-17 17:23 ` [PATCH 1/3] fs,block: Introduce IOCB_ZONE_APPEND and direct-io handling Kanchan Joshi
@ 2020-06-17 19:02 ` Pavel Begunkov
2020-06-18 7:16 ` Damien Le Moal
1 sibling, 0 replies; 40+ messages in thread
From: Pavel Begunkov @ 2020-06-17 19:02 UTC (permalink / raw)
To: Kanchan Joshi, axboe, viro, bcrl
Cc: linux-fsdevel, linux-kernel, linux-aio, io-uring, linux-block,
selvakuma.s1, nj.shetty, javier.gonz
On 17/06/2020 20:23, Kanchan Joshi wrote:
> From: Selvakumar S <[email protected]>
>
> Introduce IOCB_ZONE_APPEND flag, which is set in kiocb->ki_flags for
> zone-append. Direct I/O submission path uses this flag to send bio with
> append op. And completion path uses the same to return zone-relative
> offset to upper layer.
>
> Signed-off-by: SelvaKumar S <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
> Signed-off-by: Javier Gonzalez <[email protected]>
> ---
> fs/block_dev.c | 19 ++++++++++++++++++-
> include/linux/fs.h | 1 +
> 2 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 47860e5..4c84b4d0 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -185,6 +185,10 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb)
> /* avoid the need for a I/O completion work item */
> if (iocb->ki_flags & IOCB_DSYNC)
> op |= REQ_FUA;
> +#ifdef CONFIG_BLK_DEV_ZONED
> + if (iocb->ki_flags & IOCB_ZONE_APPEND)
> + op |= REQ_OP_ZONE_APPEND | REQ_NOMERGE;
> +#endif
> return op;
> }
>
> @@ -295,6 +299,14 @@ static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
> return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
> }
>
> +#ifdef CONFIG_BLK_DEV_ZONED
> +static inline long blkdev_bio_end_io_append(struct bio *bio)
> +{
> + return (bio->bi_iter.bi_sector %
> + blk_queue_zone_sectors(bio->bi_disk->queue)) << SECTOR_SHIFT;
IIRC, zone_size is pow2 and bit operations should suffice.
Anyway, this can use a temporary variable.
> +}
> +#endif
> +
> static void blkdev_bio_end_io(struct bio *bio)
> {
> struct blkdev_dio *dio = bio->bi_private;
> @@ -307,15 +319,20 @@ static void blkdev_bio_end_io(struct bio *bio)
> if (!dio->is_sync) {
> struct kiocb *iocb = dio->iocb;
> ssize_t ret;
> + long res = 0;
>
> if (likely(!dio->bio.bi_status)) {
> ret = dio->size;
> iocb->ki_pos += ret;
> +#ifdef CONFIG_BLK_DEV_ZONED
> + if (iocb->ki_flags & IOCB_ZONE_APPEND)
> + res = blkdev_bio_end_io_append(bio);
> +#endif
> } else {
> ret = blk_status_to_errno(dio->bio.bi_status);
> }
>
> - dio->iocb->ki_complete(iocb, ret, 0);
> + dio->iocb->ki_complete(iocb, ret, res);
> if (dio->multi_bio)
> bio_put(&dio->bio);
> } else {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6c4ab4d..dc547b9 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -315,6 +315,7 @@ enum rw_hint {
> #define IOCB_SYNC (1 << 5)
> #define IOCB_WRITE (1 << 6)
> #define IOCB_NOWAIT (1 << 7)
> +#define IOCB_ZONE_APPEND (1 << 8)
>
> struct kiocb {
> struct file *ki_filp;
>
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/3] zone-append support in aio and io-uring
2020-06-17 17:23 ` [PATCH 0/3] zone-append support in aio and io-uring Kanchan Joshi
` (3 preceding siblings ...)
2020-06-17 17:42 ` [PATCH 0/3] zone-append support in aio and io-uring Matthew Wilcox
@ 2020-06-18 6:56 ` Christoph Hellwig
2020-06-18 8:29 ` Javier González
2020-06-18 17:52 ` Kanchan Joshi
2020-06-18 8:04 ` Matias Bjørling
5 siblings, 2 replies; 40+ messages in thread
From: Christoph Hellwig @ 2020-06-18 6:56 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, viro, bcrl, linux-fsdevel, linux-kernel, linux-aio,
io-uring, linux-block, selvakuma.s1, nj.shetty, javier.gonz
On Wed, Jun 17, 2020 at 10:53:36PM +0530, Kanchan Joshi wrote:
> This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>
> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
> of the zone to issue append. On completion 'res2' field is used to return
> zone-relative offset.
>
> For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
> Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset
And what exactly are the semantics supposed to be? Remember the
unix file abstractions does not know about zones at all.
I really don't think squeezing low-level not quite block storage
protocol details into the Linux read/write path is a good idea.
What could be a useful addition is a way for O_APPEND/RWF_APPEND writes
to report where they actually wrote, as that comes close to Zone Append
while still making sense at our usual abstraction level for file I/O.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/3] fs,block: Introduce IOCB_ZONE_APPEND and direct-io handling
2020-06-17 17:23 ` [PATCH 1/3] fs,block: Introduce IOCB_ZONE_APPEND and direct-io handling Kanchan Joshi
2020-06-17 19:02 ` Pavel Begunkov
@ 2020-06-18 7:16 ` Damien Le Moal
2020-06-18 18:35 ` Kanchan Joshi
1 sibling, 1 reply; 40+ messages in thread
From: Damien Le Moal @ 2020-06-18 7:16 UTC (permalink / raw)
To: Kanchan Joshi, [email protected], [email protected],
[email protected]
Cc: [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On 2020/06/18 2:27, Kanchan Joshi wrote:
> From: Selvakumar S <[email protected]>
>
> Introduce IOCB_ZONE_APPEND flag, which is set in kiocb->ki_flags for
> zone-append. Direct I/O submission path uses this flag to send bio with
> append op. And completion path uses the same to return zone-relative
> offset to upper layer.
>
> Signed-off-by: SelvaKumar S <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
> Signed-off-by: Javier Gonzalez <[email protected]>
> ---
> fs/block_dev.c | 19 ++++++++++++++++++-
> include/linux/fs.h | 1 +
> 2 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 47860e5..4c84b4d0 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -185,6 +185,10 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb)
> /* avoid the need for a I/O completion work item */
> if (iocb->ki_flags & IOCB_DSYNC)
> op |= REQ_FUA;
> +#ifdef CONFIG_BLK_DEV_ZONED
> + if (iocb->ki_flags & IOCB_ZONE_APPEND)
> + op |= REQ_OP_ZONE_APPEND | REQ_NOMERGE;
> +#endif
No need for the #ifdef. And no need for the REQ_NOMERGE either since
REQ_OP_ZONE_APPEND requests are defined as not mergeable already.
> return op;
> }
>
> @@ -295,6 +299,14 @@ static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
> return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
> }
>
> +#ifdef CONFIG_BLK_DEV_ZONED
> +static inline long blkdev_bio_end_io_append(struct bio *bio)
> +{
> + return (bio->bi_iter.bi_sector %
> + blk_queue_zone_sectors(bio->bi_disk->queue)) << SECTOR_SHIFT;
A zone size is at most 4G sectors as defined by the queue chunk_sectors limit
(unsigned int). It means that the return value here can overflow due to the
shift, at least on 32bits arch.
And as Pavel already commented, zone sizes are power of 2 so you can use
bitmasks instead of costly divisions.
> +}
> +#endif
> +
> static void blkdev_bio_end_io(struct bio *bio)
> {
> struct blkdev_dio *dio = bio->bi_private;
> @@ -307,15 +319,20 @@ static void blkdev_bio_end_io(struct bio *bio)
> if (!dio->is_sync) {
> struct kiocb *iocb = dio->iocb;
> ssize_t ret;
> + long res = 0;
>
> if (likely(!dio->bio.bi_status)) {
> ret = dio->size;
> iocb->ki_pos += ret;
> +#ifdef CONFIG_BLK_DEV_ZONED
> + if (iocb->ki_flags & IOCB_ZONE_APPEND)
> + res = blkdev_bio_end_io_append(bio);
overflow... And no need for the #ifdef.
> +#endif
> } else {
> ret = blk_status_to_errno(dio->bio.bi_status);
> }
>
> - dio->iocb->ki_complete(iocb, ret, 0);
> + dio->iocb->ki_complete(iocb, ret, res);
ki_complete interface also needs to be changed to have a 64bit last argument to
avoid overflow.
And this all does not work anyway because __blkdev_direct_IO() and
__blkdev_direct_IO_simple() both call bio_iov_iter_get_pages() *before*
dio_bio_write_op() is called. This means that bio_iov_iter_get_pages() will not
see that it needs to get the pages for a zone append command and so will not
call __bio_iov_append_get_pages(). That leads to BIO split and potential
corruption of the user data as fragments of the kiocb may get reordered.
There is a lot more to do to the block_dev direct IO code for this to work.
> if (dio->multi_bio)
> bio_put(&dio->bio);
> } else {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6c4ab4d..dc547b9 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -315,6 +315,7 @@ enum rw_hint {
> #define IOCB_SYNC (1 << 5)
> #define IOCB_WRITE (1 << 6)
> #define IOCB_NOWAIT (1 << 7)
> +#define IOCB_ZONE_APPEND (1 << 8)
>
> struct kiocb {
> struct file *ki_filp;
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 2/3] aio: add support for zone-append
2020-06-17 17:23 ` [PATCH 2/3] aio: add support for zone-append Kanchan Joshi
@ 2020-06-18 7:33 ` Damien Le Moal
0 siblings, 0 replies; 40+ messages in thread
From: Damien Le Moal @ 2020-06-18 7:33 UTC (permalink / raw)
To: Kanchan Joshi, [email protected], [email protected],
[email protected]
Cc: [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected], Arnav Dawn
On 2020/06/18 2:27, Kanchan Joshi wrote:
> Introduce IOCB_CMD_ZONE_APPEND opcode for zone-append. On append
> completion zone-relative offset is returned using io_event->res2.
>
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Arnav Dawn <[email protected]>
> Signed-off-by: SelvaKumar S <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
> Signed-off-by: Javier Gonzalez <[email protected]>
> ---
> fs/aio.c | 8 ++++++++
> include/uapi/linux/aio_abi.h | 1 +
> 2 files changed, 9 insertions(+)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 7ecddc2..8b10a55d 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1579,6 +1579,10 @@ static int aio_write(struct kiocb *req, const struct iocb *iocb,
> __sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true);
> __sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE);
> }
> +#ifdef CONFIG_BLK_DEV_ZONED
> + if (iocb->aio_lio_opcode == IOCB_CMD_ZONE_APPEND)
> + req->ki_flags |= IOCB_ZONE_APPEND;
> +#endif
> req->ki_flags |= IOCB_WRITE;
> aio_rw_done(req, call_write_iter(file, req, &iter));
> }
> @@ -1846,6 +1850,10 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
> return aio_fsync(&req->fsync, iocb, true);
> case IOCB_CMD_POLL:
> return aio_poll(req, iocb);
> +#ifdef CONFIG_BLK_DEV_ZONED
> + case IOCB_CMD_ZONE_APPEND:
> + return aio_write(&req->rw, iocb, false, compat);
> +#endif
> default:
> pr_debug("invalid aio operation %d\n", iocb->aio_lio_opcode);
> return -EINVAL;
> diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
> index 8387e0a..541d96a 100644
> --- a/include/uapi/linux/aio_abi.h
> +++ b/include/uapi/linux/aio_abi.h
> @@ -43,6 +43,7 @@ enum {
> IOCB_CMD_NOOP = 6,
> IOCB_CMD_PREADV = 7,
> IOCB_CMD_PWRITEV = 8,
> + IOCB_CMD_ZONE_APPEND = 9,
> };
>
> /*
>
No need for all the #ifdefs.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-17 17:23 ` [PATCH 3/3] io_uring: " Kanchan Joshi
2020-06-17 18:55 ` Pavel Begunkov
@ 2020-06-18 7:39 ` Damien Le Moal
2020-06-18 8:35 ` [email protected]
1 sibling, 1 reply; 40+ messages in thread
From: Damien Le Moal @ 2020-06-18 7:39 UTC (permalink / raw)
To: Kanchan Joshi, [email protected], [email protected],
[email protected]
Cc: [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On 2020/06/18 2:27, Kanchan Joshi wrote:
> From: Selvakumar S <[email protected]>
>
> Introduce three new opcodes for zone-append -
>
> IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE
> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>
> Repurpose cqe->flags to return zone-relative offset.
>
> Signed-off-by: SelvaKumar S <[email protected]>
> Signed-off-by: Kanchan Joshi <[email protected]>
> Signed-off-by: Nitesh Shetty <[email protected]>
> Signed-off-by: Javier Gonzalez <[email protected]>
> ---
> fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
> include/uapi/linux/io_uring.h | 8 ++++-
> 2 files changed, 77 insertions(+), 3 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 155f3d8..c14c873 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -649,6 +649,10 @@ struct io_kiocb {
> unsigned long fsize;
> u64 user_data;
> u32 result;
> +#ifdef CONFIG_BLK_DEV_ZONED
> + /* zone-relative offset for append, in bytes */
> + u32 append_offset;
this can overflow. u64 is needed.
> +#endif
> u32 sequence;
>
> struct list_head link_list;
> @@ -875,6 +879,26 @@ static const struct io_op_def io_op_defs[] = {
> .hash_reg_file = 1,
> .unbound_nonreg_file = 1,
> },
> + [IORING_OP_ZONE_APPEND] = {
> + .needs_mm = 1,
> + .needs_file = 1,
> + .unbound_nonreg_file = 1,
> + .pollout = 1,
> + },
> + [IORING_OP_ZONE_APPENDV] = {
> + .async_ctx = 1,
> + .needs_mm = 1,
> + .needs_file = 1,
> + .hash_reg_file = 1,
> + .unbound_nonreg_file = 1,
> + .pollout = 1,
> + },
> + [IORING_OP_ZONE_APPEND_FIXED] = {
> + .needs_file = 1,
> + .hash_reg_file = 1,
> + .unbound_nonreg_file = 1,
> + .pollout = 1,
> + },
> };
>
> static void io_wq_submit_work(struct io_wq_work **workptr);
> @@ -1285,7 +1309,16 @@ static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags)
> if (likely(cqe)) {
> WRITE_ONCE(cqe->user_data, req->user_data);
> WRITE_ONCE(cqe->res, res);
> +#ifdef CONFIG_BLK_DEV_ZONED
> + if (req->opcode == IORING_OP_ZONE_APPEND ||
> + req->opcode == IORING_OP_ZONE_APPENDV ||
> + req->opcode == IORING_OP_ZONE_APPEND_FIXED)
> + WRITE_ONCE(cqe->res2, req->append_offset);
> + else
> + WRITE_ONCE(cqe->flags, cflags);
> +#else
> WRITE_ONCE(cqe->flags, cflags);
> +#endif
> } else if (ctx->cq_overflow_flushed) {
> WRITE_ONCE(ctx->rings->cq_overflow,
> atomic_inc_return(&ctx->cached_cq_overflow));
> @@ -1961,6 +1994,9 @@ static void io_complete_rw_common(struct kiocb *kiocb, long res)
> static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
> {
> struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
> +#ifdef CONFIG_BLK_DEV_ZONED
> + req->append_offset = (u32)res2;
> +#endif
>
> io_complete_rw_common(kiocb, res);
> io_put_req(req);
> @@ -1976,6 +2012,9 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
> if (res != req->result)
> req_set_fail_links(req);
> req->result = res;
> +#ifdef CONFIG_BLK_DEV_ZONED
> + req->append_offset = (u32)res2;
> +#endif
> if (res != -EAGAIN)
> WRITE_ONCE(req->iopoll_completed, 1);
> }
> @@ -2408,7 +2447,8 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
> u8 opcode;
>
> opcode = req->opcode;
> - if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
> + if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED ||
> + opcode == IORING_OP_ZONE_APPEND_FIXED) {
> *iovec = NULL;
> return io_import_fixed(req, rw, iter);
> }
> @@ -2417,7 +2457,8 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
> if (req->buf_index && !(req->flags & REQ_F_BUFFER_SELECT))
> return -EINVAL;
>
> - if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) {
> + if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE ||
> + opcode == IORING_OP_ZONE_APPEND) {
> if (req->flags & REQ_F_BUFFER_SELECT) {
> buf = io_rw_buffer_select(req, &sqe_len, needs_lock);
> if (IS_ERR(buf)) {
> @@ -2704,6 +2745,9 @@ static int io_write(struct io_kiocb *req, bool force_nonblock)
> req->rw.kiocb.ki_flags &= ~IOCB_NOWAIT;
>
> req->result = 0;
> +#ifdef CONFIG_BLK_DEV_ZONED
> + req->append_offset = 0;
> +#endif
> io_size = ret;
> if (req->flags & REQ_F_LINK_HEAD)
> req->result = io_size;
> @@ -2738,6 +2782,13 @@ static int io_write(struct io_kiocb *req, bool force_nonblock)
> __sb_writers_release(file_inode(req->file)->i_sb,
> SB_FREEZE_WRITE);
> }
> +#ifdef CONFIG_BLK_DEV_ZONED
> + if (req->opcode == IORING_OP_ZONE_APPEND ||
> + req->opcode == IORING_OP_ZONE_APPENDV ||
> + req->opcode == IORING_OP_ZONE_APPEND_FIXED)
> + kiocb->ki_flags |= IOCB_ZONE_APPEND;
> +#endif
> +
> kiocb->ki_flags |= IOCB_WRITE;
>
> if (!force_nonblock)
> @@ -4906,6 +4957,12 @@ static int io_req_defer_prep(struct io_kiocb *req,
> case IORING_OP_WRITEV:
> case IORING_OP_WRITE_FIXED:
> case IORING_OP_WRITE:
> +#ifdef CONFIG_BLK_DEV_ZONED
> + fallthrough;
> + case IORING_OP_ZONE_APPEND:
> + case IORING_OP_ZONE_APPENDV:
> + case IORING_OP_ZONE_APPEND_FIXED:
> +#endif
> ret = io_write_prep(req, sqe, true);
> break;
> case IORING_OP_POLL_ADD:
> @@ -5038,6 +5095,12 @@ static void io_cleanup_req(struct io_kiocb *req)
> case IORING_OP_WRITEV:
> case IORING_OP_WRITE_FIXED:
> case IORING_OP_WRITE:
> +#ifdef CONFIG_BLK_DEV_ZONED
> + fallthrough;
> + case IORING_OP_ZONE_APPEND:
> + case IORING_OP_ZONE_APPENDV:
> + case IORING_OP_ZONE_APPEND_FIXED:
> +#endif
> if (io->rw.iov != io->rw.fast_iov)
> kfree(io->rw.iov);
> break;
> @@ -5086,6 +5149,11 @@ static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> }
> ret = io_read(req, force_nonblock);
> break;
> +#ifdef CONFIG_BLK_DEV_ZONED
> + case IORING_OP_ZONE_APPEND:
> + case IORING_OP_ZONE_APPENDV:
> + case IORING_OP_ZONE_APPEND_FIXED:
> +#endif
> case IORING_OP_WRITEV:
> case IORING_OP_WRITE_FIXED:
> case IORING_OP_WRITE:
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 92c2269..6c8e932 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -130,6 +130,9 @@ enum {
> IORING_OP_PROVIDE_BUFFERS,
> IORING_OP_REMOVE_BUFFERS,
> IORING_OP_TEE,
> + IORING_OP_ZONE_APPEND,
> + IORING_OP_ZONE_APPENDV,
> + IORING_OP_ZONE_APPEND_FIXED,
>
> /* this goes last, obviously */
> IORING_OP_LAST,
> @@ -157,7 +160,10 @@ enum {
> struct io_uring_cqe {
> __u64 user_data; /* sqe->data submission passed back */
> __s32 res; /* result code for this event */
> - __u32 flags;
> + union {
> + __u32 res2; /* res2 like aio, currently used for zone-append */
> + __u32 flags;
> + };
> };
>
> /*
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/3] zone-append support in aio and io-uring
2020-06-17 17:23 ` [PATCH 0/3] zone-append support in aio and io-uring Kanchan Joshi
` (4 preceding siblings ...)
2020-06-18 6:56 ` Christoph Hellwig
@ 2020-06-18 8:04 ` Matias Bjørling
2020-06-18 8:27 ` Javier González
` (2 more replies)
5 siblings, 3 replies; 40+ messages in thread
From: Matias Bjørling @ 2020-06-18 8:04 UTC (permalink / raw)
To: Kanchan Joshi, axboe, viro, bcrl
Cc: linux-fsdevel, linux-kernel, linux-aio, io-uring, linux-block,
selvakuma.s1, nj.shetty, javier.gonz, Damien Le Moal, Keith Busch,
Christoph Hellwig
On 17/06/2020 19.23, Kanchan Joshi wrote:
> This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>
> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
> of the zone to issue append. On completion 'res2' field is used to return
> zone-relative offset.
>
> For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
> Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset
Please provide a pointers to applications that are updated and ready to
take advantage of zone append.
I do not believe it's beneficial at this point to change the libaio API,
applications that would want to use this API, should anyway switch to
use io_uring.
Please also note that applications and libraries that want to take
advantage of zone append, can already use the zonefs file-system, as it
will use the zone append command when applicable.
> Kanchan Joshi (1):
> aio: add support for zone-append
>
> Selvakumar S (2):
> fs,block: Introduce IOCB_ZONE_APPEND and direct-io handling
> io_uring: add support for zone-append
>
> fs/aio.c | 8 +++++
> fs/block_dev.c | 19 +++++++++++-
> fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
> include/linux/fs.h | 1 +
> include/uapi/linux/aio_abi.h | 1 +
> include/uapi/linux/io_uring.h | 8 ++++-
> 6 files changed, 105 insertions(+), 4 deletions(-)
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/3] zone-append support in aio and io-uring
2020-06-18 8:04 ` Matias Bjørling
@ 2020-06-18 8:27 ` Javier González
2020-06-18 8:32 ` Matias Bjørling
2020-06-18 14:16 ` Christoph Hellwig
2020-06-18 19:21 ` Kanchan Joshi
2 siblings, 1 reply; 40+ messages in thread
From: Javier González @ 2020-06-18 8:27 UTC (permalink / raw)
To: Matias Bjørling
Cc: Kanchan Joshi, axboe, viro, bcrl, linux-fsdevel, linux-kernel,
linux-aio, io-uring, linux-block, selvakuma.s1, nj.shetty,
Damien Le Moal, Keith Busch, Christoph Hellwig
On 18.06.2020 10:04, Matias Bjørling wrote:
>On 17/06/2020 19.23, Kanchan Joshi wrote:
>>This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>>
>>For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
>>of the zone to issue append. On completion 'res2' field is used to return
>>zone-relative offset.
>>
>>For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset
>
>Please provide a pointers to applications that are updated and ready
>to take advantage of zone append.
Good point. We are posting a RFC with fio support for append. We wanted
to start the conversation here before.
We can post a fork for improve the reviews in V2.
>
>I do not believe it's beneficial at this point to change the libaio
>API, applications that would want to use this API, should anyway
>switch to use io_uring.
I can see why you say this, but isn't it too restrictive to directly
drop libaio support? We can split the patches and merge uring first- no
proble,.
>
>Please also note that applications and libraries that want to take
>advantage of zone append, can already use the zonefs file-system, as
>it will use the zone append command when applicable.
Sure. There are different paths available already, which is great. We
have use cases for uring and would like to enable them too.
Thanks,
Javier
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/3] zone-append support in aio and io-uring
2020-06-18 6:56 ` Christoph Hellwig
@ 2020-06-18 8:29 ` Javier González
2020-06-18 17:52 ` Kanchan Joshi
1 sibling, 0 replies; 40+ messages in thread
From: Javier González @ 2020-06-18 8:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kanchan Joshi, axboe, viro, bcrl, linux-fsdevel, linux-kernel,
linux-aio, io-uring, linux-block, selvakuma.s1, nj.shetty
On 17.06.2020 23:56, Christoph Hellwig wrote:
>On Wed, Jun 17, 2020 at 10:53:36PM +0530, Kanchan Joshi wrote:
>> This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>>
>> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
>> of the zone to issue append. On completion 'res2' field is used to return
>> zone-relative offset.
>>
>> For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>> Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset
>
>And what exactly are the semantics supposed to be? Remember the
>unix file abstractions does not know about zones at all.
>
>I really don't think squeezing low-level not quite block storage
>protocol details into the Linux read/write path is a good idea.
>
>What could be a useful addition is a way for O_APPEND/RWF_APPEND writes
>to report where they actually wrote, as that comes close to Zone Append
>while still making sense at our usual abstraction level for file I/O.
Makes sense. We will look into this for a V2.
Thanks,
Javier
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/3] zone-append support in aio and io-uring
2020-06-18 8:27 ` Javier González
@ 2020-06-18 8:32 ` Matias Bjørling
2020-06-18 8:39 ` Javier González
0 siblings, 1 reply; 40+ messages in thread
From: Matias Bjørling @ 2020-06-18 8:32 UTC (permalink / raw)
To: Javier González
Cc: Kanchan Joshi, axboe, viro, bcrl, linux-fsdevel, linux-kernel,
linux-aio, io-uring, linux-block, selvakuma.s1, nj.shetty,
Damien Le Moal, Keith Busch, Christoph Hellwig
On 18/06/2020 10.27, Javier González wrote:
> On 18.06.2020 10:04, Matias Bjørling wrote:
>> On 17/06/2020 19.23, Kanchan Joshi wrote:
>>> This patchset enables issuing zone-append using aio and io-uring
>>> direct-io interface.
>>>
>>> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application
>>> uses start LBA
>>> of the zone to issue append. On completion 'res2' field is used to
>>> return
>>> zone-relative offset.
>>>
>>> For io-uring, this introduces three opcodes:
>>> IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>> Since io_uring does not have aio-like res2, cqe->flags are
>>> repurposed to return zone-relative offset
>>
>> Please provide a pointers to applications that are updated and ready
>> to take advantage of zone append.
>
> Good point. We are posting a RFC with fio support for append. We wanted
> to start the conversation here before.
>
> We can post a fork for improve the reviews in V2.
Christoph's response points that it is not exactly clear how this
matches with the POSIX API.
fio support is great - but I was thinking along the lines of
applications that not only benchmark performance. fio should be part of
the supported applications, but should not be the sole reason the API is
added.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-18 7:39 ` Damien Le Moal
@ 2020-06-18 8:35 ` [email protected]
2020-06-18 8:47 ` Damien Le Moal
0 siblings, 1 reply; 40+ messages in thread
From: [email protected] @ 2020-06-18 8:35 UTC (permalink / raw)
To: Damien Le Moal
Cc: Kanchan Joshi, [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On 18.06.2020 07:39, Damien Le Moal wrote:
>On 2020/06/18 2:27, Kanchan Joshi wrote:
>> From: Selvakumar S <[email protected]>
>>
>> Introduce three new opcodes for zone-append -
>>
>> IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE
>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>
>> Repurpose cqe->flags to return zone-relative offset.
>>
>> Signed-off-by: SelvaKumar S <[email protected]>
>> Signed-off-by: Kanchan Joshi <[email protected]>
>> Signed-off-by: Nitesh Shetty <[email protected]>
>> Signed-off-by: Javier Gonzalez <[email protected]>
>> ---
>> fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
>> include/uapi/linux/io_uring.h | 8 ++++-
>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 155f3d8..c14c873 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -649,6 +649,10 @@ struct io_kiocb {
>> unsigned long fsize;
>> u64 user_data;
>> u32 result;
>> +#ifdef CONFIG_BLK_DEV_ZONED
>> + /* zone-relative offset for append, in bytes */
>> + u32 append_offset;
>
>this can overflow. u64 is needed.
We chose to do it this way to start with because struct io_uring_cqe
only has space for u32 when we reuse the flags.
We can of course create a new cqe structure, but that will come with
larger changes to io_uring for supporting append.
Do you believe this is a better approach?
Javier
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/3] zone-append support in aio and io-uring
2020-06-18 8:32 ` Matias Bjørling
@ 2020-06-18 8:39 ` Javier González
2020-06-18 8:46 ` Matias Bjørling
0 siblings, 1 reply; 40+ messages in thread
From: Javier González @ 2020-06-18 8:39 UTC (permalink / raw)
To: Matias Bjørling
Cc: Kanchan Joshi, axboe, viro, bcrl, linux-fsdevel, linux-kernel,
linux-aio, io-uring, linux-block, selvakuma.s1, nj.shetty,
Damien Le Moal, Keith Busch, Christoph Hellwig
On 18.06.2020 10:32, Matias Bjørling wrote:
>On 18/06/2020 10.27, Javier González wrote:
>>On 18.06.2020 10:04, Matias Bjørling wrote:
>>>On 17/06/2020 19.23, Kanchan Joshi wrote:
>>>>This patchset enables issuing zone-append using aio and io-uring
>>>>direct-io interface.
>>>>
>>>>For aio, this introduces opcode IOCB_CMD_ZONE_APPEND.
>>>>Application uses start LBA
>>>>of the zone to issue append. On completion 'res2' field is used
>>>>to return
>>>>zone-relative offset.
>>>>
>>>>For io-uring, this introduces three opcodes:
>>>>IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>>>Since io_uring does not have aio-like res2, cqe->flags are
>>>>repurposed to return zone-relative offset
>>>
>>>Please provide a pointers to applications that are updated and
>>>ready to take advantage of zone append.
>>
>>Good point. We are posting a RFC with fio support for append. We wanted
>>to start the conversation here before.
>>
>>We can post a fork for improve the reviews in V2.
>
>Christoph's response points that it is not exactly clear how this
>matches with the POSIX API.
Yes. We will address this.
>
>fio support is great - but I was thinking along the lines of
>applications that not only benchmark performance. fio should be part
>of the supported applications, but should not be the sole reason the
>API is added.
Agree. It is a process with different steps. We definitely want to have
the right kernel interface before pushing any changes to libraries and /
or applications. These will come as the interface becomes more stable.
To start with xNVMe will be leveraging this new path. A number of
customers are leveraging the xNVMe API for their applications already.
Thanks,
Javier
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/3] zone-append support in aio and io-uring
2020-06-18 8:39 ` Javier González
@ 2020-06-18 8:46 ` Matias Bjørling
0 siblings, 0 replies; 40+ messages in thread
From: Matias Bjørling @ 2020-06-18 8:46 UTC (permalink / raw)
To: Javier González
Cc: Kanchan Joshi, axboe, viro, bcrl, linux-fsdevel, linux-kernel,
linux-aio, io-uring, linux-block, selvakuma.s1, nj.shetty,
Damien Le Moal, Keith Busch, Christoph Hellwig
On 18/06/2020 10.39, Javier González wrote:
> On 18.06.2020 10:32, Matias Bjørling wrote:
>> On 18/06/2020 10.27, Javier González wrote:
>>> On 18.06.2020 10:04, Matias Bjørling wrote:
>>>> On 17/06/2020 19.23, Kanchan Joshi wrote:
>>>>> This patchset enables issuing zone-append using aio and io-uring
>>>>> direct-io interface.
>>>>>
>>>>> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application
>>>>> uses start LBA
>>>>> of the zone to issue append. On completion 'res2' field is used to
>>>>> return
>>>>> zone-relative offset.
>>>>>
>>>>> For io-uring, this introduces three opcodes:
>>>>> IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>>>> Since io_uring does not have aio-like res2, cqe->flags are
>>>>> repurposed to return zone-relative offset
>>>>
>>>> Please provide a pointers to applications that are updated and
>>>> ready to take advantage of zone append.
>>>
>>> Good point. We are posting a RFC with fio support for append. We wanted
>>> to start the conversation here before.
>>>
>>> We can post a fork for improve the reviews in V2.
>>
>> Christoph's response points that it is not exactly clear how this
>> matches with the POSIX API.
>
> Yes. We will address this.
>>
>> fio support is great - but I was thinking along the lines of
>> applications that not only benchmark performance. fio should be part
>> of the supported applications, but should not be the sole reason the
>> API is added.
>
> Agree. It is a process with different steps. We definitely want to have
> the right kernel interface before pushing any changes to libraries and /
> or applications. These will come as the interface becomes more stable.
>
> To start with xNVMe will be leveraging this new path. A number of
> customers are leveraging the xNVMe API for their applications already.
Heh, let me be even more specific - open-source applications, that is
outside of fio (or any other benchmarking application), and libraries
that acts as a mediator between two APIs.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-18 8:35 ` [email protected]
@ 2020-06-18 8:47 ` Damien Le Moal
2020-06-18 9:11 ` [email protected]
0 siblings, 1 reply; 40+ messages in thread
From: Damien Le Moal @ 2020-06-18 8:47 UTC (permalink / raw)
To: [email protected]
Cc: Kanchan Joshi, [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On 2020/06/18 17:35, [email protected] wrote:
> On 18.06.2020 07:39, Damien Le Moal wrote:
>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>> From: Selvakumar S <[email protected]>
>>>
>>> Introduce three new opcodes for zone-append -
>>>
>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE
>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>
>>> Repurpose cqe->flags to return zone-relative offset.
>>>
>>> Signed-off-by: SelvaKumar S <[email protected]>
>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>> Signed-off-by: Nitesh Shetty <[email protected]>
>>> Signed-off-by: Javier Gonzalez <[email protected]>
>>> ---
>>> fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
>>> include/uapi/linux/io_uring.h | 8 ++++-
>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 155f3d8..c14c873 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>> unsigned long fsize;
>>> u64 user_data;
>>> u32 result;
>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>> + /* zone-relative offset for append, in bytes */
>>> + u32 append_offset;
>>
>> this can overflow. u64 is needed.
>
> We chose to do it this way to start with because struct io_uring_cqe
> only has space for u32 when we reuse the flags.
>
> We can of course create a new cqe structure, but that will come with
> larger changes to io_uring for supporting append.
>
> Do you believe this is a better approach?
The problem is that zone size are 32 bits in the kernel, as a number of sectors.
So any device that has a zone size smaller or equal to 2^31 512B sectors can be
accepted. Using a zone relative offset in bytes for returning zone append result
is OK-ish, but to match the kernel supported range of possible zone size, you
need 31+9 bits... 32 does not cut it.
Since you need a 64-bit sized result, I would also prefer that you drop the zone
relative offset as a result and return the absolute offset instead. That makes
life easier for the applications since the zone append requests also must use
absolute offsets for zone start. An absolute offset as a result becomes
consistent with that and all other read/write system calls that all use absolute
offsets (seek() is the only one that I know of that can use a relative offset,
but that is not an IO system call).
>
> Javier
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-18 8:47 ` Damien Le Moal
@ 2020-06-18 9:11 ` [email protected]
2020-06-19 9:41 ` [email protected]
0 siblings, 1 reply; 40+ messages in thread
From: [email protected] @ 2020-06-18 9:11 UTC (permalink / raw)
To: Damien Le Moal
Cc: Kanchan Joshi, [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On 18.06.2020 08:47, Damien Le Moal wrote:
>On 2020/06/18 17:35, [email protected] wrote:
>> On 18.06.2020 07:39, Damien Le Moal wrote:
>>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>> From: Selvakumar S <[email protected]>
>>>>
>>>> Introduce three new opcodes for zone-append -
>>>>
>>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE
>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>
>>>> Repurpose cqe->flags to return zone-relative offset.
>>>>
>>>> Signed-off-by: SelvaKumar S <[email protected]>
>>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>>> Signed-off-by: Nitesh Shetty <[email protected]>
>>>> Signed-off-by: Javier Gonzalez <[email protected]>
>>>> ---
>>>> fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>> index 155f3d8..c14c873 100644
>>>> --- a/fs/io_uring.c
>>>> +++ b/fs/io_uring.c
>>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>>> unsigned long fsize;
>>>> u64 user_data;
>>>> u32 result;
>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>> + /* zone-relative offset for append, in bytes */
>>>> + u32 append_offset;
>>>
>>> this can overflow. u64 is needed.
>>
>> We chose to do it this way to start with because struct io_uring_cqe
>> only has space for u32 when we reuse the flags.
>>
>> We can of course create a new cqe structure, but that will come with
>> larger changes to io_uring for supporting append.
>>
>> Do you believe this is a better approach?
>
>The problem is that zone size are 32 bits in the kernel, as a number of sectors.
>So any device that has a zone size smaller or equal to 2^31 512B sectors can be
>accepted. Using a zone relative offset in bytes for returning zone append result
>is OK-ish, but to match the kernel supported range of possible zone size, you
>need 31+9 bits... 32 does not cut it.
Agree. Our initial assumption was that u32 would cover current zone size
requirements, but if this is a no-go, we will take the longer path.
>
>Since you need a 64-bit sized result, I would also prefer that you drop the zone
>relative offset as a result and return the absolute offset instead. That makes
>life easier for the applications since the zone append requests also must use
>absolute offsets for zone start. An absolute offset as a result becomes
>consistent with that and all other read/write system calls that all use absolute
>offsets (seek() is the only one that I know of that can use a relative offset,
>but that is not an IO system call).
Agree. Using relative offsets was a product of reusing the existing u32.
If we move to u64, there is no need to do an extra transformation.
Thanks Damien!
Javier
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/3] zone-append support in aio and io-uring
2020-06-18 8:04 ` Matias Bjørling
2020-06-18 8:27 ` Javier González
@ 2020-06-18 14:16 ` Christoph Hellwig
2020-06-18 19:21 ` Kanchan Joshi
2 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2020-06-18 14:16 UTC (permalink / raw)
To: Matias Bjørling
Cc: Kanchan Joshi, axboe, viro, bcrl, linux-fsdevel, linux-kernel,
linux-aio, io-uring, linux-block, selvakuma.s1, nj.shetty,
javier.gonz, Damien Le Moal, Keith Busch, Christoph Hellwig
On Thu, Jun 18, 2020 at 10:04:32AM +0200, Matias Bjørling wrote:
> Please provide a pointers to applications that are updated and ready to take
> advantage of zone append.
That is a pretty high bar for kernel APIs that we don't otherwise
apply unless seriously in doubt.
> I do not believe it's beneficial at this point to change the libaio API,
> applications that would want to use this API, should anyway switch to use
> io_uring.
I think that really depends on the amount of churn required. We
absolutely can expose things like small additional flags or simple
new operations, as rewriting application to different APIs is not
exactly trivial. On the other hand we really shouldn't do huge
additions to the machinery.
> Please also note that applications and libraries that want to take advantage
> of zone append, can already use the zonefs file-system, as it will use the
> zone append command when applicable.
Not really. While we already use Zone Append in Zonefs for some cases,
we can't fully take advantage of the scalability of Zone Append. For
that we'd need a way to return the file position where an O_APPEND
write actually landed, as suggested in my earlier mail. Which I think
is a very useful addition, and Damien and I had looked into adding
it both for zonefs and normal file systems, but didn't get around to
doing the work yet.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/3] zone-append support in aio and io-uring
2020-06-18 6:56 ` Christoph Hellwig
2020-06-18 8:29 ` Javier González
@ 2020-06-18 17:52 ` Kanchan Joshi
2020-06-19 3:08 ` Damien Le Moal
2020-06-19 7:56 ` Christoph Hellwig
1 sibling, 2 replies; 40+ messages in thread
From: Kanchan Joshi @ 2020-06-18 17:52 UTC (permalink / raw)
To: Christoph Hellwig
Cc: axboe, viro, bcrl, linux-fsdevel, linux-kernel, linux-aio,
io-uring, linux-block, selvakuma.s1, nj.shetty, javier.gonz
[-- Attachment #1: Type: text/plain, Size: 1754 bytes --]
On Wed, Jun 17, 2020 at 11:56:34PM -0700, Christoph Hellwig wrote:
>On Wed, Jun 17, 2020 at 10:53:36PM +0530, Kanchan Joshi wrote:
>> This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>>
>> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
>> of the zone to issue append. On completion 'res2' field is used to return
>> zone-relative offset.
>>
>> For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>> Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset
>
>And what exactly are the semantics supposed to be? Remember the
>unix file abstractions does not know about zones at all.
>
>I really don't think squeezing low-level not quite block storage
>protocol details into the Linux read/write path is a good idea.
I was thinking of raw block-access to zone device rather than pristine file
abstraction. And in that context, semantics, at this point, are unchanged
(i.e. same as direct writes) while flexibility of async-interface gets
added.
Synchronous-writes on single-zone sound fine, but synchronous-appends on
single-zone do not sound that fine.
>What could be a useful addition is a way for O_APPEND/RWF_APPEND writes
>to report where they actually wrote, as that comes close to Zone Append
>while still making sense at our usual abstraction level for file I/O.
Thanks for suggesting this. O and RWF_APPEND may not go well with block
access as end-of-file will be picked from dev inode. But perhaps a new
flag like RWF_ZONE_APPEND can help to transform writes (aio or uring)
into append without introducing new opcodes.
And, I think, this can fit fine on file-abstraction of ZoneFS as well.
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/3] fs,block: Introduce IOCB_ZONE_APPEND and direct-io handling
2020-06-18 7:16 ` Damien Le Moal
@ 2020-06-18 18:35 ` Kanchan Joshi
0 siblings, 0 replies; 40+ messages in thread
From: Kanchan Joshi @ 2020-06-18 18:35 UTC (permalink / raw)
To: Damien Le Moal
Cc: [email protected], [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
[-- Attachment #1: Type: text/plain, Size: 3525 bytes --]
On Thu, Jun 18, 2020 at 07:16:19AM +0000, Damien Le Moal wrote:
>On 2020/06/18 2:27, Kanchan Joshi wrote:
>> From: Selvakumar S <[email protected]>
>>
>> Introduce IOCB_ZONE_APPEND flag, which is set in kiocb->ki_flags for
>> zone-append. Direct I/O submission path uses this flag to send bio with
>> append op. And completion path uses the same to return zone-relative
>> offset to upper layer.
>>
>> Signed-off-by: SelvaKumar S <[email protected]>
>> Signed-off-by: Kanchan Joshi <[email protected]>
>> Signed-off-by: Nitesh Shetty <[email protected]>
>> Signed-off-by: Javier Gonzalez <[email protected]>
>> ---
>> fs/block_dev.c | 19 ++++++++++++++++++-
>> include/linux/fs.h | 1 +
>> 2 files changed, 19 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 47860e5..4c84b4d0 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -185,6 +185,10 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb)
>> /* avoid the need for a I/O completion work item */
>> if (iocb->ki_flags & IOCB_DSYNC)
>> op |= REQ_FUA;
>> +#ifdef CONFIG_BLK_DEV_ZONED
>> + if (iocb->ki_flags & IOCB_ZONE_APPEND)
>> + op |= REQ_OP_ZONE_APPEND | REQ_NOMERGE;
>> +#endif
>
>No need for the #ifdef. And no need for the REQ_NOMERGE either since
>REQ_OP_ZONE_APPEND requests are defined as not mergeable already.
>
>> return op;
>> }
>>
>> @@ -295,6 +299,14 @@ static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
>> return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
>> }
>>
>> +#ifdef CONFIG_BLK_DEV_ZONED
>> +static inline long blkdev_bio_end_io_append(struct bio *bio)
>> +{
>> + return (bio->bi_iter.bi_sector %
>> + blk_queue_zone_sectors(bio->bi_disk->queue)) << SECTOR_SHIFT;
>
>A zone size is at most 4G sectors as defined by the queue chunk_sectors limit
>(unsigned int). It means that the return value here can overflow due to the
>shift, at least on 32bits arch.
>
>And as Pavel already commented, zone sizes are power of 2 so you can use
>bitmasks instead of costly divisions.
>
>> +}
>> +#endif
>> +
>> static void blkdev_bio_end_io(struct bio *bio)
>> {
>> struct blkdev_dio *dio = bio->bi_private;
>> @@ -307,15 +319,20 @@ static void blkdev_bio_end_io(struct bio *bio)
>> if (!dio->is_sync) {
>> struct kiocb *iocb = dio->iocb;
>> ssize_t ret;
>> + long res = 0;
>>
>> if (likely(!dio->bio.bi_status)) {
>> ret = dio->size;
>> iocb->ki_pos += ret;
>> +#ifdef CONFIG_BLK_DEV_ZONED
>> + if (iocb->ki_flags & IOCB_ZONE_APPEND)
>> + res = blkdev_bio_end_io_append(bio);
>
>overflow... And no need for the #ifdef.
>
>> +#endif
>> } else {
>> ret = blk_status_to_errno(dio->bio.bi_status);
>> }
>>
>> - dio->iocb->ki_complete(iocb, ret, 0);
>> + dio->iocb->ki_complete(iocb, ret, res);
>
>ki_complete interface also needs to be changed to have a 64bit last argument to
>avoid overflow.
>
>And this all does not work anyway because __blkdev_direct_IO() and
>__blkdev_direct_IO_simple() both call bio_iov_iter_get_pages() *before*
>dio_bio_write_op() is called. This means that bio_iov_iter_get_pages() will not
>see that it needs to get the pages for a zone append command and so will not
>call __bio_iov_append_get_pages(). That leads to BIO split and potential
>corruption of the user data as fragments of the kiocb may get reordered.
>
>There is a lot more to do to the block_dev direct IO code for this to work.
Thanks a lot for the great review. Very helpful. We'll fix in V2.
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/3] zone-append support in aio and io-uring
2020-06-18 8:04 ` Matias Bjørling
2020-06-18 8:27 ` Javier González
2020-06-18 14:16 ` Christoph Hellwig
@ 2020-06-18 19:21 ` Kanchan Joshi
2020-06-18 20:04 ` Matias Bjørling
2 siblings, 1 reply; 40+ messages in thread
From: Kanchan Joshi @ 2020-06-18 19:21 UTC (permalink / raw)
To: Matias Bjørling
Cc: axboe, viro, bcrl, linux-fsdevel, linux-kernel, linux-aio,
io-uring, linux-block, selvakuma.s1, nj.shetty, javier.gonz,
Damien Le Moal, Keith Busch, Christoph Hellwig
[-- Attachment #1: Type: text/plain, Size: 1359 bytes --]
On Thu, Jun 18, 2020 at 10:04:32AM +0200, Matias Bjørling wrote:
>On 17/06/2020 19.23, Kanchan Joshi wrote:
>>This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>>
>>For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
>>of the zone to issue append. On completion 'res2' field is used to return
>>zone-relative offset.
>>
>>For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset
>
>Please provide a pointers to applications that are updated and ready
>to take advantage of zone append.
>
>I do not believe it's beneficial at this point to change the libaio
>API, applications that would want to use this API, should anyway
>switch to use io_uring.
>
>Please also note that applications and libraries that want to take
>advantage of zone append, can already use the zonefs file-system, as
>it will use the zone append command when applicable.
AFAIK, zonefs uses append while serving synchronous I/O. And append bio
is waited upon synchronously. That maybe serving some purpose I do
not know currently. But it seems applications using zonefs file
abstraction will get benefitted if they could use the append themselves to
carry the I/O, asynchronously.
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/3] zone-append support in aio and io-uring
2020-06-18 19:21 ` Kanchan Joshi
@ 2020-06-18 20:04 ` Matias Bjørling
2020-06-19 1:03 ` Damien Le Moal
0 siblings, 1 reply; 40+ messages in thread
From: Matias Bjørling @ 2020-06-18 20:04 UTC (permalink / raw)
To: Kanchan Joshi
Cc: axboe, viro, bcrl, linux-fsdevel, linux-kernel, linux-aio,
io-uring, linux-block, selvakuma.s1, nj.shetty, javier.gonz,
Damien Le Moal, Keith Busch, Christoph Hellwig
On 18/06/2020 21.21, Kanchan Joshi wrote:
> On Thu, Jun 18, 2020 at 10:04:32AM +0200, Matias Bjørling wrote:
>> On 17/06/2020 19.23, Kanchan Joshi wrote:
>>> This patchset enables issuing zone-append using aio and io-uring
>>> direct-io interface.
>>>
>>> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application
>>> uses start LBA
>>> of the zone to issue append. On completion 'res2' field is used to
>>> return
>>> zone-relative offset.
>>>
>>> For io-uring, this introduces three opcodes:
>>> IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>> Since io_uring does not have aio-like res2, cqe->flags are
>>> repurposed to return zone-relative offset
>>
>> Please provide a pointers to applications that are updated and ready
>> to take advantage of zone append.
>>
>> I do not believe it's beneficial at this point to change the libaio
>> API, applications that would want to use this API, should anyway
>> switch to use io_uring.
>>
>> Please also note that applications and libraries that want to take
>> advantage of zone append, can already use the zonefs file-system, as
>> it will use the zone append command when applicable.
>
> AFAIK, zonefs uses append while serving synchronous I/O. And append bio
> is waited upon synchronously. That maybe serving some purpose I do
> not know currently. But it seems applications using zonefs file
> abstraction will get benefitted if they could use the append
> themselves to
> carry the I/O, asynchronously.
Yep, please see Christoph's comment regarding adding the support to zonefs.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/3] zone-append support in aio and io-uring
2020-06-18 20:04 ` Matias Bjørling
@ 2020-06-19 1:03 ` Damien Le Moal
0 siblings, 0 replies; 40+ messages in thread
From: Damien Le Moal @ 2020-06-19 1:03 UTC (permalink / raw)
To: Matias Bjørling, Kanchan Joshi
Cc: [email protected], [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected], Keith Busch,
Christoph Hellwig
On 2020/06/19 5:04, Matias Bjørling wrote:
> On 18/06/2020 21.21, Kanchan Joshi wrote:
>> On Thu, Jun 18, 2020 at 10:04:32AM +0200, Matias Bjørling wrote:
>>> On 17/06/2020 19.23, Kanchan Joshi wrote:
>>>> This patchset enables issuing zone-append using aio and io-uring
>>>> direct-io interface.
>>>>
>>>> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application
>>>> uses start LBA
>>>> of the zone to issue append. On completion 'res2' field is used to
>>>> return
>>>> zone-relative offset.
>>>>
>>>> For io-uring, this introduces three opcodes:
>>>> IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>>> Since io_uring does not have aio-like res2, cqe->flags are
>>>> repurposed to return zone-relative offset
>>>
>>> Please provide a pointers to applications that are updated and ready
>>> to take advantage of zone append.
>>>
>>> I do not believe it's beneficial at this point to change the libaio
>>> API, applications that would want to use this API, should anyway
>>> switch to use io_uring.
>>>
>>> Please also note that applications and libraries that want to take
>>> advantage of zone append, can already use the zonefs file-system, as
>>> it will use the zone append command when applicable.
>>
>> AFAIK, zonefs uses append while serving synchronous I/O. And append bio
>> is waited upon synchronously. That maybe serving some purpose I do
>> not know currently. But it seems applications using zonefs file
>> abstraction will get benefitted if they could use the append
>> themselves to
>> carry the I/O, asynchronously.
> Yep, please see Christoph's comment regarding adding the support to zonefs.
For the asynchronous processing of zone append in zonefs, we need to add
plumbing in the iomap code first. Since this is missing currently, zonefs can
only do synchronous/blocking zone append for now. Will be working on that, if we
can come up with a semantic that makes sense for posix system calls. zonefs is
not a posix compliant file system, so we are not strongly tied by posix
specifications. But we still want to make it as easy as possible to understand
and use by the user.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/3] zone-append support in aio and io-uring
2020-06-18 17:52 ` Kanchan Joshi
@ 2020-06-19 3:08 ` Damien Le Moal
2020-06-19 7:56 ` Christoph Hellwig
1 sibling, 0 replies; 40+ messages in thread
From: Damien Le Moal @ 2020-06-19 3:08 UTC (permalink / raw)
To: Kanchan Joshi, [email protected]
Cc: [email protected], [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On 2020/06/19 2:55, Kanchan Joshi wrote:
> On Wed, Jun 17, 2020 at 11:56:34PM -0700, Christoph Hellwig wrote:
>> On Wed, Jun 17, 2020 at 10:53:36PM +0530, Kanchan Joshi wrote:
>>> This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>>>
>>> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
>>> of the zone to issue append. On completion 'res2' field is used to return
>>> zone-relative offset.
>>>
>>> For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>> Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset
>>
>> And what exactly are the semantics supposed to be? Remember the
>> unix file abstractions does not know about zones at all.
>>
>> I really don't think squeezing low-level not quite block storage
>> protocol details into the Linux read/write path is a good idea.
>
> I was thinking of raw block-access to zone device rather than pristine file
> abstraction. And in that context, semantics, at this point, are unchanged
> (i.e. same as direct writes) while flexibility of async-interface gets
> added.
The aio->aio_offset use by the user and kernel differs for regular writes and
zone append writes. This is a significant enough change to say that semantic
changed. Yes both cases are direct IOs, but specification of the write location
by the user and where the data actually lands on disk are different.
There are a lot of subtle things that can happen that makes mapping of zone
append operations to POSIX semantic difficult. E.g. for a regular file, using
zone append for any write issued to a file open with O_APPEND maps well to POSIX
only for blocking writes. For asynchronous writes, that is not true anymore
since the order of data defined by the automatic append after the previous async
write breaks: data can land anywhere in the zone regardless of the offset
specified on submission.
> Synchronous-writes on single-zone sound fine, but synchronous-appends on
> single-zone do not sound that fine.
Why not ? This is a perfectly valid use case that actually does not have any
semantic problem. It indeed may not be the most effective method to get high
performance but saying that it is "not fine" is not correct in my opinion.
>
>> What could be a useful addition is a way for O_APPEND/RWF_APPEND writes
>> to report where they actually wrote, as that comes close to Zone Append
>> while still making sense at our usual abstraction level for file I/O.
>
> Thanks for suggesting this. O and RWF_APPEND may not go well with block
> access as end-of-file will be picked from dev inode. But perhaps a new
> flag like RWF_ZONE_APPEND can help to transform writes (aio or uring)
> into append without introducing new opcodes.
Yes, RWF_ZONE_APPEND may be better if the semantic of RWF_APPEND cannot be
cleanly reused. But as Christoph said, RWF_ZONE_APPEND semantic need to be
clarified so that all reviewer can check the code against the intended behavior,
and comment on that intended behavior too.
> And, I think, this can fit fine on file-abstraction of ZoneFS as well.
May be. Depends on what semantic you are after for user zone append interface.
Ideally, we should have at least the same for raw block device and zonefs. But
zonefs may be able to do a better job thanks to its real regular file
abstraction of zones. As Christoph said, we started looking into it but lacked
time to complete this work. This is still on-going.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/3] zone-append support in aio and io-uring
2020-06-18 17:52 ` Kanchan Joshi
2020-06-19 3:08 ` Damien Le Moal
@ 2020-06-19 7:56 ` Christoph Hellwig
1 sibling, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2020-06-19 7:56 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, axboe, viro, bcrl, linux-fsdevel, linux-kernel,
linux-aio, io-uring, linux-block, selvakuma.s1, nj.shetty,
javier.gonz
On Thu, Jun 18, 2020 at 11:22:58PM +0530, Kanchan Joshi wrote:
> I was thinking of raw block-access to zone device rather than pristine file
> abstraction.
Why?
> And in that context, semantics, at this point, are unchanged
> (i.e. same as direct writes) while flexibility of async-interface gets
> added.
> Synchronous-writes on single-zone sound fine, but synchronous-appends on
> single-zone do not sound that fine.
Where does synchronous access come into play?
> > What could be a useful addition is a way for O_APPEND/RWF_APPEND writes
> > to report where they actually wrote, as that comes close to Zone Append
> > while still making sense at our usual abstraction level for file I/O.
>
> Thanks for suggesting this. O and RWF_APPEND may not go well with block
> access as end-of-file will be picked from dev inode.
No, but they go really well with zonefs.
> But perhaps a new
> flag like RWF_ZONE_APPEND can help to transform writes (aio or uring)
> into append without introducing new opcodes.
I don't think this is a good idea. Zones are a concept for a a very
specific class of zoned devices. Trying to shoe-horn this into the
byte address files / whole device abstraction not only is ugly
conceptually but also adds the overhead for it to the VFS.
And O_APPEND that returns the written position OTOH makes total sense
at the file level as well and not just for raw zoned devices.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-18 9:11 ` [email protected]
@ 2020-06-19 9:41 ` [email protected]
2020-06-19 11:15 ` Matias Bjørling
2020-06-19 14:15 ` Jens Axboe
0 siblings, 2 replies; 40+ messages in thread
From: [email protected] @ 2020-06-19 9:41 UTC (permalink / raw)
To: Damien Le Moal
Cc: Kanchan Joshi, [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
[-- Attachment #1: Type: text/plain, Size: 2691 bytes --]
Jens,
Would you have time to answer a question below in this thread?
On 18.06.2020 11:11, [email protected] wrote:
>On 18.06.2020 08:47, Damien Le Moal wrote:
>>On 2020/06/18 17:35, [email protected] wrote:
>>>On 18.06.2020 07:39, Damien Le Moal wrote:
>>>>On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>>>From: Selvakumar S <[email protected]>
>>>>>
>>>>>Introduce three new opcodes for zone-append -
>>>>>
>>>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE
>>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>>
>>>>>Repurpose cqe->flags to return zone-relative offset.
>>>>>
>>>>>Signed-off-by: SelvaKumar S <[email protected]>
>>>>>Signed-off-by: Kanchan Joshi <[email protected]>
>>>>>Signed-off-by: Nitesh Shetty <[email protected]>
>>>>>Signed-off-by: Javier Gonzalez <[email protected]>
>>>>>---
>>>>> fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
>>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>>
>>>>>diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>index 155f3d8..c14c873 100644
>>>>>--- a/fs/io_uring.c
>>>>>+++ b/fs/io_uring.c
>>>>>@@ -649,6 +649,10 @@ struct io_kiocb {
>>>>> unsigned long fsize;
>>>>> u64 user_data;
>>>>> u32 result;
>>>>>+#ifdef CONFIG_BLK_DEV_ZONED
>>>>>+ /* zone-relative offset for append, in bytes */
>>>>>+ u32 append_offset;
>>>>
>>>>this can overflow. u64 is needed.
>>>
>>>We chose to do it this way to start with because struct io_uring_cqe
>>>only has space for u32 when we reuse the flags.
>>>
>>>We can of course create a new cqe structure, but that will come with
>>>larger changes to io_uring for supporting append.
>>>
>>>Do you believe this is a better approach?
>>
>>The problem is that zone size are 32 bits in the kernel, as a number of sectors.
>>So any device that has a zone size smaller or equal to 2^31 512B sectors can be
>>accepted. Using a zone relative offset in bytes for returning zone append result
>>is OK-ish, but to match the kernel supported range of possible zone size, you
>>need 31+9 bits... 32 does not cut it.
>
>Agree. Our initial assumption was that u32 would cover current zone size
>requirements, but if this is a no-go, we will take the longer path.
Converting to u64 will require a new version of io_uring_cqe, where we
extend at least 32 bits. I believe this will need a whole new allocation
and probably ioctl().
Is this an acceptable change for you? We will of course add support for
liburing when we agree on the right way to do this.
Thanks,
Javier
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-19 9:41 ` [email protected]
@ 2020-06-19 11:15 ` Matias Bjørling
2020-06-19 14:18 ` Jens Axboe
2020-06-19 14:15 ` Jens Axboe
1 sibling, 1 reply; 40+ messages in thread
From: Matias Bjørling @ 2020-06-19 11:15 UTC (permalink / raw)
To: [email protected], Damien Le Moal
Cc: Kanchan Joshi, [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected]
On 19/06/2020 11.41, [email protected] wrote:
> Jens,
>
> Would you have time to answer a question below in this thread?
>
> On 18.06.2020 11:11, [email protected] wrote:
>> On 18.06.2020 08:47, Damien Le Moal wrote:
>>> On 2020/06/18 17:35, [email protected] wrote:
>>>> On 18.06.2020 07:39, Damien Le Moal wrote:
>>>>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>>>> From: Selvakumar S <[email protected]>
>>>>>>
>>>>>> Introduce three new opcodes for zone-append -
>>>>>>
>>>>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to
>>>>>> IORING_OP_WRITE
>>>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>>>
>>>>>> Repurpose cqe->flags to return zone-relative offset.
>>>>>>
>>>>>> Signed-off-by: SelvaKumar S <[email protected]>
>>>>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>>>>> Signed-off-by: Nitesh Shetty <[email protected]>
>>>>>> Signed-off-by: Javier Gonzalez <[email protected]>
>>>>>> ---
>>>>>> fs/io_uring.c | 72
>>>>>> +++++++++++++++++++++++++++++++++++++++++--
>>>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>> index 155f3d8..c14c873 100644
>>>>>> --- a/fs/io_uring.c
>>>>>> +++ b/fs/io_uring.c
>>>>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>>>>> unsigned long fsize;
>>>>>> u64 user_data;
>>>>>> u32 result;
>>>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>>>> + /* zone-relative offset for append, in bytes */
>>>>>> + u32 append_offset;
>>>>>
>>>>> this can overflow. u64 is needed.
>>>>
>>>> We chose to do it this way to start with because struct io_uring_cqe
>>>> only has space for u32 when we reuse the flags.
>>>>
>>>> We can of course create a new cqe structure, but that will come with
>>>> larger changes to io_uring for supporting append.
>>>>
>>>> Do you believe this is a better approach?
>>>
>>> The problem is that zone size are 32 bits in the kernel, as a number
>>> of sectors.
>>> So any device that has a zone size smaller or equal to 2^31 512B
>>> sectors can be
>>> accepted. Using a zone relative offset in bytes for returning zone
>>> append result
>>> is OK-ish, but to match the kernel supported range of possible zone
>>> size, you
>>> need 31+9 bits... 32 does not cut it.
>>
>> Agree. Our initial assumption was that u32 would cover current zone size
>> requirements, but if this is a no-go, we will take the longer path.
>
> Converting to u64 will require a new version of io_uring_cqe, where we
> extend at least 32 bits. I believe this will need a whole new allocation
> and probably ioctl().
>
> Is this an acceptable change for you? We will of course add support for
> liburing when we agree on the right way to do this.
I took a quick look at the code. No expert, but why not use the existing
userdata variable? use the lowest bits (40 bits) for the Zone Starting
LBA, and use the highest (24 bits) as index into the completion data
structure?
If you want to pass the memory address (same as what fio does) for the
data structure used for completion, one may also play some tricks by
using a relative memory address to the data structure. For example, the
x86_64 architecture uses 48 address bits for its memory addresses. With
24 bit, one can allocate the completion entries in a 32MB memory range,
and then use base_address + index to get back to the completion data
structure specified in the sqe.
Best, Matias
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-19 9:41 ` [email protected]
2020-06-19 11:15 ` Matias Bjørling
@ 2020-06-19 14:15 ` Jens Axboe
2020-06-19 14:59 ` Pavel Begunkov
1 sibling, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2020-06-19 14:15 UTC (permalink / raw)
To: [email protected], Damien Le Moal
Cc: Kanchan Joshi, [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 6/19/20 3:41 AM, [email protected] wrote:
> Jens,
>
> Would you have time to answer a question below in this thread?
>
> On 18.06.2020 11:11, [email protected] wrote:
>> On 18.06.2020 08:47, Damien Le Moal wrote:
>>> On 2020/06/18 17:35, [email protected] wrote:
>>>> On 18.06.2020 07:39, Damien Le Moal wrote:
>>>>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>>>> From: Selvakumar S <[email protected]>
>>>>>>
>>>>>> Introduce three new opcodes for zone-append -
>>>>>>
>>>>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE
>>>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>>>
>>>>>> Repurpose cqe->flags to return zone-relative offset.
>>>>>>
>>>>>> Signed-off-by: SelvaKumar S <[email protected]>
>>>>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>>>>> Signed-off-by: Nitesh Shetty <[email protected]>
>>>>>> Signed-off-by: Javier Gonzalez <[email protected]>
>>>>>> ---
>>>>>> fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
>>>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>> index 155f3d8..c14c873 100644
>>>>>> --- a/fs/io_uring.c
>>>>>> +++ b/fs/io_uring.c
>>>>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>>>>> unsigned long fsize;
>>>>>> u64 user_data;
>>>>>> u32 result;
>>>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>>>> + /* zone-relative offset for append, in bytes */
>>>>>> + u32 append_offset;
>>>>>
>>>>> this can overflow. u64 is needed.
>>>>
>>>> We chose to do it this way to start with because struct io_uring_cqe
>>>> only has space for u32 when we reuse the flags.
>>>>
>>>> We can of course create a new cqe structure, but that will come with
>>>> larger changes to io_uring for supporting append.
>>>>
>>>> Do you believe this is a better approach?
>>>
>>> The problem is that zone size are 32 bits in the kernel, as a number
>>> of sectors. So any device that has a zone size smaller or equal to
>>> 2^31 512B sectors can be accepted. Using a zone relative offset in
>>> bytes for returning zone append result is OK-ish, but to match the
>>> kernel supported range of possible zone size, you need 31+9 bits...
>>> 32 does not cut it.
>>
>> Agree. Our initial assumption was that u32 would cover current zone size
>> requirements, but if this is a no-go, we will take the longer path.
>
> Converting to u64 will require a new version of io_uring_cqe, where we
> extend at least 32 bits. I believe this will need a whole new allocation
> and probably ioctl().
>
> Is this an acceptable change for you? We will of course add support for
> liburing when we agree on the right way to do this.
If you need 64-bit of return value, then it's not going to work. Even
with the existing patches, reusing cqe->flags isn't going to fly, as
it would conflict with eg doing zone append writes with automatic
buffer selection.
We're not changing the io_uring_cqe. It's important to keep it lean, and
any other request type is generally fine with 64-bit tag + 32-bit result
(and 32-bit flags on the side) for completions.
Only viable alternative I see would be to provide an area to store this
information, and pass in a pointer to this at submission time through
the sqe. One issue I do see with that is if we only have this
information available at completion time, then we'd need some async punt
to copy it to user space... Generally not ideal.
A hackier approach would be to play some tricks with cqe->res and
cqe->flags, setting aside a flag to denote an extension of cqe->res.
That would mean excluding zone append (etc) from using buffer selection,
which probably isn't a huge deal. It'd be more problematic for any other
future flags. But if you just need 40 bits, then it could certainly
work. Rigth now, if cqe->flags & 1 is set, then (cqe->flags >> 16) is
the buffer ID. You could define IORING_CQE_F_ZONE_FOO to be bit 1, so
that:
uint64_t val = cqe->res; // assuming non-error here
if (cqe->flags & IORING_CQE_F_ZONE_FOO)
val |= (cqe->flags >> 16) << 32ULL;
and hence use the upper 16 bits of cqe->flags for the upper bits of your
(then) 48-bit total value.
--
Jens Axboe
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-19 11:15 ` Matias Bjørling
@ 2020-06-19 14:18 ` Jens Axboe
2020-06-19 15:14 ` Matias Bjørling
0 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2020-06-19 14:18 UTC (permalink / raw)
To: Matias Bjørling, [email protected], Damien Le Moal
Cc: Kanchan Joshi, [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 6/19/20 5:15 AM, Matias Bjørling wrote:
> On 19/06/2020 11.41, [email protected] wrote:
>> Jens,
>>
>> Would you have time to answer a question below in this thread?
>>
>> On 18.06.2020 11:11, [email protected] wrote:
>>> On 18.06.2020 08:47, Damien Le Moal wrote:
>>>> On 2020/06/18 17:35, [email protected] wrote:
>>>>> On 18.06.2020 07:39, Damien Le Moal wrote:
>>>>>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>>>>> From: Selvakumar S <[email protected]>
>>>>>>>
>>>>>>> Introduce three new opcodes for zone-append -
>>>>>>>
>>>>>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to
>>>>>>> IORING_OP_WRITE
>>>>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>>>>
>>>>>>> Repurpose cqe->flags to return zone-relative offset.
>>>>>>>
>>>>>>> Signed-off-by: SelvaKumar S <[email protected]>
>>>>>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>>>>>> Signed-off-by: Nitesh Shetty <[email protected]>
>>>>>>> Signed-off-by: Javier Gonzalez <[email protected]>
>>>>>>> ---
>>>>>>> fs/io_uring.c | 72
>>>>>>> +++++++++++++++++++++++++++++++++++++++++--
>>>>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>>> index 155f3d8..c14c873 100644
>>>>>>> --- a/fs/io_uring.c
>>>>>>> +++ b/fs/io_uring.c
>>>>>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>>>>>> unsigned long fsize;
>>>>>>> u64 user_data;
>>>>>>> u32 result;
>>>>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>>>>> + /* zone-relative offset for append, in bytes */
>>>>>>> + u32 append_offset;
>>>>>>
>>>>>> this can overflow. u64 is needed.
>>>>>
>>>>> We chose to do it this way to start with because struct io_uring_cqe
>>>>> only has space for u32 when we reuse the flags.
>>>>>
>>>>> We can of course create a new cqe structure, but that will come with
>>>>> larger changes to io_uring for supporting append.
>>>>>
>>>>> Do you believe this is a better approach?
>>>>
>>>> The problem is that zone size are 32 bits in the kernel, as a number
>>>> of sectors.
>>>> So any device that has a zone size smaller or equal to 2^31 512B
>>>> sectors can be
>>>> accepted. Using a zone relative offset in bytes for returning zone
>>>> append result
>>>> is OK-ish, but to match the kernel supported range of possible zone
>>>> size, you
>>>> need 31+9 bits... 32 does not cut it.
>>>
>>> Agree. Our initial assumption was that u32 would cover current zone size
>>> requirements, but if this is a no-go, we will take the longer path.
>>
>> Converting to u64 will require a new version of io_uring_cqe, where we
>> extend at least 32 bits. I believe this will need a whole new allocation
>> and probably ioctl().
>>
>> Is this an acceptable change for you? We will of course add support for
>> liburing when we agree on the right way to do this.
>
> I took a quick look at the code. No expert, but why not use the existing
> userdata variable? use the lowest bits (40 bits) for the Zone Starting
> LBA, and use the highest (24 bits) as index into the completion data
> structure?
>
> If you want to pass the memory address (same as what fio does) for the
> data structure used for completion, one may also play some tricks by
> using a relative memory address to the data structure. For example, the
> x86_64 architecture uses 48 address bits for its memory addresses. With
> 24 bit, one can allocate the completion entries in a 32MB memory range,
> and then use base_address + index to get back to the completion data
> structure specified in the sqe.
For any current request, sqe->user_data is just provided back as
cqe->user_data. This would make these requests behave differently
from everything else in that sense, which seems very confusing to me
if I was an application writer.
But generally I do agree with you, there are lots of ways to make
< 64-bit work as a tag without losing anything or having to jump
through hoops to do so. The lack of consistency introduced by having
zone append work differently is ugly, though.
--
Jens Axboe
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-19 14:15 ` Jens Axboe
@ 2020-06-19 14:59 ` Pavel Begunkov
2020-06-19 15:02 ` Jens Axboe
0 siblings, 1 reply; 40+ messages in thread
From: Pavel Begunkov @ 2020-06-19 14:59 UTC (permalink / raw)
To: Jens Axboe, [email protected], Damien Le Moal
Cc: Kanchan Joshi, [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 19/06/2020 17:15, Jens Axboe wrote:
> On 6/19/20 3:41 AM, [email protected] wrote:
>> Jens,
>>
>> Would you have time to answer a question below in this thread?
>>
>> On 18.06.2020 11:11, [email protected] wrote:
>>> On 18.06.2020 08:47, Damien Le Moal wrote:
>>>> On 2020/06/18 17:35, [email protected] wrote:
>>>>> On 18.06.2020 07:39, Damien Le Moal wrote:
>>>>>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>>>>> From: Selvakumar S <[email protected]>
>>>>>>>
>>>>>>> Introduce three new opcodes for zone-append -
>>>>>>>
>>>>>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE
>>>>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>>>>
>>>>>>> Repurpose cqe->flags to return zone-relative offset.
>>>>>>>
>>>>>>> Signed-off-by: SelvaKumar S <[email protected]>
>>>>>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>>>>>> Signed-off-by: Nitesh Shetty <[email protected]>
>>>>>>> Signed-off-by: Javier Gonzalez <[email protected]>
>>>>>>> ---
>>>>>>> fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
>>>>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>>> index 155f3d8..c14c873 100644
>>>>>>> --- a/fs/io_uring.c
>>>>>>> +++ b/fs/io_uring.c
>>>>>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>>>>>> unsigned long fsize;
>>>>>>> u64 user_data;
>>>>>>> u32 result;
>>>>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>>>>> + /* zone-relative offset for append, in bytes */
>>>>>>> + u32 append_offset;
>>>>>>
>>>>>> this can overflow. u64 is needed.
>>>>>
>>>>> We chose to do it this way to start with because struct io_uring_cqe
>>>>> only has space for u32 when we reuse the flags.
>>>>>
>>>>> We can of course create a new cqe structure, but that will come with
>>>>> larger changes to io_uring for supporting append.
>>>>>
>>>>> Do you believe this is a better approach?
>>>>
>>>> The problem is that zone size are 32 bits in the kernel, as a number
>>>> of sectors. So any device that has a zone size smaller or equal to
>>>> 2^31 512B sectors can be accepted. Using a zone relative offset in
>>>> bytes for returning zone append result is OK-ish, but to match the
>>>> kernel supported range of possible zone size, you need 31+9 bits...
>>>> 32 does not cut it.
>>>
>>> Agree. Our initial assumption was that u32 would cover current zone size
>>> requirements, but if this is a no-go, we will take the longer path.
>>
>> Converting to u64 will require a new version of io_uring_cqe, where we
>> extend at least 32 bits. I believe this will need a whole new allocation
>> and probably ioctl().
>>
>> Is this an acceptable change for you? We will of course add support for
>> liburing when we agree on the right way to do this.
>
> If you need 64-bit of return value, then it's not going to work. Even
> with the existing patches, reusing cqe->flags isn't going to fly, as
> it would conflict with eg doing zone append writes with automatic
> buffer selection.
Buffer selection is for reads/recv kind of requests, but appends
are writes. In theory they can co-exist using cqe->flags.
>
> We're not changing the io_uring_cqe. It's important to keep it lean, and
> any other request type is generally fine with 64-bit tag + 32-bit result
> (and 32-bit flags on the side) for completions.
>
> Only viable alternative I see would be to provide an area to store this
> information, and pass in a pointer to this at submission time through
> the sqe. One issue I do see with that is if we only have this
> information available at completion time, then we'd need some async punt
> to copy it to user space... Generally not ideal.
>
> A hackier approach would be to play some tricks with cqe->res and
> cqe->flags, setting aside a flag to denote an extension of cqe->res.
> That would mean excluding zone append (etc) from using buffer selection,
> which probably isn't a huge deal. It'd be more problematic for any other
> future flags. But if you just need 40 bits, then it could certainly
> work. Rigth now, if cqe->flags & 1 is set, then (cqe->flags >> 16) is
> the buffer ID. You could define IORING_CQE_F_ZONE_FOO to be bit 1, so
> that:
>
> uint64_t val = cqe->res; // assuming non-error here
>
> if (cqe->flags & IORING_CQE_F_ZONE_FOO)
> val |= (cqe->flags >> 16) << 32ULL;
>
> and hence use the upper 16 bits of cqe->flags for the upper bits of your
> (then) 48-bit total value.
How about returning offset in terms of 512-bytes chunks? NVMe is 512B
atomic/aligned. We'll lose an ability to do non-512 aligned appends, but
it won't hit media as such anyway (will be padded or cached), so
personally I don't see much benefit in having it.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-19 14:59 ` Pavel Begunkov
@ 2020-06-19 15:02 ` Jens Axboe
2020-06-21 18:52 ` [email protected]
0 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2020-06-19 15:02 UTC (permalink / raw)
To: Pavel Begunkov, [email protected], Damien Le Moal
Cc: Kanchan Joshi, [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 6/19/20 8:59 AM, Pavel Begunkov wrote:
> On 19/06/2020 17:15, Jens Axboe wrote:
>> On 6/19/20 3:41 AM, [email protected] wrote:
>>> Jens,
>>>
>>> Would you have time to answer a question below in this thread?
>>>
>>> On 18.06.2020 11:11, [email protected] wrote:
>>>> On 18.06.2020 08:47, Damien Le Moal wrote:
>>>>> On 2020/06/18 17:35, [email protected] wrote:
>>>>>> On 18.06.2020 07:39, Damien Le Moal wrote:
>>>>>>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>>>>>> From: Selvakumar S <[email protected]>
>>>>>>>>
>>>>>>>> Introduce three new opcodes for zone-append -
>>>>>>>>
>>>>>>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE
>>>>>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>>>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>>>>>
>>>>>>>> Repurpose cqe->flags to return zone-relative offset.
>>>>>>>>
>>>>>>>> Signed-off-by: SelvaKumar S <[email protected]>
>>>>>>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>>>>>>> Signed-off-by: Nitesh Shetty <[email protected]>
>>>>>>>> Signed-off-by: Javier Gonzalez <[email protected]>
>>>>>>>> ---
>>>>>>>> fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
>>>>>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>>>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>>>> index 155f3d8..c14c873 100644
>>>>>>>> --- a/fs/io_uring.c
>>>>>>>> +++ b/fs/io_uring.c
>>>>>>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>>>>>>> unsigned long fsize;
>>>>>>>> u64 user_data;
>>>>>>>> u32 result;
>>>>>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>>>>>> + /* zone-relative offset for append, in bytes */
>>>>>>>> + u32 append_offset;
>>>>>>>
>>>>>>> this can overflow. u64 is needed.
>>>>>>
>>>>>> We chose to do it this way to start with because struct io_uring_cqe
>>>>>> only has space for u32 when we reuse the flags.
>>>>>>
>>>>>> We can of course create a new cqe structure, but that will come with
>>>>>> larger changes to io_uring for supporting append.
>>>>>>
>>>>>> Do you believe this is a better approach?
>>>>>
>>>>> The problem is that zone size are 32 bits in the kernel, as a number
>>>>> of sectors. So any device that has a zone size smaller or equal to
>>>>> 2^31 512B sectors can be accepted. Using a zone relative offset in
>>>>> bytes for returning zone append result is OK-ish, but to match the
>>>>> kernel supported range of possible zone size, you need 31+9 bits...
>>>>> 32 does not cut it.
>>>>
>>>> Agree. Our initial assumption was that u32 would cover current zone size
>>>> requirements, but if this is a no-go, we will take the longer path.
>>>
>>> Converting to u64 will require a new version of io_uring_cqe, where we
>>> extend at least 32 bits. I believe this will need a whole new allocation
>>> and probably ioctl().
>>>
>>> Is this an acceptable change for you? We will of course add support for
>>> liburing when we agree on the right way to do this.
>>
>> If you need 64-bit of return value, then it's not going to work. Even
>> with the existing patches, reusing cqe->flags isn't going to fly, as
>> it would conflict with eg doing zone append writes with automatic
>> buffer selection.
>
> Buffer selection is for reads/recv kind of requests, but appends
> are writes. In theory they can co-exist using cqe->flags.
Yeah good point, since it's just writes, doesn't matter. But the other
point still stands, it could potentially conflict with other flags, but
I guess only to the extent where both flags would need extra storage in
->flags. So not a huge concern imho.
--
Jens Axboe
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-19 14:18 ` Jens Axboe
@ 2020-06-19 15:14 ` Matias Bjørling
2020-06-19 15:20 ` Jens Axboe
0 siblings, 1 reply; 40+ messages in thread
From: Matias Bjørling @ 2020-06-19 15:14 UTC (permalink / raw)
To: Jens Axboe, [email protected], Damien Le Moal
Cc: Kanchan Joshi, [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 19/06/2020 16.18, Jens Axboe wrote:
> On 6/19/20 5:15 AM, Matias Bjørling wrote:
>> On 19/06/2020 11.41, [email protected] wrote:
>>> Jens,
>>>
>>> Would you have time to answer a question below in this thread?
>>>
>>> On 18.06.2020 11:11, [email protected] wrote:
>>>> On 18.06.2020 08:47, Damien Le Moal wrote:
>>>>> On 2020/06/18 17:35, [email protected] wrote:
>>>>>> On 18.06.2020 07:39, Damien Le Moal wrote:
>>>>>>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>>>>>> From: Selvakumar S <[email protected]>
>>>>>>>>
>>>>>>>> Introduce three new opcodes for zone-append -
>>>>>>>>
>>>>>>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to
>>>>>>>> IORING_OP_WRITE
>>>>>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>>>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>>>>>
>>>>>>>> Repurpose cqe->flags to return zone-relative offset.
>>>>>>>>
>>>>>>>> Signed-off-by: SelvaKumar S <[email protected]>
>>>>>>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>>>>>>> Signed-off-by: Nitesh Shetty <[email protected]>
>>>>>>>> Signed-off-by: Javier Gonzalez <[email protected]>
>>>>>>>> ---
>>>>>>>> fs/io_uring.c | 72
>>>>>>>> +++++++++++++++++++++++++++++++++++++++++--
>>>>>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>>>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>>>> index 155f3d8..c14c873 100644
>>>>>>>> --- a/fs/io_uring.c
>>>>>>>> +++ b/fs/io_uring.c
>>>>>>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>>>>>>> unsigned long fsize;
>>>>>>>> u64 user_data;
>>>>>>>> u32 result;
>>>>>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>>>>>> + /* zone-relative offset for append, in bytes */
>>>>>>>> + u32 append_offset;
>>>>>>> this can overflow. u64 is needed.
>>>>>> We chose to do it this way to start with because struct io_uring_cqe
>>>>>> only has space for u32 when we reuse the flags.
>>>>>>
>>>>>> We can of course create a new cqe structure, but that will come with
>>>>>> larger changes to io_uring for supporting append.
>>>>>>
>>>>>> Do you believe this is a better approach?
>>>>> The problem is that zone size are 32 bits in the kernel, as a number
>>>>> of sectors.
>>>>> So any device that has a zone size smaller or equal to 2^31 512B
>>>>> sectors can be
>>>>> accepted. Using a zone relative offset in bytes for returning zone
>>>>> append result
>>>>> is OK-ish, but to match the kernel supported range of possible zone
>>>>> size, you
>>>>> need 31+9 bits... 32 does not cut it.
>>>> Agree. Our initial assumption was that u32 would cover current zone size
>>>> requirements, but if this is a no-go, we will take the longer path.
>>> Converting to u64 will require a new version of io_uring_cqe, where we
>>> extend at least 32 bits. I believe this will need a whole new allocation
>>> and probably ioctl().
>>>
>>> Is this an acceptable change for you? We will of course add support for
>>> liburing when we agree on the right way to do this.
>> I took a quick look at the code. No expert, but why not use the existing
>> userdata variable? use the lowest bits (40 bits) for the Zone Starting
>> LBA, and use the highest (24 bits) as index into the completion data
>> structure?
>>
>> If you want to pass the memory address (same as what fio does) for the
>> data structure used for completion, one may also play some tricks by
>> using a relative memory address to the data structure. For example, the
>> x86_64 architecture uses 48 address bits for its memory addresses. With
>> 24 bit, one can allocate the completion entries in a 32MB memory range,
>> and then use base_address + index to get back to the completion data
>> structure specified in the sqe.
> For any current request, sqe->user_data is just provided back as
> cqe->user_data. This would make these requests behave differently
> from everything else in that sense, which seems very confusing to me
> if I was an application writer.
>
> But generally I do agree with you, there are lots of ways to make
> < 64-bit work as a tag without losing anything or having to jump
> through hoops to do so. The lack of consistency introduced by having
> zone append work differently is ugly, though.
>
Yep, agree, and extending to three cachelines is big no-go. We could add
a flag that said the kernel has changes the userdata variable. That'll
make it very explicit.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-19 15:14 ` Matias Bjørling
@ 2020-06-19 15:20 ` Jens Axboe
2020-06-19 15:40 ` Matias Bjørling
0 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2020-06-19 15:20 UTC (permalink / raw)
To: Matias Bjørling, [email protected], Damien Le Moal
Cc: Kanchan Joshi, [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 6/19/20 9:14 AM, Matias Bjørling wrote:
> On 19/06/2020 16.18, Jens Axboe wrote:
>> On 6/19/20 5:15 AM, Matias Bjørling wrote:
>>> On 19/06/2020 11.41, [email protected] wrote:
>>>> Jens,
>>>>
>>>> Would you have time to answer a question below in this thread?
>>>>
>>>> On 18.06.2020 11:11, [email protected] wrote:
>>>>> On 18.06.2020 08:47, Damien Le Moal wrote:
>>>>>> On 2020/06/18 17:35, [email protected] wrote:
>>>>>>> On 18.06.2020 07:39, Damien Le Moal wrote:
>>>>>>>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>>>>>>> From: Selvakumar S <[email protected]>
>>>>>>>>>
>>>>>>>>> Introduce three new opcodes for zone-append -
>>>>>>>>>
>>>>>>>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to
>>>>>>>>> IORING_OP_WRITE
>>>>>>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>>>>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>>>>>>
>>>>>>>>> Repurpose cqe->flags to return zone-relative offset.
>>>>>>>>>
>>>>>>>>> Signed-off-by: SelvaKumar S <[email protected]>
>>>>>>>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>>>>>>>> Signed-off-by: Nitesh Shetty <[email protected]>
>>>>>>>>> Signed-off-by: Javier Gonzalez <[email protected]>
>>>>>>>>> ---
>>>>>>>>> fs/io_uring.c | 72
>>>>>>>>> +++++++++++++++++++++++++++++++++++++++++--
>>>>>>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>>>>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>>>>> index 155f3d8..c14c873 100644
>>>>>>>>> --- a/fs/io_uring.c
>>>>>>>>> +++ b/fs/io_uring.c
>>>>>>>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>>>>>>>> unsigned long fsize;
>>>>>>>>> u64 user_data;
>>>>>>>>> u32 result;
>>>>>>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>>>>>>> + /* zone-relative offset for append, in bytes */
>>>>>>>>> + u32 append_offset;
>>>>>>>> this can overflow. u64 is needed.
>>>>>>> We chose to do it this way to start with because struct io_uring_cqe
>>>>>>> only has space for u32 when we reuse the flags.
>>>>>>>
>>>>>>> We can of course create a new cqe structure, but that will come with
>>>>>>> larger changes to io_uring for supporting append.
>>>>>>>
>>>>>>> Do you believe this is a better approach?
>>>>>> The problem is that zone size are 32 bits in the kernel, as a number
>>>>>> of sectors.
>>>>>> So any device that has a zone size smaller or equal to 2^31 512B
>>>>>> sectors can be
>>>>>> accepted. Using a zone relative offset in bytes for returning zone
>>>>>> append result
>>>>>> is OK-ish, but to match the kernel supported range of possible zone
>>>>>> size, you
>>>>>> need 31+9 bits... 32 does not cut it.
>>>>> Agree. Our initial assumption was that u32 would cover current zone size
>>>>> requirements, but if this is a no-go, we will take the longer path.
>>>> Converting to u64 will require a new version of io_uring_cqe, where we
>>>> extend at least 32 bits. I believe this will need a whole new allocation
>>>> and probably ioctl().
>>>>
>>>> Is this an acceptable change for you? We will of course add support for
>>>> liburing when we agree on the right way to do this.
>>> I took a quick look at the code. No expert, but why not use the existing
>>> userdata variable? use the lowest bits (40 bits) for the Zone Starting
>>> LBA, and use the highest (24 bits) as index into the completion data
>>> structure?
>>>
>>> If you want to pass the memory address (same as what fio does) for the
>>> data structure used for completion, one may also play some tricks by
>>> using a relative memory address to the data structure. For example, the
>>> x86_64 architecture uses 48 address bits for its memory addresses. With
>>> 24 bit, one can allocate the completion entries in a 32MB memory range,
>>> and then use base_address + index to get back to the completion data
>>> structure specified in the sqe.
>> For any current request, sqe->user_data is just provided back as
>> cqe->user_data. This would make these requests behave differently
>> from everything else in that sense, which seems very confusing to me
>> if I was an application writer.
>>
>> But generally I do agree with you, there are lots of ways to make
>> < 64-bit work as a tag without losing anything or having to jump
>> through hoops to do so. The lack of consistency introduced by having
>> zone append work differently is ugly, though.
>>
> Yep, agree, and extending to three cachelines is big no-go. We could add
> a flag that said the kernel has changes the userdata variable. That'll
> make it very explicit.
Don't like that either, as it doesn't really change the fact that you're
now doing something very different with the user_data field, which is
just supposed to be passed in/out directly. Adding a random flag to
signal this behavior isn't very explicit either, imho. It's still some
out-of-band (ish) notification of behavior that is different from any
other command. This is very different from having a flag that says
"there's extra information in this other field", which is much cleaner.
--
Jens Axboe
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-19 15:20 ` Jens Axboe
@ 2020-06-19 15:40 ` Matias Bjørling
2020-06-19 15:44 ` Jens Axboe
0 siblings, 1 reply; 40+ messages in thread
From: Matias Bjørling @ 2020-06-19 15:40 UTC (permalink / raw)
To: Jens Axboe, [email protected], Damien Le Moal
Cc: Kanchan Joshi, [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 19/06/2020 17.20, Jens Axboe wrote:
> On 6/19/20 9:14 AM, Matias Bjørling wrote:
>> On 19/06/2020 16.18, Jens Axboe wrote:
>>> On 6/19/20 5:15 AM, Matias Bjørling wrote:
>>>> On 19/06/2020 11.41, [email protected] wrote:
>>>>> Jens,
>>>>>
>>>>> Would you have time to answer a question below in this thread?
>>>>>
>>>>> On 18.06.2020 11:11, [email protected] wrote:
>>>>>> On 18.06.2020 08:47, Damien Le Moal wrote:
>>>>>>> On 2020/06/18 17:35, [email protected] wrote:
>>>>>>>> On 18.06.2020 07:39, Damien Le Moal wrote:
>>>>>>>>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>>>>>>>> From: Selvakumar S <[email protected]>
>>>>>>>>>>
>>>>>>>>>> Introduce three new opcodes for zone-append -
>>>>>>>>>>
>>>>>>>>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to
>>>>>>>>>> IORING_OP_WRITE
>>>>>>>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>>>>>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>>>>>>>
>>>>>>>>>> Repurpose cqe->flags to return zone-relative offset.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: SelvaKumar S <[email protected]>
>>>>>>>>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>>>>>>>>> Signed-off-by: Nitesh Shetty <[email protected]>
>>>>>>>>>> Signed-off-by: Javier Gonzalez <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>> fs/io_uring.c | 72
>>>>>>>>>> +++++++++++++++++++++++++++++++++++++++++--
>>>>>>>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>>>>>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>>>>>> index 155f3d8..c14c873 100644
>>>>>>>>>> --- a/fs/io_uring.c
>>>>>>>>>> +++ b/fs/io_uring.c
>>>>>>>>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>>>>>>>>> unsigned long fsize;
>>>>>>>>>> u64 user_data;
>>>>>>>>>> u32 result;
>>>>>>>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>>>>>>>> + /* zone-relative offset for append, in bytes */
>>>>>>>>>> + u32 append_offset;
>>>>>>>>> this can overflow. u64 is needed.
>>>>>>>> We chose to do it this way to start with because struct io_uring_cqe
>>>>>>>> only has space for u32 when we reuse the flags.
>>>>>>>>
>>>>>>>> We can of course create a new cqe structure, but that will come with
>>>>>>>> larger changes to io_uring for supporting append.
>>>>>>>>
>>>>>>>> Do you believe this is a better approach?
>>>>>>> The problem is that zone size are 32 bits in the kernel, as a number
>>>>>>> of sectors.
>>>>>>> So any device that has a zone size smaller or equal to 2^31 512B
>>>>>>> sectors can be
>>>>>>> accepted. Using a zone relative offset in bytes for returning zone
>>>>>>> append result
>>>>>>> is OK-ish, but to match the kernel supported range of possible zone
>>>>>>> size, you
>>>>>>> need 31+9 bits... 32 does not cut it.
>>>>>> Agree. Our initial assumption was that u32 would cover current zone size
>>>>>> requirements, but if this is a no-go, we will take the longer path.
>>>>> Converting to u64 will require a new version of io_uring_cqe, where we
>>>>> extend at least 32 bits. I believe this will need a whole new allocation
>>>>> and probably ioctl().
>>>>>
>>>>> Is this an acceptable change for you? We will of course add support for
>>>>> liburing when we agree on the right way to do this.
>>>> I took a quick look at the code. No expert, but why not use the existing
>>>> userdata variable? use the lowest bits (40 bits) for the Zone Starting
>>>> LBA, and use the highest (24 bits) as index into the completion data
>>>> structure?
>>>>
>>>> If you want to pass the memory address (same as what fio does) for the
>>>> data structure used for completion, one may also play some tricks by
>>>> using a relative memory address to the data structure. For example, the
>>>> x86_64 architecture uses 48 address bits for its memory addresses. With
>>>> 24 bit, one can allocate the completion entries in a 32MB memory range,
>>>> and then use base_address + index to get back to the completion data
>>>> structure specified in the sqe.
>>> For any current request, sqe->user_data is just provided back as
>>> cqe->user_data. This would make these requests behave differently
>>> from everything else in that sense, which seems very confusing to me
>>> if I was an application writer.
>>>
>>> But generally I do agree with you, there are lots of ways to make
>>> < 64-bit work as a tag without losing anything or having to jump
>>> through hoops to do so. The lack of consistency introduced by having
>>> zone append work differently is ugly, though.
>>>
>> Yep, agree, and extending to three cachelines is big no-go. We could add
>> a flag that said the kernel has changes the userdata variable. That'll
>> make it very explicit.
> Don't like that either, as it doesn't really change the fact that you're
> now doing something very different with the user_data field, which is
> just supposed to be passed in/out directly. Adding a random flag to
> signal this behavior isn't very explicit either, imho. It's still some
> out-of-band (ish) notification of behavior that is different from any
> other command. This is very different from having a flag that says
> "there's extra information in this other field", which is much cleaner.
>
Ok. Then it's pulling in the bits from cqe->res and cqe->flags that you
mention in the other mail. Sounds good.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-19 15:40 ` Matias Bjørling
@ 2020-06-19 15:44 ` Jens Axboe
2020-06-21 18:55 ` [email protected]
0 siblings, 1 reply; 40+ messages in thread
From: Jens Axboe @ 2020-06-19 15:44 UTC (permalink / raw)
To: Matias Bjørling, [email protected], Damien Le Moal
Cc: Kanchan Joshi, [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 6/19/20 9:40 AM, Matias Bjørling wrote:
> On 19/06/2020 17.20, Jens Axboe wrote:
>> On 6/19/20 9:14 AM, Matias Bjørling wrote:
>>> On 19/06/2020 16.18, Jens Axboe wrote:
>>>> On 6/19/20 5:15 AM, Matias Bjørling wrote:
>>>>> On 19/06/2020 11.41, [email protected] wrote:
>>>>>> Jens,
>>>>>>
>>>>>> Would you have time to answer a question below in this thread?
>>>>>>
>>>>>> On 18.06.2020 11:11, [email protected] wrote:
>>>>>>> On 18.06.2020 08:47, Damien Le Moal wrote:
>>>>>>>> On 2020/06/18 17:35, [email protected] wrote:
>>>>>>>>> On 18.06.2020 07:39, Damien Le Moal wrote:
>>>>>>>>>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>>>>>>>>> From: Selvakumar S <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>> Introduce three new opcodes for zone-append -
>>>>>>>>>>>
>>>>>>>>>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to
>>>>>>>>>>> IORING_OP_WRITE
>>>>>>>>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>>>>>>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>>>>>>>>
>>>>>>>>>>> Repurpose cqe->flags to return zone-relative offset.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: SelvaKumar S <[email protected]>
>>>>>>>>>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>>>>>>>>>> Signed-off-by: Nitesh Shetty <[email protected]>
>>>>>>>>>>> Signed-off-by: Javier Gonzalez <[email protected]>
>>>>>>>>>>> ---
>>>>>>>>>>> fs/io_uring.c | 72
>>>>>>>>>>> +++++++++++++++++++++++++++++++++++++++++--
>>>>>>>>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>>>>>>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>>>>>>> index 155f3d8..c14c873 100644
>>>>>>>>>>> --- a/fs/io_uring.c
>>>>>>>>>>> +++ b/fs/io_uring.c
>>>>>>>>>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>>>>>>>>>> unsigned long fsize;
>>>>>>>>>>> u64 user_data;
>>>>>>>>>>> u32 result;
>>>>>>>>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>>>>>>>>> + /* zone-relative offset for append, in bytes */
>>>>>>>>>>> + u32 append_offset;
>>>>>>>>>> this can overflow. u64 is needed.
>>>>>>>>> We chose to do it this way to start with because struct io_uring_cqe
>>>>>>>>> only has space for u32 when we reuse the flags.
>>>>>>>>>
>>>>>>>>> We can of course create a new cqe structure, but that will come with
>>>>>>>>> larger changes to io_uring for supporting append.
>>>>>>>>>
>>>>>>>>> Do you believe this is a better approach?
>>>>>>>> The problem is that zone size are 32 bits in the kernel, as a number
>>>>>>>> of sectors.
>>>>>>>> So any device that has a zone size smaller or equal to 2^31 512B
>>>>>>>> sectors can be
>>>>>>>> accepted. Using a zone relative offset in bytes for returning zone
>>>>>>>> append result
>>>>>>>> is OK-ish, but to match the kernel supported range of possible zone
>>>>>>>> size, you
>>>>>>>> need 31+9 bits... 32 does not cut it.
>>>>>>> Agree. Our initial assumption was that u32 would cover current zone size
>>>>>>> requirements, but if this is a no-go, we will take the longer path.
>>>>>> Converting to u64 will require a new version of io_uring_cqe, where we
>>>>>> extend at least 32 bits. I believe this will need a whole new allocation
>>>>>> and probably ioctl().
>>>>>>
>>>>>> Is this an acceptable change for you? We will of course add support for
>>>>>> liburing when we agree on the right way to do this.
>>>>> I took a quick look at the code. No expert, but why not use the existing
>>>>> userdata variable? use the lowest bits (40 bits) for the Zone Starting
>>>>> LBA, and use the highest (24 bits) as index into the completion data
>>>>> structure?
>>>>>
>>>>> If you want to pass the memory address (same as what fio does) for the
>>>>> data structure used for completion, one may also play some tricks by
>>>>> using a relative memory address to the data structure. For example, the
>>>>> x86_64 architecture uses 48 address bits for its memory addresses. With
>>>>> 24 bit, one can allocate the completion entries in a 32MB memory range,
>>>>> and then use base_address + index to get back to the completion data
>>>>> structure specified in the sqe.
>>>> For any current request, sqe->user_data is just provided back as
>>>> cqe->user_data. This would make these requests behave differently
>>>> from everything else in that sense, which seems very confusing to me
>>>> if I was an application writer.
>>>>
>>>> But generally I do agree with you, there are lots of ways to make
>>>> < 64-bit work as a tag without losing anything or having to jump
>>>> through hoops to do so. The lack of consistency introduced by having
>>>> zone append work differently is ugly, though.
>>>>
>>> Yep, agree, and extending to three cachelines is big no-go. We could add
>>> a flag that said the kernel has changes the userdata variable. That'll
>>> make it very explicit.
>> Don't like that either, as it doesn't really change the fact that you're
>> now doing something very different with the user_data field, which is
>> just supposed to be passed in/out directly. Adding a random flag to
>> signal this behavior isn't very explicit either, imho. It's still some
>> out-of-band (ish) notification of behavior that is different from any
>> other command. This is very different from having a flag that says
>> "there's extra information in this other field", which is much cleaner.
>>
> Ok. Then it's pulling in the bits from cqe->res and cqe->flags that you
> mention in the other mail. Sounds good.
I think that's the best approach, if we need > 32-bits. Maybe we can get
by just using ->res, if we switch to multiples of 512b instead for the
result like Pavel suggested. That'd provide enough room in ->res, and
would be preferable imho. But if we do need > 32-bits, then we can use
this approach.
--
Jens Axboe
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-19 15:02 ` Jens Axboe
@ 2020-06-21 18:52 ` [email protected]
0 siblings, 0 replies; 40+ messages in thread
From: [email protected] @ 2020-06-21 18:52 UTC (permalink / raw)
To: Jens Axboe
Cc: Pavel Begunkov, Damien Le Moal, Kanchan Joshi,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 19.06.2020 09:02, Jens Axboe wrote:
>On 6/19/20 8:59 AM, Pavel Begunkov wrote:
>> On 19/06/2020 17:15, Jens Axboe wrote:
>>> On 6/19/20 3:41 AM, [email protected] wrote:
>>>> Jens,
>>>>
>>>> Would you have time to answer a question below in this thread?
>>>>
>>>> On 18.06.2020 11:11, [email protected] wrote:
>>>>> On 18.06.2020 08:47, Damien Le Moal wrote:
>>>>>> On 2020/06/18 17:35, [email protected] wrote:
>>>>>>> On 18.06.2020 07:39, Damien Le Moal wrote:
>>>>>>>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>>>>>>> From: Selvakumar S <[email protected]>
>>>>>>>>>
>>>>>>>>> Introduce three new opcodes for zone-append -
>>>>>>>>>
>>>>>>>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE
>>>>>>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>>>>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>>>>>>
>>>>>>>>> Repurpose cqe->flags to return zone-relative offset.
>>>>>>>>>
>>>>>>>>> Signed-off-by: SelvaKumar S <[email protected]>
>>>>>>>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>>>>>>>> Signed-off-by: Nitesh Shetty <[email protected]>
>>>>>>>>> Signed-off-by: Javier Gonzalez <[email protected]>
>>>>>>>>> ---
>>>>>>>>> fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
>>>>>>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>>>>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>>>>> index 155f3d8..c14c873 100644
>>>>>>>>> --- a/fs/io_uring.c
>>>>>>>>> +++ b/fs/io_uring.c
>>>>>>>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>>>>>>>> unsigned long fsize;
>>>>>>>>> u64 user_data;
>>>>>>>>> u32 result;
>>>>>>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>>>>>>> + /* zone-relative offset for append, in bytes */
>>>>>>>>> + u32 append_offset;
>>>>>>>>
>>>>>>>> this can overflow. u64 is needed.
>>>>>>>
>>>>>>> We chose to do it this way to start with because struct io_uring_cqe
>>>>>>> only has space for u32 when we reuse the flags.
>>>>>>>
>>>>>>> We can of course create a new cqe structure, but that will come with
>>>>>>> larger changes to io_uring for supporting append.
>>>>>>>
>>>>>>> Do you believe this is a better approach?
>>>>>>
>>>>>> The problem is that zone size are 32 bits in the kernel, as a number
>>>>>> of sectors. So any device that has a zone size smaller or equal to
>>>>>> 2^31 512B sectors can be accepted. Using a zone relative offset in
>>>>>> bytes for returning zone append result is OK-ish, but to match the
>>>>>> kernel supported range of possible zone size, you need 31+9 bits...
>>>>>> 32 does not cut it.
>>>>>
>>>>> Agree. Our initial assumption was that u32 would cover current zone size
>>>>> requirements, but if this is a no-go, we will take the longer path.
>>>>
>>>> Converting to u64 will require a new version of io_uring_cqe, where we
>>>> extend at least 32 bits. I believe this will need a whole new allocation
>>>> and probably ioctl().
>>>>
>>>> Is this an acceptable change for you? We will of course add support for
>>>> liburing when we agree on the right way to do this.
>>>
>>> If you need 64-bit of return value, then it's not going to work. Even
>>> with the existing patches, reusing cqe->flags isn't going to fly, as
>>> it would conflict with eg doing zone append writes with automatic
>>> buffer selection.
>>
>> Buffer selection is for reads/recv kind of requests, but appends
>> are writes. In theory they can co-exist using cqe->flags.
>
>Yeah good point, since it's just writes, doesn't matter. But the other
>point still stands, it could potentially conflict with other flags, but
>I guess only to the extent where both flags would need extra storage in
>->flags. So not a huge concern imho.
Very good point Pavel!
If co-existing with the current flags is an option, I'll explore this
for the next version.
Thanks Jens and Pavel for the time and ideas!
Javier
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 3/3] io_uring: add support for zone-append
2020-06-19 15:44 ` Jens Axboe
@ 2020-06-21 18:55 ` [email protected]
0 siblings, 0 replies; 40+ messages in thread
From: [email protected] @ 2020-06-21 18:55 UTC (permalink / raw)
To: Jens Axboe
Cc: Matias Bjørling, Damien Le Moal, Kanchan Joshi,
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected]
On 19.06.2020 09:44, Jens Axboe wrote:
>On 6/19/20 9:40 AM, Matias Bjørling wrote:
>> On 19/06/2020 17.20, Jens Axboe wrote:
>>> On 6/19/20 9:14 AM, Matias Bjørling wrote:
>>>> On 19/06/2020 16.18, Jens Axboe wrote:
>>>>> On 6/19/20 5:15 AM, Matias Bjørling wrote:
>>>>>> On 19/06/2020 11.41, [email protected] wrote:
>>>>>>> Jens,
>>>>>>>
>>>>>>> Would you have time to answer a question below in this thread?
>>>>>>>
>>>>>>> On 18.06.2020 11:11, [email protected] wrote:
>>>>>>>> On 18.06.2020 08:47, Damien Le Moal wrote:
>>>>>>>>> On 2020/06/18 17:35, [email protected] wrote:
>>>>>>>>>> On 18.06.2020 07:39, Damien Le Moal wrote:
>>>>>>>>>>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>>>>>>>>>> From: Selvakumar S <[email protected]>
>>>>>>>>>>>>
>>>>>>>>>>>> Introduce three new opcodes for zone-append -
>>>>>>>>>>>>
>>>>>>>>>>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to
>>>>>>>>>>>> IORING_OP_WRITE
>>>>>>>>>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>>>>>>>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>>>>>>>>>
>>>>>>>>>>>> Repurpose cqe->flags to return zone-relative offset.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: SelvaKumar S <[email protected]>
>>>>>>>>>>>> Signed-off-by: Kanchan Joshi <[email protected]>
>>>>>>>>>>>> Signed-off-by: Nitesh Shetty <[email protected]>
>>>>>>>>>>>> Signed-off-by: Javier Gonzalez <[email protected]>
>>>>>>>>>>>> ---
>>>>>>>>>>>> fs/io_uring.c | 72
>>>>>>>>>>>> +++++++++++++++++++++++++++++++++++++++++--
>>>>>>>>>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>>>>>>>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>>>>>>>> index 155f3d8..c14c873 100644
>>>>>>>>>>>> --- a/fs/io_uring.c
>>>>>>>>>>>> +++ b/fs/io_uring.c
>>>>>>>>>>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>>>>>>>>>>> unsigned long fsize;
>>>>>>>>>>>> u64 user_data;
>>>>>>>>>>>> u32 result;
>>>>>>>>>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>>>>>>>>>> + /* zone-relative offset for append, in bytes */
>>>>>>>>>>>> + u32 append_offset;
>>>>>>>>>>> this can overflow. u64 is needed.
>>>>>>>>>> We chose to do it this way to start with because struct io_uring_cqe
>>>>>>>>>> only has space for u32 when we reuse the flags.
>>>>>>>>>>
>>>>>>>>>> We can of course create a new cqe structure, but that will come with
>>>>>>>>>> larger changes to io_uring for supporting append.
>>>>>>>>>>
>>>>>>>>>> Do you believe this is a better approach?
>>>>>>>>> The problem is that zone size are 32 bits in the kernel, as a number
>>>>>>>>> of sectors.
>>>>>>>>> So any device that has a zone size smaller or equal to 2^31 512B
>>>>>>>>> sectors can be
>>>>>>>>> accepted. Using a zone relative offset in bytes for returning zone
>>>>>>>>> append result
>>>>>>>>> is OK-ish, but to match the kernel supported range of possible zone
>>>>>>>>> size, you
>>>>>>>>> need 31+9 bits... 32 does not cut it.
>>>>>>>> Agree. Our initial assumption was that u32 would cover current zone size
>>>>>>>> requirements, but if this is a no-go, we will take the longer path.
>>>>>>> Converting to u64 will require a new version of io_uring_cqe, where we
>>>>>>> extend at least 32 bits. I believe this will need a whole new allocation
>>>>>>> and probably ioctl().
>>>>>>>
>>>>>>> Is this an acceptable change for you? We will of course add support for
>>>>>>> liburing when we agree on the right way to do this.
>>>>>> I took a quick look at the code. No expert, but why not use the existing
>>>>>> userdata variable? use the lowest bits (40 bits) for the Zone Starting
>>>>>> LBA, and use the highest (24 bits) as index into the completion data
>>>>>> structure?
>>>>>>
>>>>>> If you want to pass the memory address (same as what fio does) for the
>>>>>> data structure used for completion, one may also play some tricks by
>>>>>> using a relative memory address to the data structure. For example, the
>>>>>> x86_64 architecture uses 48 address bits for its memory addresses. With
>>>>>> 24 bit, one can allocate the completion entries in a 32MB memory range,
>>>>>> and then use base_address + index to get back to the completion data
>>>>>> structure specified in the sqe.
>>>>> For any current request, sqe->user_data is just provided back as
>>>>> cqe->user_data. This would make these requests behave differently
>>>>> from everything else in that sense, which seems very confusing to me
>>>>> if I was an application writer.
>>>>>
>>>>> But generally I do agree with you, there are lots of ways to make
>>>>> < 64-bit work as a tag without losing anything or having to jump
>>>>> through hoops to do so. The lack of consistency introduced by having
>>>>> zone append work differently is ugly, though.
>>>>>
>>>> Yep, agree, and extending to three cachelines is big no-go. We could add
>>>> a flag that said the kernel has changes the userdata variable. That'll
>>>> make it very explicit.
>>> Don't like that either, as it doesn't really change the fact that you're
>>> now doing something very different with the user_data field, which is
>>> just supposed to be passed in/out directly. Adding a random flag to
>>> signal this behavior isn't very explicit either, imho. It's still some
>>> out-of-band (ish) notification of behavior that is different from any
>>> other command. This is very different from having a flag that says
>>> "there's extra information in this other field", which is much cleaner.
>>>
>> Ok. Then it's pulling in the bits from cqe->res and cqe->flags that you
>> mention in the other mail. Sounds good.
>
>I think that's the best approach, if we need > 32-bits. Maybe we can get
>by just using ->res, if we switch to multiples of 512b instead for the
>result like Pavel suggested. That'd provide enough room in ->res, and
>would be preferable imho. But if we do need > 32-bits, then we can use
>this approach.
Sounds good.
Thanks Matias too for chipping in with more ideas. We have enough for a
v2.
Javier
^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2020-06-21 18:55 UTC | newest]
Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <CGME20200617172653epcas5p488de50090415eb802e62acc0e23d8812@epcas5p4.samsung.com>
2020-06-17 17:23 ` [PATCH 0/3] zone-append support in aio and io-uring Kanchan Joshi
[not found] ` <CGME20200617172702epcas5p4dbf4729d31d9a85ab1d261d04f238e61@epcas5p4.samsung.com>
2020-06-17 17:23 ` [PATCH 1/3] fs,block: Introduce IOCB_ZONE_APPEND and direct-io handling Kanchan Joshi
2020-06-17 19:02 ` Pavel Begunkov
2020-06-18 7:16 ` Damien Le Moal
2020-06-18 18:35 ` Kanchan Joshi
[not found] ` <CGME20200617172706epcas5p4dcbc164063f58bad95b211b9d6dfbfa9@epcas5p4.samsung.com>
2020-06-17 17:23 ` [PATCH 2/3] aio: add support for zone-append Kanchan Joshi
2020-06-18 7:33 ` Damien Le Moal
[not found] ` <CGME20200617172713epcas5p352f2907a12bd4ee3c97be1c7d8e1569e@epcas5p3.samsung.com>
2020-06-17 17:23 ` [PATCH 3/3] io_uring: " Kanchan Joshi
2020-06-17 18:55 ` Pavel Begunkov
2020-06-18 7:39 ` Damien Le Moal
2020-06-18 8:35 ` [email protected]
2020-06-18 8:47 ` Damien Le Moal
2020-06-18 9:11 ` [email protected]
2020-06-19 9:41 ` [email protected]
2020-06-19 11:15 ` Matias Bjørling
2020-06-19 14:18 ` Jens Axboe
2020-06-19 15:14 ` Matias Bjørling
2020-06-19 15:20 ` Jens Axboe
2020-06-19 15:40 ` Matias Bjørling
2020-06-19 15:44 ` Jens Axboe
2020-06-21 18:55 ` [email protected]
2020-06-19 14:15 ` Jens Axboe
2020-06-19 14:59 ` Pavel Begunkov
2020-06-19 15:02 ` Jens Axboe
2020-06-21 18:52 ` [email protected]
2020-06-17 17:42 ` [PATCH 0/3] zone-append support in aio and io-uring Matthew Wilcox
2020-06-18 6:56 ` Christoph Hellwig
2020-06-18 8:29 ` Javier González
2020-06-18 17:52 ` Kanchan Joshi
2020-06-19 3:08 ` Damien Le Moal
2020-06-19 7:56 ` Christoph Hellwig
2020-06-18 8:04 ` Matias Bjørling
2020-06-18 8:27 ` Javier González
2020-06-18 8:32 ` Matias Bjørling
2020-06-18 8:39 ` Javier González
2020-06-18 8:46 ` Matias Bjørling
2020-06-18 14:16 ` Christoph Hellwig
2020-06-18 19:21 ` Kanchan Joshi
2020-06-18 20:04 ` Matias Bjørling
2020-06-19 1:03 ` Damien Le Moal
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox