* [PATCH v2 0/2] zone-append support in io-uring and aio [not found] <CGME20200625171829epcas5p268486a0780571edb4999fc7b3caab602@epcas5p2.samsung.com> @ 2020-06-25 17:15 ` Kanchan Joshi [not found] ` <CGME20200625171834epcas5p226a24dfcb84cfa83fe29a2bd17795d85@epcas5p2.samsung.com> ` (3 more replies) 0 siblings, 4 replies; 20+ messages in thread From: Kanchan Joshi @ 2020-06-25 17:15 UTC (permalink / raw) To: axboe, viro, bcrl Cc: asml.silence, Damien.LeMoal, hch, linux-fsdevel, mb, linux-kernel, linux-aio, io-uring, linux-block, selvakuma.s1, nj.shetty, javier.gonz, Kanchan Joshi [Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox] This patchset enables zone-append using io-uring/linux-aio, on block IO path. Purpose is to provide zone-append consumption ability to applications which are using zoned-block-device directly. The application may specify RWF_ZONE_APPEND flag with write when it wants to send zone-append. RWF_* flags work with a certain subset of APIs e.g. uring, aio, and pwritev2. An error is reported if zone-append is requested using pwritev2. It is not in the scope of this patchset to support pwritev2 or any other sync write API for reasons described later. Zone-append completion result ---> With zone-append, where write took place can only be known after completion. So apart from usual return value of write, additional mean is needed to obtain the actual written location. In aio, this is returned to application using res2 field of io_event - struct io_event { __u64 data; /* the data field from the iocb */ __u64 obj; /* what iocb this event came from */ __s64 res; /* result code for this event */ __s64 res2; /* secondary result */ }; In io-uring, cqe->flags is repurposed for zone-append result. struct io_uring_cqe { __u64 user_data; /* sqe->data submission passed back */ __s32 res; /* result code for this event */ __u32 flags; }; Since 32 bit flags is not sufficient, we choose to return zone-relative offset in sector/512b units. This can cover zone-size represented by chunk_sectors. Applications will have the trouble to combine this with zone start to know disk-relative offset. But if more bits are obtained by pulling from res field that too would compel application to interpret res field differently, and it seems more painstaking than the former option. To keep uniformity, even with aio, zone-relative offset is returned. Append using io_uring fixed-buffer ---> This is flagged as not-supported at the moment. Reason being, for fixed-buffer io-uring sends iov_iter of bvec type. But current append-infra in block-layer does not support such iov_iter. Block IO vs File IO ---> For now, the user zone-append interface is supported only for zoned-block-device. Regular files/block-devices are not supported. Regular file-system (e.g. F2FS) will not need this anyway, because zone peculiarities are abstracted within FS. At this point, ZoneFS also likes to use append implicitly rather than explicitly. But if/when ZoneFS starts supporting explicit/on-demand zone-append, the check allowing-only-block-device should be changed. Semantics ---> Zone-append, by its nature, may perform write on a different location than what was specified. It does not fit into POSIX, and trying to fit may just undermine its benefit. It may be better to keep semantics as close to zone-append as possible i.e. specify zone-start location, and obtain the actual-write location post completion. Towards that goal, existing async APIs seem to fit fine. Async APIs (uring, linux aio) do not work on implicit write-pointer and demand explicit write offset (which is what we need for append). Neither write-pointer is taken as input, nor it is updated on completion. And there is a clear way to get zone-append result. Zone-aware applications while using these async APIs can be fine with, for the lack of better word, zone-append semantics itself. Sync APIs work with implicit write-pointer (at least few of those), and there is no way to obtain zone-append result, making it hard for user-space zone-append. Tests ---> Using new interface in fio (uring and libaio engine) by extending zbd tests for zone-append: https://github.com/axboe/fio/pull/1026 Changes since v1: - No new opcodes in uring or aio. Use RWF_ZONE_APPEND flag instead. - linux-aio changes vanish because of no new opcode - Fixed the overflow and other issues mentioned by Damien - Simplified uring support code, fixed the issues mentioned by Pavel - Added error checks Kanchan Joshi (1): fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path Selvakumar S (1): io_uring: add support for zone-append fs/block_dev.c | 28 ++++++++++++++++++++++++---- fs/io_uring.c | 32 ++++++++++++++++++++++++++++++-- include/linux/fs.h | 9 +++++++++ include/uapi/linux/fs.h | 5 ++++- 4 files changed, 67 insertions(+), 7 deletions(-) -- 2.7.4 ^ permalink raw reply [flat|nested] 20+ messages in thread
[parent not found: <CGME20200625171834epcas5p226a24dfcb84cfa83fe29a2bd17795d85@epcas5p2.samsung.com>]
* [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path [not found] ` <CGME20200625171834epcas5p226a24dfcb84cfa83fe29a2bd17795d85@epcas5p2.samsung.com> @ 2020-06-25 17:15 ` Kanchan Joshi 2020-06-26 2:50 ` Damien Le Moal 2020-06-26 8:58 ` Christoph Hellwig 0 siblings, 2 replies; 20+ messages in thread From: Kanchan Joshi @ 2020-06-25 17:15 UTC (permalink / raw) To: axboe, viro, bcrl Cc: asml.silence, Damien.LeMoal, hch, linux-fsdevel, mb, linux-kernel, linux-aio, io-uring, linux-block, selvakuma.s1, nj.shetty, javier.gonz, Kanchan Joshi, Arnav Dawn Introduce RWF_ZONE_APPEND flag to represent zone-append. User-space sends this with write. Add IOCB_ZONE_APPEND which is set in kiocb->ki_flags on receiving RWF_ZONE_APPEND. Make direct IO submission path use IOCB_ZONE_APPEND to send bio with append op. Direct IO completion returns zone-relative offset, in sector unit, to upper layer using kiocb->ki_complete interface. Report error if zone-append is requested on regular file or on sync kiocb (i.e. one without ki_complete). Signed-off-by: Kanchan Joshi <[email protected]> Signed-off-by: SelvaKumar S <[email protected]> Signed-off-by: Arnav Dawn <[email protected]> Signed-off-by: Nitesh Shetty <[email protected]> Signed-off-by: Javier Gonzalez <[email protected]> --- fs/block_dev.c | 28 ++++++++++++++++++++++++---- include/linux/fs.h | 9 +++++++++ include/uapi/linux/fs.h | 5 ++++- 3 files changed, 37 insertions(+), 5 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 47860e5..5180268 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -185,6 +185,10 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb) /* avoid the need for a I/O completion work item */ if (iocb->ki_flags & IOCB_DSYNC) op |= REQ_FUA; + + if (iocb->ki_flags & IOCB_ZONE_APPEND) + op |= REQ_OP_ZONE_APPEND; + return op; } @@ -295,6 +299,14 @@ static int blkdev_iopoll(struct kiocb *kiocb, bool wait) return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait); } +static inline long blkdev_bio_end_io_append(struct bio *bio) +{ + sector_t zone_sectors = blk_queue_zone_sectors(bio->bi_disk->queue); + + /* calculate zone relative offset for zone append */ + return bio->bi_iter.bi_sector & (zone_sectors - 1); +} + static void blkdev_bio_end_io(struct bio *bio) { struct blkdev_dio *dio = bio->bi_private; @@ -307,15 +319,19 @@ static void blkdev_bio_end_io(struct bio *bio) if (!dio->is_sync) { struct kiocb *iocb = dio->iocb; ssize_t ret; + long res2 = 0; if (likely(!dio->bio.bi_status)) { ret = dio->size; iocb->ki_pos += ret; + + if (iocb->ki_flags & IOCB_ZONE_APPEND) + res2 = blkdev_bio_end_io_append(bio); } else { ret = blk_status_to_errno(dio->bio.bi_status); } - dio->iocb->ki_complete(iocb, ret, 0); + dio->iocb->ki_complete(iocb, ret, res2); if (dio->multi_bio) bio_put(&dio->bio); } else { @@ -382,6 +398,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) bio->bi_private = dio; bio->bi_end_io = blkdev_bio_end_io; bio->bi_ioprio = iocb->ki_ioprio; + bio->bi_opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb); ret = bio_iov_iter_get_pages(bio, iter); if (unlikely(ret)) { @@ -391,11 +408,9 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) } if (is_read) { - bio->bi_opf = REQ_OP_READ; if (dio->should_dirty) bio_set_pages_dirty(bio); } else { - bio->bi_opf = dio_bio_write_op(iocb); task_io_account_write(bio->bi_iter.bi_size); } @@ -465,12 +480,17 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) { + bool is_sync = is_sync_kiocb(iocb); int nr_pages; + /* zone-append is supported only on async-kiocb */ + if (is_sync && iocb->ki_flags & IOCB_ZONE_APPEND) + return -EINVAL; + nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); if (!nr_pages) return 0; - if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) + if (is_sync && nr_pages <= BIO_MAX_PAGES) return __blkdev_direct_IO_simple(iocb, iter, nr_pages); return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES)); diff --git a/include/linux/fs.h b/include/linux/fs.h index 6c4ab4d..3202d9a 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -315,6 +315,7 @@ enum rw_hint { #define IOCB_SYNC (1 << 5) #define IOCB_WRITE (1 << 6) #define IOCB_NOWAIT (1 << 7) +#define IOCB_ZONE_APPEND (1 << 8) struct kiocb { struct file *ki_filp; @@ -3456,6 +3457,14 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags) ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC); if (flags & RWF_APPEND) ki->ki_flags |= IOCB_APPEND; + if (flags & RWF_ZONE_APPEND) { + /* currently support block device only */ + umode_t mode = file_inode(ki->ki_filp)->i_mode; + + if (!(S_ISBLK(mode))) + return -EOPNOTSUPP; + ki->ki_flags |= IOCB_ZONE_APPEND; + } return 0; } diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index 379a612..1ce06e9 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -299,8 +299,11 @@ typedef int __bitwise __kernel_rwf_t; /* per-IO O_APPEND */ #define RWF_APPEND ((__force __kernel_rwf_t)0x00000010) +/* per-IO O_APPEND */ +#define RWF_ZONE_APPEND ((__force __kernel_rwf_t)0x00000020) + /* mask of flags supported by the kernel */ #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ - RWF_APPEND) + RWF_APPEND | RWF_ZONE_APPEND) #endif /* _UAPI_LINUX_FS_H */ -- 2.7.4 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path 2020-06-25 17:15 ` [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path Kanchan Joshi @ 2020-06-26 2:50 ` Damien Le Moal 2020-06-29 18:32 ` Kanchan Joshi 2020-06-26 8:58 ` Christoph Hellwig 1 sibling, 1 reply; 20+ messages in thread From: Damien Le Moal @ 2020-06-26 2:50 UTC (permalink / raw) To: Kanchan Joshi, [email protected], [email protected], [email protected] Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], Arnav Dawn On 2020/06/26 2:18, Kanchan Joshi wrote: > Introduce RWF_ZONE_APPEND flag to represent zone-append. User-space > sends this with write. Add IOCB_ZONE_APPEND which is set in > kiocb->ki_flags on receiving RWF_ZONE_APPEND. > Make direct IO submission path use IOCB_ZONE_APPEND to send bio with > append op. Direct IO completion returns zone-relative offset, in sector > unit, to upper layer using kiocb->ki_complete interface. > Report error if zone-append is requested on regular file or on sync > kiocb (i.e. one without ki_complete). > > Signed-off-by: Kanchan Joshi <[email protected]> > Signed-off-by: SelvaKumar S <[email protected]> > Signed-off-by: Arnav Dawn <[email protected]> > Signed-off-by: Nitesh Shetty <[email protected]> > Signed-off-by: Javier Gonzalez <[email protected]> > --- > fs/block_dev.c | 28 ++++++++++++++++++++++++---- > include/linux/fs.h | 9 +++++++++ > include/uapi/linux/fs.h | 5 ++++- > 3 files changed, 37 insertions(+), 5 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 47860e5..5180268 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -185,6 +185,10 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb) > /* avoid the need for a I/O completion work item */ > if (iocb->ki_flags & IOCB_DSYNC) > op |= REQ_FUA; > + > + if (iocb->ki_flags & IOCB_ZONE_APPEND) > + op |= REQ_OP_ZONE_APPEND; This is wrong. REQ_OP_WRITE is already set in the declaration of "op". How can this work ? > + > return op; > } > > @@ -295,6 +299,14 @@ static int blkdev_iopoll(struct kiocb *kiocb, bool wait) > return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait); > } > > +static inline long blkdev_bio_end_io_append(struct bio *bio) > +{ > + sector_t zone_sectors = blk_queue_zone_sectors(bio->bi_disk->queue); > + > + /* calculate zone relative offset for zone append */ The name of the function may be better spelling out zone_append instead of just append. But see next comment. > + return bio->bi_iter.bi_sector & (zone_sectors - 1); > +} > + > static void blkdev_bio_end_io(struct bio *bio) > { > struct blkdev_dio *dio = bio->bi_private; > @@ -307,15 +319,19 @@ static void blkdev_bio_end_io(struct bio *bio) > if (!dio->is_sync) { > struct kiocb *iocb = dio->iocb; > ssize_t ret; > + long res2 = 0; > > if (likely(!dio->bio.bi_status)) { > ret = dio->size; > iocb->ki_pos += ret; > + Blank line not needed. > + if (iocb->ki_flags & IOCB_ZONE_APPEND)> + res2 = blkdev_bio_end_io_append(bio); The name blkdev_bio_end_io_append() implies a bio end_io callback function, which is not the case. What about naming this blkdev_bio_res2() and move the if inside it ? > } else { > ret = blk_status_to_errno(dio->bio.bi_status); add "res2 = 0;" here and drop the declaration initialization. That will avoid doing the assignment twice for zone append case. > } > > - dio->iocb->ki_complete(iocb, ret, 0); > + dio->iocb->ki_complete(iocb, ret, res2); > if (dio->multi_bio) > bio_put(&dio->bio); > } else { > @@ -382,6 +398,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) > bio->bi_private = dio; > bio->bi_end_io = blkdev_bio_end_io; > bio->bi_ioprio = iocb->ki_ioprio; > + bio->bi_opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb); Personally, I would prefer a plain "if () else". Or even better: change dio_bio_write_op() into dio_bio_op() and just have: bio->bi_opf = dio_bio_op(iocb); > > ret = bio_iov_iter_get_pages(bio, iter); > if (unlikely(ret)) { > @@ -391,11 +408,9 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) > } > > if (is_read) { > - bio->bi_opf = REQ_OP_READ; > if (dio->should_dirty) > bio_set_pages_dirty(bio); > } else { > - bio->bi_opf = dio_bio_write_op(iocb); > task_io_account_write(bio->bi_iter.bi_size); > } > > @@ -465,12 +480,17 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) > static ssize_t > blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) > { > + bool is_sync = is_sync_kiocb(iocb); > int nr_pages; > > + /* zone-append is supported only on async-kiocb */ > + if (is_sync && iocb->ki_flags & IOCB_ZONE_APPEND) > + return -EINVAL; > + > nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES + 1); > if (!nr_pages) > return 0; > - if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_PAGES) > + if (is_sync && nr_pages <= BIO_MAX_PAGES) > return __blkdev_direct_IO_simple(iocb, iter, nr_pages); > > return __blkdev_direct_IO(iocb, iter, min(nr_pages, BIO_MAX_PAGES)); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 6c4ab4d..3202d9a 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -315,6 +315,7 @@ enum rw_hint { > #define IOCB_SYNC (1 << 5) > #define IOCB_WRITE (1 << 6) > #define IOCB_NOWAIT (1 << 7) > +#define IOCB_ZONE_APPEND (1 << 8) > > struct kiocb { > struct file *ki_filp; > @@ -3456,6 +3457,14 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags) > ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC); > if (flags & RWF_APPEND) > ki->ki_flags |= IOCB_APPEND; > + if (flags & RWF_ZONE_APPEND) { > + /* currently support block device only */ > + umode_t mode = file_inode(ki->ki_filp)->i_mode; > + > + if (!(S_ISBLK(mode))) > + return -EOPNOTSUPP; > + ki->ki_flags |= IOCB_ZONE_APPEND; > + } > return 0; > } > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > index 379a612..1ce06e9 100644 > --- a/include/uapi/linux/fs.h > +++ b/include/uapi/linux/fs.h > @@ -299,8 +299,11 @@ typedef int __bitwise __kernel_rwf_t; > /* per-IO O_APPEND */ > #define RWF_APPEND ((__force __kernel_rwf_t)0x00000010) > > +/* per-IO O_APPEND */ > +#define RWF_ZONE_APPEND ((__force __kernel_rwf_t)0x00000020) > + > /* mask of flags supported by the kernel */ > #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ > - RWF_APPEND) > + RWF_APPEND | RWF_ZONE_APPEND) > > #endif /* _UAPI_LINUX_FS_H */ > -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path 2020-06-26 2:50 ` Damien Le Moal @ 2020-06-29 18:32 ` Kanchan Joshi 2020-06-30 0:37 ` Damien Le Moal 0 siblings, 1 reply; 20+ messages in thread From: Kanchan Joshi @ 2020-06-29 18:32 UTC (permalink / raw) To: Damien Le Moal Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], Arnav Dawn [-- Attachment #1: Type: text/plain, Size: 1831 bytes --] On Fri, Jun 26, 2020 at 02:50:20AM +0000, Damien Le Moal wrote: >On 2020/06/26 2:18, Kanchan Joshi wrote: >> Introduce RWF_ZONE_APPEND flag to represent zone-append. User-space >> sends this with write. Add IOCB_ZONE_APPEND which is set in >> kiocb->ki_flags on receiving RWF_ZONE_APPEND. >> Make direct IO submission path use IOCB_ZONE_APPEND to send bio with >> append op. Direct IO completion returns zone-relative offset, in sector >> unit, to upper layer using kiocb->ki_complete interface. >> Report error if zone-append is requested on regular file or on sync >> kiocb (i.e. one without ki_complete). >> >> Signed-off-by: Kanchan Joshi <[email protected]> >> Signed-off-by: SelvaKumar S <[email protected]> >> Signed-off-by: Arnav Dawn <[email protected]> >> Signed-off-by: Nitesh Shetty <[email protected]> >> Signed-off-by: Javier Gonzalez <[email protected]> >> --- >> fs/block_dev.c | 28 ++++++++++++++++++++++++---- >> include/linux/fs.h | 9 +++++++++ >> include/uapi/linux/fs.h | 5 ++++- >> 3 files changed, 37 insertions(+), 5 deletions(-) >> >> diff --git a/fs/block_dev.c b/fs/block_dev.c >> index 47860e5..5180268 100644 >> --- a/fs/block_dev.c >> +++ b/fs/block_dev.c >> @@ -185,6 +185,10 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb) >> /* avoid the need for a I/O completion work item */ >> if (iocb->ki_flags & IOCB_DSYNC) >> op |= REQ_FUA; >> + >> + if (iocb->ki_flags & IOCB_ZONE_APPEND) >> + op |= REQ_OP_ZONE_APPEND; > >This is wrong. REQ_OP_WRITE is already set in the declaration of "op". How can >this work ? REQ_OP_ZONE_APPEND will override the REQ_WRITE op, while previously set op flags (REQ_FUA etc.) will be retained. But yes, this can be made to look cleaner. V3 will include the other changes you pointed out. Thanks for the review. [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path 2020-06-29 18:32 ` Kanchan Joshi @ 2020-06-30 0:37 ` Damien Le Moal 2020-06-30 7:40 ` Kanchan Joshi 0 siblings, 1 reply; 20+ messages in thread From: Damien Le Moal @ 2020-06-30 0:37 UTC (permalink / raw) To: Kanchan Joshi Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], Arnav Dawn On 2020/06/30 3:35, Kanchan Joshi wrote: > On Fri, Jun 26, 2020 at 02:50:20AM +0000, Damien Le Moal wrote: >> On 2020/06/26 2:18, Kanchan Joshi wrote: >>> Introduce RWF_ZONE_APPEND flag to represent zone-append. User-space >>> sends this with write. Add IOCB_ZONE_APPEND which is set in >>> kiocb->ki_flags on receiving RWF_ZONE_APPEND. >>> Make direct IO submission path use IOCB_ZONE_APPEND to send bio with >>> append op. Direct IO completion returns zone-relative offset, in sector >>> unit, to upper layer using kiocb->ki_complete interface. >>> Report error if zone-append is requested on regular file or on sync >>> kiocb (i.e. one without ki_complete). >>> >>> Signed-off-by: Kanchan Joshi <[email protected]> >>> Signed-off-by: SelvaKumar S <[email protected]> >>> Signed-off-by: Arnav Dawn <[email protected]> >>> Signed-off-by: Nitesh Shetty <[email protected]> >>> Signed-off-by: Javier Gonzalez <[email protected]> >>> --- >>> fs/block_dev.c | 28 ++++++++++++++++++++++++---- >>> include/linux/fs.h | 9 +++++++++ >>> include/uapi/linux/fs.h | 5 ++++- >>> 3 files changed, 37 insertions(+), 5 deletions(-) >>> >>> diff --git a/fs/block_dev.c b/fs/block_dev.c >>> index 47860e5..5180268 100644 >>> --- a/fs/block_dev.c >>> +++ b/fs/block_dev.c >>> @@ -185,6 +185,10 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb) >>> /* avoid the need for a I/O completion work item */ >>> if (iocb->ki_flags & IOCB_DSYNC) >>> op |= REQ_FUA; >>> + >>> + if (iocb->ki_flags & IOCB_ZONE_APPEND) >>> + op |= REQ_OP_ZONE_APPEND; >> >> This is wrong. REQ_OP_WRITE is already set in the declaration of "op". How can >> this work ? > REQ_OP_ZONE_APPEND will override the REQ_WRITE op, while previously set op > flags (REQ_FUA etc.) will be retained. But yes, this can be made to look > cleaner. > V3 will include the other changes you pointed out. Thanks for the review. > REQ_OP_WRITE and REQ_OP_ZONE_APPEND are different bits, so there is no "override". A well formed BIO bi_opf is one op+flags. Specifying multiple OP codes does not make sense. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path 2020-06-30 0:37 ` Damien Le Moal @ 2020-06-30 7:40 ` Kanchan Joshi 2020-06-30 7:52 ` Damien Le Moal 0 siblings, 1 reply; 20+ messages in thread From: Kanchan Joshi @ 2020-06-30 7:40 UTC (permalink / raw) To: Damien Le Moal Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], Arnav Dawn [-- Attachment #1: Type: text/plain, Size: 2494 bytes --] On Tue, Jun 30, 2020 at 12:37:07AM +0000, Damien Le Moal wrote: >On 2020/06/30 3:35, Kanchan Joshi wrote: >> On Fri, Jun 26, 2020 at 02:50:20AM +0000, Damien Le Moal wrote: >>> On 2020/06/26 2:18, Kanchan Joshi wrote: >>>> Introduce RWF_ZONE_APPEND flag to represent zone-append. User-space >>>> sends this with write. Add IOCB_ZONE_APPEND which is set in >>>> kiocb->ki_flags on receiving RWF_ZONE_APPEND. >>>> Make direct IO submission path use IOCB_ZONE_APPEND to send bio with >>>> append op. Direct IO completion returns zone-relative offset, in sector >>>> unit, to upper layer using kiocb->ki_complete interface. >>>> Report error if zone-append is requested on regular file or on sync >>>> kiocb (i.e. one without ki_complete). >>>> >>>> Signed-off-by: Kanchan Joshi <[email protected]> >>>> Signed-off-by: SelvaKumar S <[email protected]> >>>> Signed-off-by: Arnav Dawn <[email protected]> >>>> Signed-off-by: Nitesh Shetty <[email protected]> >>>> Signed-off-by: Javier Gonzalez <[email protected]> >>>> --- >>>> fs/block_dev.c | 28 ++++++++++++++++++++++++---- >>>> include/linux/fs.h | 9 +++++++++ >>>> include/uapi/linux/fs.h | 5 ++++- >>>> 3 files changed, 37 insertions(+), 5 deletions(-) >>>> >>>> diff --git a/fs/block_dev.c b/fs/block_dev.c >>>> index 47860e5..5180268 100644 >>>> --- a/fs/block_dev.c >>>> +++ b/fs/block_dev.c >>>> @@ -185,6 +185,10 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb) >>>> /* avoid the need for a I/O completion work item */ >>>> if (iocb->ki_flags & IOCB_DSYNC) >>>> op |= REQ_FUA; >>>> + >>>> + if (iocb->ki_flags & IOCB_ZONE_APPEND) >>>> + op |= REQ_OP_ZONE_APPEND; >>> >>> This is wrong. REQ_OP_WRITE is already set in the declaration of "op". How can >>> this work ? >> REQ_OP_ZONE_APPEND will override the REQ_WRITE op, while previously set op >> flags (REQ_FUA etc.) will be retained. But yes, this can be made to look >> cleaner. >> V3 will include the other changes you pointed out. Thanks for the review. >> > >REQ_OP_WRITE and REQ_OP_ZONE_APPEND are different bits, so there is no >"override". A well formed BIO bi_opf is one op+flags. Specifying multiple OP >codes does not make sense. one op+flags behavior is retained here. OP is not about bits (op flags are). Had it been, REQ_OP_WRITE (value 1) can not be differentiated from REQ_OP_ZONE_APPEND (value 13). We do not do "bio_op(bio) & REQ_OP_WRITE", rather we look at the absolute value "bio_op(bio) == REQ_OP_WRITE". [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path 2020-06-30 7:40 ` Kanchan Joshi @ 2020-06-30 7:52 ` Damien Le Moal 2020-06-30 7:56 ` Damien Le Moal 0 siblings, 1 reply; 20+ messages in thread From: Damien Le Moal @ 2020-06-30 7:52 UTC (permalink / raw) To: Kanchan Joshi Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], Arnav Dawn On 2020/06/30 16:43, Kanchan Joshi wrote: > On Tue, Jun 30, 2020 at 12:37:07AM +0000, Damien Le Moal wrote: >> On 2020/06/30 3:35, Kanchan Joshi wrote: >>> On Fri, Jun 26, 2020 at 02:50:20AM +0000, Damien Le Moal wrote: >>>> On 2020/06/26 2:18, Kanchan Joshi wrote: >>>>> Introduce RWF_ZONE_APPEND flag to represent zone-append. User-space >>>>> sends this with write. Add IOCB_ZONE_APPEND which is set in >>>>> kiocb->ki_flags on receiving RWF_ZONE_APPEND. >>>>> Make direct IO submission path use IOCB_ZONE_APPEND to send bio with >>>>> append op. Direct IO completion returns zone-relative offset, in sector >>>>> unit, to upper layer using kiocb->ki_complete interface. >>>>> Report error if zone-append is requested on regular file or on sync >>>>> kiocb (i.e. one without ki_complete). >>>>> >>>>> Signed-off-by: Kanchan Joshi <[email protected]> >>>>> Signed-off-by: SelvaKumar S <[email protected]> >>>>> Signed-off-by: Arnav Dawn <[email protected]> >>>>> Signed-off-by: Nitesh Shetty <[email protected]> >>>>> Signed-off-by: Javier Gonzalez <[email protected]> >>>>> --- >>>>> fs/block_dev.c | 28 ++++++++++++++++++++++++---- >>>>> include/linux/fs.h | 9 +++++++++ >>>>> include/uapi/linux/fs.h | 5 ++++- >>>>> 3 files changed, 37 insertions(+), 5 deletions(-) >>>>> >>>>> diff --git a/fs/block_dev.c b/fs/block_dev.c >>>>> index 47860e5..5180268 100644 >>>>> --- a/fs/block_dev.c >>>>> +++ b/fs/block_dev.c >>>>> @@ -185,6 +185,10 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb) >>>>> /* avoid the need for a I/O completion work item */ >>>>> if (iocb->ki_flags & IOCB_DSYNC) >>>>> op |= REQ_FUA; >>>>> + >>>>> + if (iocb->ki_flags & IOCB_ZONE_APPEND) >>>>> + op |= REQ_OP_ZONE_APPEND; >>>> >>>> This is wrong. REQ_OP_WRITE is already set in the declaration of "op". How can >>>> this work ? >>> REQ_OP_ZONE_APPEND will override the REQ_WRITE op, while previously set op >>> flags (REQ_FUA etc.) will be retained. But yes, this can be made to look >>> cleaner. >>> V3 will include the other changes you pointed out. Thanks for the review. >>> >> >> REQ_OP_WRITE and REQ_OP_ZONE_APPEND are different bits, so there is no >> "override". A well formed BIO bi_opf is one op+flags. Specifying multiple OP >> codes does not make sense. > > one op+flags behavior is retained here. OP is not about bits (op flags are). > Had it been, REQ_OP_WRITE (value 1) can not be differentiated from > REQ_OP_ZONE_APPEND (value 13). > We do not do "bio_op(bio) & REQ_OP_WRITE", rather we look at the > absolute value "bio_op(bio) == REQ_OP_WRITE". Sure, the ops are not bits like the flags, but (excluding the flags) doing: op |= REQ_OP_ZONE_APPEND; will give you op == (REQ_OP_WRITE | REQ_OP_ZONE_APPEND). That's not what you want... -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path 2020-06-30 7:52 ` Damien Le Moal @ 2020-06-30 7:56 ` Damien Le Moal 2020-06-30 8:16 ` Kanchan Joshi 0 siblings, 1 reply; 20+ messages in thread From: Damien Le Moal @ 2020-06-30 7:56 UTC (permalink / raw) To: Kanchan Joshi Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], Arnav Dawn On 2020/06/30 16:53, Damien Le Moal wrote: > On 2020/06/30 16:43, Kanchan Joshi wrote: >> On Tue, Jun 30, 2020 at 12:37:07AM +0000, Damien Le Moal wrote: >>> On 2020/06/30 3:35, Kanchan Joshi wrote: >>>> On Fri, Jun 26, 2020 at 02:50:20AM +0000, Damien Le Moal wrote: >>>>> On 2020/06/26 2:18, Kanchan Joshi wrote: >>>>>> Introduce RWF_ZONE_APPEND flag to represent zone-append. User-space >>>>>> sends this with write. Add IOCB_ZONE_APPEND which is set in >>>>>> kiocb->ki_flags on receiving RWF_ZONE_APPEND. >>>>>> Make direct IO submission path use IOCB_ZONE_APPEND to send bio with >>>>>> append op. Direct IO completion returns zone-relative offset, in sector >>>>>> unit, to upper layer using kiocb->ki_complete interface. >>>>>> Report error if zone-append is requested on regular file or on sync >>>>>> kiocb (i.e. one without ki_complete). >>>>>> >>>>>> Signed-off-by: Kanchan Joshi <[email protected]> >>>>>> Signed-off-by: SelvaKumar S <[email protected]> >>>>>> Signed-off-by: Arnav Dawn <[email protected]> >>>>>> Signed-off-by: Nitesh Shetty <[email protected]> >>>>>> Signed-off-by: Javier Gonzalez <[email protected]> >>>>>> --- >>>>>> fs/block_dev.c | 28 ++++++++++++++++++++++++---- >>>>>> include/linux/fs.h | 9 +++++++++ >>>>>> include/uapi/linux/fs.h | 5 ++++- >>>>>> 3 files changed, 37 insertions(+), 5 deletions(-) >>>>>> >>>>>> diff --git a/fs/block_dev.c b/fs/block_dev.c >>>>>> index 47860e5..5180268 100644 >>>>>> --- a/fs/block_dev.c >>>>>> +++ b/fs/block_dev.c >>>>>> @@ -185,6 +185,10 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb) >>>>>> /* avoid the need for a I/O completion work item */ >>>>>> if (iocb->ki_flags & IOCB_DSYNC) >>>>>> op |= REQ_FUA; >>>>>> + >>>>>> + if (iocb->ki_flags & IOCB_ZONE_APPEND) >>>>>> + op |= REQ_OP_ZONE_APPEND; >>>>> >>>>> This is wrong. REQ_OP_WRITE is already set in the declaration of "op". How can >>>>> this work ? >>>> REQ_OP_ZONE_APPEND will override the REQ_WRITE op, while previously set op >>>> flags (REQ_FUA etc.) will be retained. But yes, this can be made to look >>>> cleaner. >>>> V3 will include the other changes you pointed out. Thanks for the review. >>>> >>> >>> REQ_OP_WRITE and REQ_OP_ZONE_APPEND are different bits, so there is no >>> "override". A well formed BIO bi_opf is one op+flags. Specifying multiple OP >>> codes does not make sense. >> >> one op+flags behavior is retained here. OP is not about bits (op flags are). >> Had it been, REQ_OP_WRITE (value 1) can not be differentiated from >> REQ_OP_ZONE_APPEND (value 13). >> We do not do "bio_op(bio) & REQ_OP_WRITE", rather we look at the >> absolute value "bio_op(bio) == REQ_OP_WRITE". > > Sure, the ops are not bits like the flags, but (excluding the flags) doing: > > op |= REQ_OP_ZONE_APPEND; > > will give you op == (REQ_OP_WRITE | REQ_OP_ZONE_APPEND). That's not what you want... And yes, REQ_OP_WRITE | REQ_OP_ZONE_APPEND == REQ_OP_ZONE_APPEND... But still not a reason for not setting the op correctly :) -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path 2020-06-30 7:56 ` Damien Le Moal @ 2020-06-30 8:16 ` Kanchan Joshi 0 siblings, 0 replies; 20+ messages in thread From: Kanchan Joshi @ 2020-06-30 8:16 UTC (permalink / raw) To: Damien Le Moal Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], Arnav Dawn [-- Attachment #1: Type: text/plain, Size: 3342 bytes --] On Tue, Jun 30, 2020 at 07:56:46AM +0000, Damien Le Moal wrote: >On 2020/06/30 16:53, Damien Le Moal wrote: >> On 2020/06/30 16:43, Kanchan Joshi wrote: >>> On Tue, Jun 30, 2020 at 12:37:07AM +0000, Damien Le Moal wrote: >>>> On 2020/06/30 3:35, Kanchan Joshi wrote: >>>>> On Fri, Jun 26, 2020 at 02:50:20AM +0000, Damien Le Moal wrote: >>>>>> On 2020/06/26 2:18, Kanchan Joshi wrote: >>>>>>> Introduce RWF_ZONE_APPEND flag to represent zone-append. User-space >>>>>>> sends this with write. Add IOCB_ZONE_APPEND which is set in >>>>>>> kiocb->ki_flags on receiving RWF_ZONE_APPEND. >>>>>>> Make direct IO submission path use IOCB_ZONE_APPEND to send bio with >>>>>>> append op. Direct IO completion returns zone-relative offset, in sector >>>>>>> unit, to upper layer using kiocb->ki_complete interface. >>>>>>> Report error if zone-append is requested on regular file or on sync >>>>>>> kiocb (i.e. one without ki_complete). >>>>>>> >>>>>>> Signed-off-by: Kanchan Joshi <[email protected]> >>>>>>> Signed-off-by: SelvaKumar S <[email protected]> >>>>>>> Signed-off-by: Arnav Dawn <[email protected]> >>>>>>> Signed-off-by: Nitesh Shetty <[email protected]> >>>>>>> Signed-off-by: Javier Gonzalez <[email protected]> >>>>>>> --- >>>>>>> fs/block_dev.c | 28 ++++++++++++++++++++++++---- >>>>>>> include/linux/fs.h | 9 +++++++++ >>>>>>> include/uapi/linux/fs.h | 5 ++++- >>>>>>> 3 files changed, 37 insertions(+), 5 deletions(-) >>>>>>> >>>>>>> diff --git a/fs/block_dev.c b/fs/block_dev.c >>>>>>> index 47860e5..5180268 100644 >>>>>>> --- a/fs/block_dev.c >>>>>>> +++ b/fs/block_dev.c >>>>>>> @@ -185,6 +185,10 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb) >>>>>>> /* avoid the need for a I/O completion work item */ >>>>>>> if (iocb->ki_flags & IOCB_DSYNC) >>>>>>> op |= REQ_FUA; >>>>>>> + >>>>>>> + if (iocb->ki_flags & IOCB_ZONE_APPEND) >>>>>>> + op |= REQ_OP_ZONE_APPEND; >>>>>> >>>>>> This is wrong. REQ_OP_WRITE is already set in the declaration of "op". How can >>>>>> this work ? >>>>> REQ_OP_ZONE_APPEND will override the REQ_WRITE op, while previously set op >>>>> flags (REQ_FUA etc.) will be retained. But yes, this can be made to look >>>>> cleaner. >>>>> V3 will include the other changes you pointed out. Thanks for the review. >>>>> >>>> >>>> REQ_OP_WRITE and REQ_OP_ZONE_APPEND are different bits, so there is no >>>> "override". A well formed BIO bi_opf is one op+flags. Specifying multiple OP >>>> codes does not make sense. >>> >>> one op+flags behavior is retained here. OP is not about bits (op flags are). >>> Had it been, REQ_OP_WRITE (value 1) can not be differentiated from >>> REQ_OP_ZONE_APPEND (value 13). >>> We do not do "bio_op(bio) & REQ_OP_WRITE", rather we look at the >>> absolute value "bio_op(bio) == REQ_OP_WRITE". >> >> Sure, the ops are not bits like the flags, but (excluding the flags) doing: >> >> op |= REQ_OP_ZONE_APPEND; >> >> will give you op == (REQ_OP_WRITE | REQ_OP_ZONE_APPEND). That's not what you want... > >And yes, REQ_OP_WRITE | REQ_OP_ZONE_APPEND == REQ_OP_ZONE_APPEND... But still >not a reason for not setting the op correctly :) Right, this is what op override was about, and intent was to minimize the changes in the existing function. As said before, completely agree about changing, code should not draw suspicion. [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path 2020-06-25 17:15 ` [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path Kanchan Joshi 2020-06-26 2:50 ` Damien Le Moal @ 2020-06-26 8:58 ` Christoph Hellwig 2020-06-26 21:15 ` Kanchan Joshi 1 sibling, 1 reply; 20+ messages in thread From: Christoph Hellwig @ 2020-06-26 8:58 UTC (permalink / raw) To: Kanchan Joshi Cc: axboe, viro, bcrl, asml.silence, Damien.LeMoal, hch, linux-fsdevel, mb, linux-kernel, linux-aio, io-uring, linux-block, selvakuma.s1, nj.shetty, javier.gonz, Arnav Dawn To restate my previous NAK: A low-level protocol detail like RWF_ZONE_APPEND has absolutely no business being exposed in the Linux file system interface. And as mentioned before I think the idea of returning the actual position written for O_APPEND writes totally makes sense, and actually is generalizable to all files. Together with zonefs that gives you a perfect interface for zone append. On Thu, Jun 25, 2020 at 10:45:48PM +0530, Kanchan Joshi wrote: > Introduce RWF_ZONE_APPEND flag to represent zone-append. And no one but us select few even know what zone append is, nevermind what the detailed semantics are. If you add a userspace API you need to very clearly document the semantics inluding errors and corner cases. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path 2020-06-26 8:58 ` Christoph Hellwig @ 2020-06-26 21:15 ` Kanchan Joshi 2020-06-27 6:51 ` Christoph Hellwig 0 siblings, 1 reply; 20+ messages in thread From: Kanchan Joshi @ 2020-06-26 21:15 UTC (permalink / raw) To: Christoph Hellwig Cc: axboe, viro, bcrl, asml.silence, Damien.LeMoal, linux-fsdevel, mb, linux-kernel, linux-aio, io-uring, linux-block, selvakuma.s1, nj.shetty, javier.gonz, Arnav Dawn [-- Attachment #1: Type: text/plain, Size: 2283 bytes --] On Fri, Jun 26, 2020 at 09:58:46AM +0100, Christoph Hellwig wrote: >To restate my previous NAK: > >A low-level protocol detail like RWF_ZONE_APPEND has absolutely no >business being exposed in the Linux file system interface. > >And as mentioned before I think the idea of returning the actual >position written for O_APPEND writes totally makes sense, and actually >is generalizable to all files. Together with zonefs that gives you a >perfect interface for zone append. > >On Thu, Jun 25, 2020 at 10:45:48PM +0530, Kanchan Joshi wrote: >> Introduce RWF_ZONE_APPEND flag to represent zone-append. > >And no one but us select few even know what zone append is, nevermind >what the detailed semantics are. If you add a userspace API you need >to very clearly document the semantics inluding errors and corner cases. For block IO path (which is the scope of this patchset) there is no probelm in using RWF_APPEND for zone-append, because it does not do anything for block device. We can use that, avoiding introduction of RWF_ZONE_APPEND in user-space. In kernel, will it be fine to keep IOCB_ZONE_APPEND apart from IOCB_APPEND? Reason being, this can help to isolate the code meant only for zone-append from the one that is already present for conventional append. Snippet from quick reference - static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags) ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC); if (flags & RWF_APPEND) ki->ki_flags |= IOCB_APPEND; + if (flags & RWF_ZONE_APPEND) { + /* currently support block device only */ + umode_t mode = file_inode(ki->ki_filp)->i_mode; + + if (!(S_ISBLK(mode))) + return -EOPNOTSUPP; + ki->ki_flags |= IOCB_ZONE_APPEND; + } As for file I/O in future, I see a potential problem with RWF_APPEND. In io_uring, zone-append requires bit of pre/post processing, which ideally should be done only for zone-append case. A ZoneFS file using RWF_APPEND as a mean to invoke zone-append vs a regular file (hosted on some other FS) requiring conventional RWF_APPEND - both will execute that processing. Is there a good way to differentiate ZoneFS file from another file which only wants use conventional file-append? [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path 2020-06-26 21:15 ` Kanchan Joshi @ 2020-06-27 6:51 ` Christoph Hellwig 0 siblings, 0 replies; 20+ messages in thread From: Christoph Hellwig @ 2020-06-27 6:51 UTC (permalink / raw) To: Kanchan Joshi Cc: Christoph Hellwig, axboe, viro, bcrl, asml.silence, Damien.LeMoal, linux-fsdevel, mb, linux-kernel, linux-aio, io-uring, linux-block, selvakuma.s1, nj.shetty, javier.gonz, Arnav Dawn On Sat, Jun 27, 2020 at 02:45:14AM +0530, Kanchan Joshi wrote: > For block IO path (which is the scope of this patchset) there is no > probelm in using RWF_APPEND for zone-append, because it does not do > anything for block device. We can use that, avoiding introduction of > RWF_ZONE_APPEND in user-space. No, you are not just touching the block I/O path. This is all over the general file code, and all RWF_* flag are about file I/O. ^ permalink raw reply [flat|nested] 20+ messages in thread
[parent not found: <CGME20200625171838epcas5p449183e12770187142d8d55a9bf422a8d@epcas5p4.samsung.com>]
* [PATCH v2 2/2] io_uring: add support for zone-append [not found] ` <CGME20200625171838epcas5p449183e12770187142d8d55a9bf422a8d@epcas5p4.samsung.com> @ 2020-06-25 17:15 ` Kanchan Joshi 2020-06-25 19:40 ` Pavel Begunkov 0 siblings, 1 reply; 20+ messages in thread From: Kanchan Joshi @ 2020-06-25 17:15 UTC (permalink / raw) To: axboe, viro, bcrl Cc: asml.silence, Damien.LeMoal, hch, linux-fsdevel, mb, linux-kernel, linux-aio, io-uring, linux-block, selvakuma.s1, nj.shetty, javier.gonz, Kanchan Joshi From: Selvakumar S <[email protected]> For zone-append, block-layer will return zone-relative offset via ret2 of ki_complete interface. Make changes to collect it, and send to user-space using ceq->flags. Detect and report early error if zone-append is requested with fixed-buffers. Signed-off-by: Selvakumar S <[email protected]> Signed-off-by: Kanchan Joshi <[email protected]> Signed-off-by: Nitesh Shetty <[email protected]> Signed-off-by: Javier Gonzalez <[email protected]> --- fs/io_uring.c | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 155f3d8..31a9da58 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -402,6 +402,8 @@ struct io_rw { struct kiocb kiocb; u64 addr; u64 len; + /* zone-relative offset for append, in sectors */ + u32 append_offset; }; struct io_connect { @@ -541,6 +543,7 @@ enum { REQ_F_NO_FILE_TABLE_BIT, REQ_F_QUEUE_TIMEOUT_BIT, REQ_F_WORK_INITIALIZED_BIT, + REQ_F_ZONE_APPEND_BIT, /* not a real bit, just to check we're not overflowing the space */ __REQ_F_LAST_BIT, @@ -598,6 +601,8 @@ enum { REQ_F_QUEUE_TIMEOUT = BIT(REQ_F_QUEUE_TIMEOUT_BIT), /* io_wq_work is initialized */ REQ_F_WORK_INITIALIZED = BIT(REQ_F_WORK_INITIALIZED_BIT), + /* to return zone relative offset for zone append*/ + REQ_F_ZONE_APPEND = BIT(REQ_F_ZONE_APPEND_BIT), }; struct async_poll { @@ -1745,6 +1750,8 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, if (req->flags & REQ_F_BUFFER_SELECTED) cflags = io_put_kbuf(req); + if (req->flags & REQ_F_ZONE_APPEND) + cflags = req->rw.append_offset; __io_cqring_fill_event(req, req->result, cflags); (*nr_events)++; @@ -1943,7 +1950,7 @@ static inline void req_set_fail_links(struct io_kiocb *req) req->flags |= REQ_F_FAIL_LINK; } -static void io_complete_rw_common(struct kiocb *kiocb, long res) +static void io_complete_rw_common(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb); int cflags = 0; @@ -1953,8 +1960,14 @@ static void io_complete_rw_common(struct kiocb *kiocb, long res) if (res != req->result) req_set_fail_links(req); + if (req->flags & REQ_F_BUFFER_SELECTED) cflags = io_put_kbuf(req); + + /* use cflags to return zone append completion result */ + if (req->flags & REQ_F_ZONE_APPEND) + cflags = res2; + __io_cqring_add_event(req, res, cflags); } @@ -1962,7 +1975,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb); - io_complete_rw_common(kiocb, res); + io_complete_rw_common(kiocb, res, res2); io_put_req(req); } @@ -1975,6 +1988,9 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) if (res != req->result) req_set_fail_links(req); + if (req->flags & REQ_F_ZONE_APPEND) + req->rw.append_offset = res2; + req->result = res; if (res != -EAGAIN) WRITE_ONCE(req->iopoll_completed, 1); @@ -2127,6 +2143,9 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (kiocb->ki_flags & IOCB_NOWAIT) req->flags |= REQ_F_NOWAIT; + if (kiocb->ki_flags & IOCB_ZONE_APPEND) + req->flags |= REQ_F_ZONE_APPEND; + if (force_nonblock) kiocb->ki_flags |= IOCB_NOWAIT; @@ -2409,6 +2428,14 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, opcode = req->opcode; if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) { + /* + * fixed-buffers not supported for zone-append. + * This check can be removed when block-layer starts + * supporting append with iov_iter of bvec type + */ + if (req->flags == REQ_F_ZONE_APPEND) + return -EINVAL; + *iovec = NULL; return io_import_fixed(req, rw, iter); } @@ -2704,6 +2731,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) req->rw.kiocb.ki_flags &= ~IOCB_NOWAIT; req->result = 0; + io_size = ret; if (req->flags & REQ_F_LINK_HEAD) req->result = io_size; -- 2.7.4 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH v2 2/2] io_uring: add support for zone-append 2020-06-25 17:15 ` [PATCH v2 2/2] io_uring: add support for zone-append Kanchan Joshi @ 2020-06-25 19:40 ` Pavel Begunkov 0 siblings, 0 replies; 20+ messages in thread From: Pavel Begunkov @ 2020-06-25 19:40 UTC (permalink / raw) To: Kanchan Joshi, axboe, viro, bcrl Cc: Damien.LeMoal, hch, linux-fsdevel, mb, linux-kernel, linux-aio, io-uring, linux-block, selvakuma.s1, nj.shetty, javier.gonz On 25/06/2020 20:15, Kanchan Joshi wrote: > From: Selvakumar S <[email protected]> > > For zone-append, block-layer will return zone-relative offset via ret2 > of ki_complete interface. Make changes to collect it, and send to > user-space using ceq->flags. > Detect and report early error if zone-append is requested with > fixed-buffers. > > Signed-off-by: Selvakumar S <[email protected]> > Signed-off-by: Kanchan Joshi <[email protected]> > Signed-off-by: Nitesh Shetty <[email protected]> > Signed-off-by: Javier Gonzalez <[email protected]> > --- > fs/io_uring.c | 32 ++++++++++++++++++++++++++++++-- > 1 file changed, 30 insertions(+), 2 deletions(-) > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index 155f3d8..31a9da58 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -402,6 +402,8 @@ struct io_rw { > struct kiocb kiocb; > u64 addr; > u64 len; > + /* zone-relative offset for append, in sectors */ > + u32 append_offset; > }; > > struct io_connect { > @@ -541,6 +543,7 @@ enum { > REQ_F_NO_FILE_TABLE_BIT, > REQ_F_QUEUE_TIMEOUT_BIT, > REQ_F_WORK_INITIALIZED_BIT, > + REQ_F_ZONE_APPEND_BIT, > > /* not a real bit, just to check we're not overflowing the space */ > __REQ_F_LAST_BIT, > @@ -598,6 +601,8 @@ enum { > REQ_F_QUEUE_TIMEOUT = BIT(REQ_F_QUEUE_TIMEOUT_BIT), > /* io_wq_work is initialized */ > REQ_F_WORK_INITIALIZED = BIT(REQ_F_WORK_INITIALIZED_BIT), > + /* to return zone relative offset for zone append*/ > + REQ_F_ZONE_APPEND = BIT(REQ_F_ZONE_APPEND_BIT), Do we need a new flag? We can check for IOCB_ZONE_APPEND, flags are always close by in req->rw.kiocb.ki_flags. May require to be careful about not setting it for read, so not screwing buf select. > }; > > struct async_poll { > @@ -1745,6 +1750,8 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, > > if (req->flags & REQ_F_BUFFER_SELECTED) > cflags = io_put_kbuf(req); > + if (req->flags & REQ_F_ZONE_APPEND) > + cflags = req->rw.append_offset; > > __io_cqring_fill_event(req, req->result, cflags); > (*nr_events)++; > @@ -1943,7 +1950,7 @@ static inline void req_set_fail_links(struct io_kiocb *req) > req->flags |= REQ_F_FAIL_LINK; > } > > -static void io_complete_rw_common(struct kiocb *kiocb, long res) > +static void io_complete_rw_common(struct kiocb *kiocb, long res, long res2) > { > struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb); > int cflags = 0; > @@ -1953,8 +1960,14 @@ static void io_complete_rw_common(struct kiocb *kiocb, long res) > > if (res != req->result) > req_set_fail_links(req); > + > if (req->flags & REQ_F_BUFFER_SELECTED) > cflags = io_put_kbuf(req); > + > + /* use cflags to return zone append completion result */ > + if (req->flags & REQ_F_ZONE_APPEND) > + cflags = res2; > + > __io_cqring_add_event(req, res, cflags); > } > > @@ -1962,7 +1975,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2) > { > struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb); > > - io_complete_rw_common(kiocb, res); > + io_complete_rw_common(kiocb, res, res2); > io_put_req(req); > } > > @@ -1975,6 +1988,9 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) > > if (res != req->result) > req_set_fail_links(req); > + if (req->flags & REQ_F_ZONE_APPEND) > + req->rw.append_offset = res2; > + > req->result = res; > if (res != -EAGAIN) > WRITE_ONCE(req->iopoll_completed, 1); > @@ -2127,6 +2143,9 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, > if (kiocb->ki_flags & IOCB_NOWAIT) > req->flags |= REQ_F_NOWAIT; > > + if (kiocb->ki_flags & IOCB_ZONE_APPEND) > + req->flags |= REQ_F_ZONE_APPEND; > + > if (force_nonblock) > kiocb->ki_flags |= IOCB_NOWAIT; > > @@ -2409,6 +2428,14 @@ static ssize_t io_import_iovec(int rw, struct io_kiocb *req, > > opcode = req->opcode; > if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) { > + /* > + * fixed-buffers not supported for zone-append. > + * This check can be removed when block-layer starts > + * supporting append with iov_iter of bvec type > + */ > + if (req->flags == REQ_F_ZONE_APPEND) s/==/&/ > + return -EINVAL; > + > *iovec = NULL; > return io_import_fixed(req, rw, iter); > } > @@ -2704,6 +2731,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) > req->rw.kiocb.ki_flags &= ~IOCB_NOWAIT; > > req->result = 0; > + Extra \n > io_size = ret; > if (req->flags & REQ_F_LINK_HEAD) > req->result = io_size; > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 0/2] zone-append support in io-uring and aio 2020-06-25 17:15 ` [PATCH v2 0/2] zone-append support in io-uring and aio Kanchan Joshi [not found] ` <CGME20200625171834epcas5p226a24dfcb84cfa83fe29a2bd17795d85@epcas5p2.samsung.com> [not found] ` <CGME20200625171838epcas5p449183e12770187142d8d55a9bf422a8d@epcas5p4.samsung.com> @ 2020-06-26 3:11 ` Damien Le Moal 2020-06-26 6:37 ` javier.gonz 2020-06-26 22:15 ` Kanchan Joshi 2020-06-30 12:46 ` Matthew Wilcox 3 siblings, 2 replies; 20+ messages in thread From: Damien Le Moal @ 2020-06-26 3:11 UTC (permalink / raw) To: Kanchan Joshi, [email protected], [email protected], [email protected] Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] On 2020/06/26 2:18, Kanchan Joshi wrote: > [Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox] > > This patchset enables zone-append using io-uring/linux-aio, on block IO path. > Purpose is to provide zone-append consumption ability to applications which are > using zoned-block-device directly. > > The application may specify RWF_ZONE_APPEND flag with write when it wants to > send zone-append. RWF_* flags work with a certain subset of APIs e.g. uring, > aio, and pwritev2. An error is reported if zone-append is requested using > pwritev2. It is not in the scope of this patchset to support pwritev2 or any > other sync write API for reasons described later. > > Zone-append completion result ---> > With zone-append, where write took place can only be known after completion. > So apart from usual return value of write, additional mean is needed to obtain > the actual written location. > > In aio, this is returned to application using res2 field of io_event - > > struct io_event { > __u64 data; /* the data field from the iocb */ > __u64 obj; /* what iocb this event came from */ > __s64 res; /* result code for this event */ > __s64 res2; /* secondary result */ > }; > > In io-uring, cqe->flags is repurposed for zone-append result. > > struct io_uring_cqe { > __u64 user_data; /* sqe->data submission passed back */ > __s32 res; /* result code for this event */ > __u32 flags; > }; > > Since 32 bit flags is not sufficient, we choose to return zone-relative offset > in sector/512b units. This can cover zone-size represented by chunk_sectors. > Applications will have the trouble to combine this with zone start to know > disk-relative offset. But if more bits are obtained by pulling from res field > that too would compel application to interpret res field differently, and it > seems more painstaking than the former option. > To keep uniformity, even with aio, zone-relative offset is returned. I am really not a fan of this, to say the least. The input is byte offset, the output is 512B relative sector count... Arg... We really cannot do better than that ? At the very least, byte relative offset ? The main reason is that this is _somewhat_ acceptable for raw block device accesses since the "sector" abstraction has a clear meaning, but once we add iomap/zonefs async zone append support, we really will want to have byte unit as the interface is regular files, not block device file. We could argue that 512B sector unit is still around even for files (e.g. block counts in file stat). Bu the different unit for input and output of one operation is really ugly. This is not nice for the user. > > Append using io_uring fixed-buffer ---> > This is flagged as not-supported at the moment. Reason being, for fixed-buffer > io-uring sends iov_iter of bvec type. But current append-infra in block-layer > does not support such iov_iter. > > Block IO vs File IO ---> > For now, the user zone-append interface is supported only for zoned-block-device. > Regular files/block-devices are not supported. Regular file-system (e.g. F2FS) > will not need this anyway, because zone peculiarities are abstracted within FS. > At this point, ZoneFS also likes to use append implicitly rather than explicitly. > But if/when ZoneFS starts supporting explicit/on-demand zone-append, the check > allowing-only-block-device should be changed. Sure, but I think the interface is still a problem. I am not super happy about the 512B sector unit. Zonefs will be the only file system that will be impacted since other normal POSIX file system will not have zone append interface for users. So this is a limited problem. Still, even for raw block device files accesses, POSIX system calls use Byte unit everywhere. Let's try to use that. For aio, it is easy since res2 is unsigned long long. For io_uring, as discussed already, we can still 8 bits from the cqe res. All you need is to add a small helper function in userspace iouring.h to simplify the work of the application to get that result. > > Semantics ---> > Zone-append, by its nature, may perform write on a different location than what > was specified. It does not fit into POSIX, and trying to fit may just undermine > its benefit. It may be better to keep semantics as close to zone-append as > possible i.e. specify zone-start location, and obtain the actual-write location > post completion. Towards that goal, existing async APIs seem to fit fine. > Async APIs (uring, linux aio) do not work on implicit write-pointer and demand > explicit write offset (which is what we need for append). Neither write-pointer What do you mean by "implicit write pointer" ? Are you referring to the behavior of AIO write with a block device file open with O_APPEND ? The yes, it does not work. But that is perfectly fine for regular files, that is for zonefs. I would prefer that this paragraph simply state the semantic that is implemented first. Then explain why the choice. But first, clarify how the API works, what is allowed, what's not etc. That will also simplify reviewing the code as one can then check the code against the goal. > is taken as input, nor it is updated on completion. And there is a clear way to > get zone-append result. Zone-aware applications while using these async APIs > can be fine with, for the lack of better word, zone-append semantics itself. > > Sync APIs work with implicit write-pointer (at least few of those), and there is > no way to obtain zone-append result, making it hard for user-space zone-append. Sync API are executed under inode lock, at least for regular files. So there is absolutely no problem to use zone append. zonefs does it already. The problem is the lack of locking for block device file. > > Tests ---> > Using new interface in fio (uring and libaio engine) by extending zbd tests > for zone-append: https://github.com/axboe/fio/pull/1026 > > Changes since v1: > - No new opcodes in uring or aio. Use RWF_ZONE_APPEND flag instead. > - linux-aio changes vanish because of no new opcode > - Fixed the overflow and other issues mentioned by Damien > - Simplified uring support code, fixed the issues mentioned by Pavel > - Added error checks > > Kanchan Joshi (1): > fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path > > Selvakumar S (1): > io_uring: add support for zone-append > > fs/block_dev.c | 28 ++++++++++++++++++++++++---- > fs/io_uring.c | 32 ++++++++++++++++++++++++++++++-- > include/linux/fs.h | 9 +++++++++ > include/uapi/linux/fs.h | 5 ++++- > 4 files changed, 67 insertions(+), 7 deletions(-) > -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 0/2] zone-append support in io-uring and aio 2020-06-26 3:11 ` [PATCH v2 0/2] zone-append support in io-uring and aio Damien Le Moal @ 2020-06-26 6:37 ` javier.gonz 2020-06-26 6:56 ` Damien Le Moal 2020-06-26 22:15 ` Kanchan Joshi 1 sibling, 1 reply; 20+ messages in thread From: javier.gonz @ 2020-06-26 6:37 UTC (permalink / raw) To: Damien Le Moal Cc: Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] [-- Attachment #1: Type: text/plain, Size: 4719 bytes --] On 26.06.2020 03:11, Damien Le Moal wrote: >On 2020/06/26 2:18, Kanchan Joshi wrote: >> [Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox] >> >> This patchset enables zone-append using io-uring/linux-aio, on block IO path. >> Purpose is to provide zone-append consumption ability to applications which are >> using zoned-block-device directly. >> >> The application may specify RWF_ZONE_APPEND flag with write when it wants to >> send zone-append. RWF_* flags work with a certain subset of APIs e.g. uring, >> aio, and pwritev2. An error is reported if zone-append is requested using >> pwritev2. It is not in the scope of this patchset to support pwritev2 or any >> other sync write API for reasons described later. >> >> Zone-append completion result ---> >> With zone-append, where write took place can only be known after completion. >> So apart from usual return value of write, additional mean is needed to obtain >> the actual written location. >> >> In aio, this is returned to application using res2 field of io_event - >> >> struct io_event { >> __u64 data; /* the data field from the iocb */ >> __u64 obj; /* what iocb this event came from */ >> __s64 res; /* result code for this event */ >> __s64 res2; /* secondary result */ >> }; >> >> In io-uring, cqe->flags is repurposed for zone-append result. >> >> struct io_uring_cqe { >> __u64 user_data; /* sqe->data submission passed back */ >> __s32 res; /* result code for this event */ >> __u32 flags; >> }; >> >> Since 32 bit flags is not sufficient, we choose to return zone-relative offset >> in sector/512b units. This can cover zone-size represented by chunk_sectors. >> Applications will have the trouble to combine this with zone start to know >> disk-relative offset. But if more bits are obtained by pulling from res field >> that too would compel application to interpret res field differently, and it >> seems more painstaking than the former option. >> To keep uniformity, even with aio, zone-relative offset is returned. > >I am really not a fan of this, to say the least. The input is byte offset, the >output is 512B relative sector count... Arg... We really cannot do better than >that ? > >At the very least, byte relative offset ? The main reason is that this is >_somewhat_ acceptable for raw block device accesses since the "sector" >abstraction has a clear meaning, but once we add iomap/zonefs async zone append >support, we really will want to have byte unit as the interface is regular >files, not block device file. We could argue that 512B sector unit is still >around even for files (e.g. block counts in file stat). Bu the different unit >for input and output of one operation is really ugly. This is not nice for the user. > You can refer to the discussion with Jens, Pavel and Alex on the uring interface. With the bits we have and considering the maximun zone size supported, there is no space for a byte relative offset. We can take some bits from cqe->res, but we were afraid this is not very future-proof. Do you have a better idea? >> >> Append using io_uring fixed-buffer ---> >> This is flagged as not-supported at the moment. Reason being, for fixed-buffer >> io-uring sends iov_iter of bvec type. But current append-infra in block-layer >> does not support such iov_iter. >> >> Block IO vs File IO ---> >> For now, the user zone-append interface is supported only for zoned-block-device. >> Regular files/block-devices are not supported. Regular file-system (e.g. F2FS) >> will not need this anyway, because zone peculiarities are abstracted within FS. >> At this point, ZoneFS also likes to use append implicitly rather than explicitly. >> But if/when ZoneFS starts supporting explicit/on-demand zone-append, the check >> allowing-only-block-device should be changed. > >Sure, but I think the interface is still a problem. I am not super happy about >the 512B sector unit. Zonefs will be the only file system that will be impacted >since other normal POSIX file system will not have zone append interface for >users. So this is a limited problem. Still, even for raw block device files >accesses, POSIX system calls use Byte unit everywhere. Let's try to use that. > >For aio, it is easy since res2 is unsigned long long. For io_uring, as discussed >already, we can still 8 bits from the cqe res. All you need is to add a small >helper function in userspace iouring.h to simplify the work of the application >to get that result. Ok. See above. We can do this. Jens: Do you see this as a problem in the future? [...] Javier [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 0/2] zone-append support in io-uring and aio 2020-06-26 6:37 ` javier.gonz @ 2020-06-26 6:56 ` Damien Le Moal 2020-06-26 7:03 ` [email protected] 0 siblings, 1 reply; 20+ messages in thread From: Damien Le Moal @ 2020-06-26 6:56 UTC (permalink / raw) To: [email protected] Cc: Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] On 2020/06/26 15:37, [email protected] wrote: > On 26.06.2020 03:11, Damien Le Moal wrote: >> On 2020/06/26 2:18, Kanchan Joshi wrote: >>> [Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox] >>> >>> This patchset enables zone-append using io-uring/linux-aio, on block IO path. >>> Purpose is to provide zone-append consumption ability to applications which are >>> using zoned-block-device directly. >>> >>> The application may specify RWF_ZONE_APPEND flag with write when it wants to >>> send zone-append. RWF_* flags work with a certain subset of APIs e.g. uring, >>> aio, and pwritev2. An error is reported if zone-append is requested using >>> pwritev2. It is not in the scope of this patchset to support pwritev2 or any >>> other sync write API for reasons described later. >>> >>> Zone-append completion result ---> >>> With zone-append, where write took place can only be known after completion. >>> So apart from usual return value of write, additional mean is needed to obtain >>> the actual written location. >>> >>> In aio, this is returned to application using res2 field of io_event - >>> >>> struct io_event { >>> __u64 data; /* the data field from the iocb */ >>> __u64 obj; /* what iocb this event came from */ >>> __s64 res; /* result code for this event */ >>> __s64 res2; /* secondary result */ >>> }; >>> >>> In io-uring, cqe->flags is repurposed for zone-append result. >>> >>> struct io_uring_cqe { >>> __u64 user_data; /* sqe->data submission passed back */ >>> __s32 res; /* result code for this event */ >>> __u32 flags; >>> }; >>> >>> Since 32 bit flags is not sufficient, we choose to return zone-relative offset >>> in sector/512b units. This can cover zone-size represented by chunk_sectors. >>> Applications will have the trouble to combine this with zone start to know >>> disk-relative offset. But if more bits are obtained by pulling from res field >>> that too would compel application to interpret res field differently, and it >>> seems more painstaking than the former option. >>> To keep uniformity, even with aio, zone-relative offset is returned. >> >> I am really not a fan of this, to say the least. The input is byte offset, the >> output is 512B relative sector count... Arg... We really cannot do better than >> that ? >> >> At the very least, byte relative offset ? The main reason is that this is >> _somewhat_ acceptable for raw block device accesses since the "sector" >> abstraction has a clear meaning, but once we add iomap/zonefs async zone append >> support, we really will want to have byte unit as the interface is regular >> files, not block device file. We could argue that 512B sector unit is still >> around even for files (e.g. block counts in file stat). Bu the different unit >> for input and output of one operation is really ugly. This is not nice for the user. >> > > You can refer to the discussion with Jens, Pavel and Alex on the uring > interface. With the bits we have and considering the maximun zone size > supported, there is no space for a byte relative offset. We can take > some bits from cqe->res, but we were afraid this is not very > future-proof. Do you have a better idea? If you can take 8 bits, that gives you 40 bits, enough to support byte relative offsets for any zone size defined as a number of 512B sectors using an unsigned int. Max zone size is 2^31 sectors in that case, so 2^40 bytes. Unless I am already too tired and my math is failing me... zone size is defined by chunk_sectors, which is used for raid and software raids too. This has been an unsigned int forever. I do not see the need for changing this to a 64bit anytime soon, if ever. A raid with a stripe size larger than 1TB does not really make any sense. Same for zone size... > > >>> >>> Append using io_uring fixed-buffer ---> >>> This is flagged as not-supported at the moment. Reason being, for fixed-buffer >>> io-uring sends iov_iter of bvec type. But current append-infra in block-layer >>> does not support such iov_iter. >>> >>> Block IO vs File IO ---> >>> For now, the user zone-append interface is supported only for zoned-block-device. >>> Regular files/block-devices are not supported. Regular file-system (e.g. F2FS) >>> will not need this anyway, because zone peculiarities are abstracted within FS. >>> At this point, ZoneFS also likes to use append implicitly rather than explicitly. >>> But if/when ZoneFS starts supporting explicit/on-demand zone-append, the check >>> allowing-only-block-device should be changed. >> >> Sure, but I think the interface is still a problem. I am not super happy about >> the 512B sector unit. Zonefs will be the only file system that will be impacted >> since other normal POSIX file system will not have zone append interface for >> users. So this is a limited problem. Still, even for raw block device files >> accesses, POSIX system calls use Byte unit everywhere. Let's try to use that. >> >> For aio, it is easy since res2 is unsigned long long. For io_uring, as discussed >> already, we can still 8 bits from the cqe res. All you need is to add a small >> helper function in userspace iouring.h to simplify the work of the application >> to get that result. > > Ok. See above. We can do this. > > Jens: Do you see this as a problem in the future? > > [...] > > Javier > -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 0/2] zone-append support in io-uring and aio 2020-06-26 6:56 ` Damien Le Moal @ 2020-06-26 7:03 ` [email protected] 0 siblings, 0 replies; 20+ messages in thread From: [email protected] @ 2020-06-26 7:03 UTC (permalink / raw) To: Damien Le Moal Cc: Kanchan Joshi, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] On 26.06.2020 06:56, Damien Le Moal wrote: >On 2020/06/26 15:37, [email protected] wrote: >> On 26.06.2020 03:11, Damien Le Moal wrote: >>> On 2020/06/26 2:18, Kanchan Joshi wrote: >>>> [Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox] >>>> >>>> This patchset enables zone-append using io-uring/linux-aio, on block IO path. >>>> Purpose is to provide zone-append consumption ability to applications which are >>>> using zoned-block-device directly. >>>> >>>> The application may specify RWF_ZONE_APPEND flag with write when it wants to >>>> send zone-append. RWF_* flags work with a certain subset of APIs e.g. uring, >>>> aio, and pwritev2. An error is reported if zone-append is requested using >>>> pwritev2. It is not in the scope of this patchset to support pwritev2 or any >>>> other sync write API for reasons described later. >>>> >>>> Zone-append completion result ---> >>>> With zone-append, where write took place can only be known after completion. >>>> So apart from usual return value of write, additional mean is needed to obtain >>>> the actual written location. >>>> >>>> In aio, this is returned to application using res2 field of io_event - >>>> >>>> struct io_event { >>>> __u64 data; /* the data field from the iocb */ >>>> __u64 obj; /* what iocb this event came from */ >>>> __s64 res; /* result code for this event */ >>>> __s64 res2; /* secondary result */ >>>> }; >>>> >>>> In io-uring, cqe->flags is repurposed for zone-append result. >>>> >>>> struct io_uring_cqe { >>>> __u64 user_data; /* sqe->data submission passed back */ >>>> __s32 res; /* result code for this event */ >>>> __u32 flags; >>>> }; >>>> >>>> Since 32 bit flags is not sufficient, we choose to return zone-relative offset >>>> in sector/512b units. This can cover zone-size represented by chunk_sectors. >>>> Applications will have the trouble to combine this with zone start to know >>>> disk-relative offset. But if more bits are obtained by pulling from res field >>>> that too would compel application to interpret res field differently, and it >>>> seems more painstaking than the former option. >>>> To keep uniformity, even with aio, zone-relative offset is returned. >>> >>> I am really not a fan of this, to say the least. The input is byte offset, the >>> output is 512B relative sector count... Arg... We really cannot do better than >>> that ? >>> >>> At the very least, byte relative offset ? The main reason is that this is >>> _somewhat_ acceptable for raw block device accesses since the "sector" >>> abstraction has a clear meaning, but once we add iomap/zonefs async zone append >>> support, we really will want to have byte unit as the interface is regular >>> files, not block device file. We could argue that 512B sector unit is still >>> around even for files (e.g. block counts in file stat). Bu the different unit >>> for input and output of one operation is really ugly. This is not nice for the user. >>> >> >> You can refer to the discussion with Jens, Pavel and Alex on the uring >> interface. With the bits we have and considering the maximun zone size >> supported, there is no space for a byte relative offset. We can take >> some bits from cqe->res, but we were afraid this is not very >> future-proof. Do you have a better idea? > >If you can take 8 bits, that gives you 40 bits, enough to support byte relative >offsets for any zone size defined as a number of 512B sectors using an unsigned >int. Max zone size is 2^31 sectors in that case, so 2^40 bytes. Unless I am >already too tired and my math is failing me... Yes, the match is correct. I was thinking more of the bits being needed for other use-case that could collide with append. We considered this and discard it for being messy - when Pavel brought up the 512B alignment we saw it as a good alternative. Note too that we would be able to translate to a byte offset in iouring.h too so the user would not need to think of this. I do not feel strongly on this, so the one that better fits the current and near-future for uring, that is the one we will send on V3. Will give it until next week for others to comment too. > >zone size is defined by chunk_sectors, which is used for raid and software raids >too. This has been an unsigned int forever. I do not see the need for changing >this to a 64bit anytime soon, if ever. A raid with a stripe size larger than 1TB >does not really make any sense. Same for zone size... Yes. I think already max zone sizes are pretty huge. But yes, this might change, so we will take it when it happens. [...] Javier ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 0/2] zone-append support in io-uring and aio 2020-06-26 3:11 ` [PATCH v2 0/2] zone-append support in io-uring and aio Damien Le Moal 2020-06-26 6:37 ` javier.gonz @ 2020-06-26 22:15 ` Kanchan Joshi 1 sibling, 0 replies; 20+ messages in thread From: Kanchan Joshi @ 2020-06-26 22:15 UTC (permalink / raw) To: Damien Le Moal Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected] [-- Attachment #1: Type: text/plain, Size: 3497 bytes --] On Fri, Jun 26, 2020 at 03:11:55AM +0000, Damien Le Moal wrote: >On 2020/06/26 2:18, Kanchan Joshi wrote: >> Semantics ---> >> Zone-append, by its nature, may perform write on a different location than what >> was specified. It does not fit into POSIX, and trying to fit may just undermine >> its benefit. It may be better to keep semantics as close to zone-append as >> possible i.e. specify zone-start location, and obtain the actual-write location >> post completion. Towards that goal, existing async APIs seem to fit fine. >> Async APIs (uring, linux aio) do not work on implicit write-pointer and demand >> explicit write offset (which is what we need for append). Neither write-pointer > >What do you mean by "implicit write pointer" ? Are you referring to the behavior >of AIO write with a block device file open with O_APPEND ? The yes, it does not >work. But that is perfectly fine for regular files, that is for zonefs. Sorry, I meant file pointer. Yes, block-device opened with O_APPEND does not increase the file-pointer to end-of-device. That said, for uring and aio, file-pointer position plays no role, and it is application responsibility to pass the right write location. >I would prefer that this paragraph simply state the semantic that is implemented >first. Then explain why the choice. But first, clarify how the API works, what >is allowed, what's not etc. That will also simplify reviewing the code as one >can then check the code against the goal. In this path (block IO) there is hardly any scope/attempt to abstract away anything. So raw zoned-storage rule/semantics apply. I expect zone-aware applications, which are already aware of rules, to be consumer of this. >> is taken as input, nor it is updated on completion. And there is a clear way to >> get zone-append result. Zone-aware applications while using these async APIs >> can be fine with, for the lack of better word, zone-append semantics itself. >> >> Sync APIs work with implicit write-pointer (at least few of those), and there is >> no way to obtain zone-append result, making it hard for user-space zone-append. > >Sync API are executed under inode lock, at least for regular files. So there is >absolutely no problem to use zone append. zonefs does it already. The problem is >the lack of locking for block device file. Yes. I was refering to the problem of returning actual write-location using sync APIs like write, pwrite, pwritev/v2. >> >> Tests ---> >> Using new interface in fio (uring and libaio engine) by extending zbd tests >> for zone-append: https://protect2.fireeye.com/url?k=e21dd5e0-bf837b7a-e21c5eaf-0cc47a336fae-c982437ed1be6cc8&q=1&u=https%3A%2F%2Fgithub.com%2Faxboe%2Ffio%2Fpull%2F1026 >> >> Changes since v1: >> - No new opcodes in uring or aio. Use RWF_ZONE_APPEND flag instead. >> - linux-aio changes vanish because of no new opcode >> - Fixed the overflow and other issues mentioned by Damien >> - Simplified uring support code, fixed the issues mentioned by Pavel >> - Added error checks >> >> Kanchan Joshi (1): >> fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path >> >> Selvakumar S (1): >> io_uring: add support for zone-append >> >> fs/block_dev.c | 28 ++++++++++++++++++++++++---- >> fs/io_uring.c | 32 ++++++++++++++++++++++++++++++-- >> include/linux/fs.h | 9 +++++++++ >> include/uapi/linux/fs.h | 5 ++++- >> 4 files changed, 67 insertions(+), 7 deletions(-) >> > > >-- >Damien Le Moal >Western Digital Research > [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 0/2] zone-append support in io-uring and aio 2020-06-25 17:15 ` [PATCH v2 0/2] zone-append support in io-uring and aio Kanchan Joshi ` (2 preceding siblings ...) 2020-06-26 3:11 ` [PATCH v2 0/2] zone-append support in io-uring and aio Damien Le Moal @ 2020-06-30 12:46 ` Matthew Wilcox 3 siblings, 0 replies; 20+ messages in thread From: Matthew Wilcox @ 2020-06-30 12:46 UTC (permalink / raw) To: Kanchan Joshi Cc: axboe, viro, bcrl, asml.silence, Damien.LeMoal, hch, linux-fsdevel, mb, linux-kernel, linux-aio, io-uring, linux-block, selvakuma.s1, nj.shetty, javier.gonz On Thu, Jun 25, 2020 at 10:45:47PM +0530, Kanchan Joshi wrote: > Zone-append completion result ---> > With zone-append, where write took place can only be known after completion. > So apart from usual return value of write, additional mean is needed to obtain > the actual written location. > > In aio, this is returned to application using res2 field of io_event - > > struct io_event { > __u64 data; /* the data field from the iocb */ > __u64 obj; /* what iocb this event came from */ > __s64 res; /* result code for this event */ > __s64 res2; /* secondary result */ > }; Ah, now I understand. I think you're being a little too specific by calling this zone-append. This is really a "write-anywhere" operation, and the specified address is only a hint. > In io-uring, cqe->flags is repurposed for zone-append result. > > struct io_uring_cqe { > __u64 user_data; /* sqe->data submission passed back */ > __s32 res; /* result code for this event */ > __u32 flags; > }; > > Since 32 bit flags is not sufficient, we choose to return zone-relative offset > in sector/512b units. This can cover zone-size represented by chunk_sectors. > Applications will have the trouble to combine this with zone start to know > disk-relative offset. But if more bits are obtained by pulling from res field > that too would compel application to interpret res field differently, and it > seems more painstaking than the former option. > To keep uniformity, even with aio, zone-relative offset is returned. Urgh, no, that's dreadful. I'm not familiar with the io_uring code. Maybe the first 8 bytes of the user_data could be required to be the result offset for this submission type? > Block IO vs File IO ---> > For now, the user zone-append interface is supported only for zoned-block-device. > Regular files/block-devices are not supported. Regular file-system (e.g. F2FS) > will not need this anyway, because zone peculiarities are abstracted within FS. > At this point, ZoneFS also likes to use append implicitly rather than explicitly. > But if/when ZoneFS starts supporting explicit/on-demand zone-append, the check > allowing-only-block-device should be changed. But we also have O_APPEND files. And maybe we'll have other kinds of file in future for which this would make sense. > Semantics ---> > Zone-append, by its nature, may perform write on a different location than what > was specified. It does not fit into POSIX, and trying to fit may just undermine ... I disagree that it doesn't fit into POSIX. As I said above, O_APPEND is a POSIX concept, so POSIX already understands that writes may not end up at the current write pointer. > its benefit. It may be better to keep semantics as close to zone-append as > possible i.e. specify zone-start location, and obtain the actual-write location > post completion. Towards that goal, existing async APIs seem to fit fine. > Async APIs (uring, linux aio) do not work on implicit write-pointer and demand > explicit write offset (which is what we need for append). Neither write-pointer > is taken as input, nor it is updated on completion. And there is a clear way to > get zone-append result. Zone-aware applications while using these async APIs > can be fine with, for the lack of better word, zone-append semantics itself. > > Sync APIs work with implicit write-pointer (at least few of those), and there is > no way to obtain zone-append result, making it hard for user-space zone-append. ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2020-06-30 12:46 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <CGME20200625171829epcas5p268486a0780571edb4999fc7b3caab602@epcas5p2.samsung.com>
2020-06-25 17:15 ` [PATCH v2 0/2] zone-append support in io-uring and aio Kanchan Joshi
[not found] ` <CGME20200625171834epcas5p226a24dfcb84cfa83fe29a2bd17795d85@epcas5p2.samsung.com>
2020-06-25 17:15 ` [PATCH v2 1/2] fs,block: Introduce RWF_ZONE_APPEND and handling in direct IO path Kanchan Joshi
2020-06-26 2:50 ` Damien Le Moal
2020-06-29 18:32 ` Kanchan Joshi
2020-06-30 0:37 ` Damien Le Moal
2020-06-30 7:40 ` Kanchan Joshi
2020-06-30 7:52 ` Damien Le Moal
2020-06-30 7:56 ` Damien Le Moal
2020-06-30 8:16 ` Kanchan Joshi
2020-06-26 8:58 ` Christoph Hellwig
2020-06-26 21:15 ` Kanchan Joshi
2020-06-27 6:51 ` Christoph Hellwig
[not found] ` <CGME20200625171838epcas5p449183e12770187142d8d55a9bf422a8d@epcas5p4.samsung.com>
2020-06-25 17:15 ` [PATCH v2 2/2] io_uring: add support for zone-append Kanchan Joshi
2020-06-25 19:40 ` Pavel Begunkov
2020-06-26 3:11 ` [PATCH v2 0/2] zone-append support in io-uring and aio Damien Le Moal
2020-06-26 6:37 ` javier.gonz
2020-06-26 6:56 ` Damien Le Moal
2020-06-26 7:03 ` [email protected]
2020-06-26 22:15 ` Kanchan Joshi
2020-06-30 12:46 ` Matthew Wilcox
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox