* madvise/fadvise 32-bit length

From: Stefan @ 2024-06-01  9:43 UTC
To: io-uring

io_uring uses the __u32 len field in order to pass the length to
madvise and fadvise, but these calls use an off_t, which is 64-bit on
64-bit platforms.

When using liburing, the length is silently truncated to 32 bits (so
an 8GB length would become zero, which fadvise instead interprets as
"until the end of the file").

If my understanding is correct, we could fix this by introducing new
operations MADVISE64 and FADVISE64, which use the addr3 field instead
of the len field for the length.
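[A minimal sketch of how the truncation shows up to a liburing user. This is
illustrative only and not part of the thread; the file name and size are
placeholders, and error handling is omitted.]

```c
/* Sketch: io_uring_prep_fadvise() takes an off_t, but the value is stored in
 * the 32-bit sqe->len field, so an 8 GiB length wraps to zero and the kernel
 * applies the advice to the whole file instead of the requested range. */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	off_t len = 8ULL * 1024 * 1024 * 1024;	/* 8 GiB */
	int fd = open("some-large-file", O_RDONLY);

	io_uring_queue_init(8, &ring, 0);
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_fadvise(sqe, fd, 0, len, POSIX_FADV_DONTNEED);
	/* sqe->len is __u32, so the kernel will see a length of 0 here */
	printf("requested len %lld, sqe->len %u\n", (long long) len, sqe->len);

	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}
```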
* Re: madvise/fadvise 32-bit length

From: Jens Axboe @ 2024-06-01 14:19 UTC
To: Stefan, io-uring

On 6/1/24 3:43 AM, Stefan wrote:
> io_uring uses the __u32 len field in order to pass the length to
> madvise and fadvise, but these calls use an off_t, which is 64-bit on
> 64-bit platforms.
>
> When using liburing, the length is silently truncated to 32 bits (so
> an 8GB length would become zero, which fadvise instead interprets as
> "until the end of the file").
>
> If my understanding is correct, we could fix this by introducing new
> operations MADVISE64 and FADVISE64, which use the addr3 field instead
> of the len field for the length.

We probably just want to introduce a flag and ensure that older stable
kernels check it, and then use a 64-bit field for it when the flag is
set.

-- 
Jens Axboe
* Re: madvise/fadvise 32-bit length

From: Jens Axboe @ 2024-06-01 15:05 UTC
To: Stefan, io-uring

On 6/1/24 8:19 AM, Jens Axboe wrote:
> On 6/1/24 3:43 AM, Stefan wrote:
>> io_uring uses the __u32 len field in order to pass the length to
>> madvise and fadvise, but these calls use an off_t, which is 64-bit on
>> 64-bit platforms.
>>
>> When using liburing, the length is silently truncated to 32 bits (so
>> an 8GB length would become zero, which fadvise instead interprets as
>> "until the end of the file").
>>
>> If my understanding is correct, we could fix this by introducing new
>> operations MADVISE64 and FADVISE64, which use the addr3 field instead
>> of the len field for the length.
>
> We probably just want to introduce a flag and ensure that older stable
> kernels check it, and then use a 64-bit field for it when the flag is
> set.

I think this should do it on the kernel side, as we already check these
fields and return -EINVAL as needed. Should also be trivial to backport.
Totally untested... Might want a FEAT flag for this, or something where
it's detectable, to make the liburing change straightforward.


diff --git a/io_uring/advise.c b/io_uring/advise.c
index 7085804c513c..cb7b881665e5 100644
--- a/io_uring/advise.c
+++ b/io_uring/advise.c
@@ -17,14 +17,14 @@
 struct io_fadvise {
 	struct file			*file;
 	u64				offset;
-	u32				len;
+	u64				len;
 	u32				advice;
 };
 
 struct io_madvise {
 	struct file			*file;
 	u64				addr;
-	u32				len;
+	u64				len;
 	u32				advice;
 };
 
@@ -33,11 +33,13 @@ int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU)
 	struct io_madvise *ma = io_kiocb_to_cmd(req, struct io_madvise);
 
-	if (sqe->buf_index || sqe->off || sqe->splice_fd_in)
+	if (sqe->buf_index || sqe->splice_fd_in)
 		return -EINVAL;
 
 	ma->addr = READ_ONCE(sqe->addr);
-	ma->len = READ_ONCE(sqe->len);
+	ma->len = READ_ONCE(sqe->off);
+	if (!ma->len)
+		ma->len = READ_ONCE(sqe->len);
 	ma->advice = READ_ONCE(sqe->fadvise_advice);
 	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
@@ -78,11 +80,13 @@ int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_fadvise *fa = io_kiocb_to_cmd(req, struct io_fadvise);
 
-	if (sqe->buf_index || sqe->addr || sqe->splice_fd_in)
+	if (sqe->buf_index || sqe->splice_fd_in)
 		return -EINVAL;
 
 	fa->offset = READ_ONCE(sqe->off);
-	fa->len = READ_ONCE(sqe->len);
+	fa->len = READ_ONCE(sqe->addr);
+	if (!fa->len)
+		fa->len = READ_ONCE(sqe->len);
 	fa->advice = READ_ONCE(sqe->fadvise_advice);
 	if (io_fadvise_force_async(fa))
 		req->flags |= REQ_F_FORCE_ASYNC;

-- 
Jens Axboe
* Re: madvise/fadvise 32-bit length

From: Stefan @ 2024-06-01 15:22 UTC
To: Jens Axboe, io-uring

On 1/6/2024 17:05, Jens Axboe wrote:
> On 6/1/24 8:19 AM, Jens Axboe wrote:
>> On 6/1/24 3:43 AM, Stefan wrote:
>>> [...]
>>
>> We probably just want to introduce a flag and ensure that older stable
>> kernels check it, and then use a 64-bit field for it when the flag is
>> set.
>
> I think this should do it on the kernel side, as we already check these
> fields and return -EINVAL as needed. Should also be trivial to backport.
> Totally untested... Might want a FEAT flag for this, or something where
> it's detectable, to make the liburing change straightforward.
>
> [...]

If we want to have the length in the same field in both *ADVISE
operations, we can put a flag in splice_fd_in/optlen.

Maybe the explicit flag is a bit clearer for users of the API
compared to the implicit flag when setting sqe->len to zero?

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 994bf7af0efe..70794ac1471e 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -415,6 +415,14 @@ enum io_uring_msg_ring_flags {
  */
 #define IORING_NOP_INJECT_RESULT	(1U << 0)
 
+/*
+ * IORING_OP_FADVISE and IORING_OP_MADVISE flags (stored in sqe->optlen)
+ *
+ * IORING_ADVISE_LEN64	Use 64-bit length stored in sqe->addr3
+ *
+ */
+#define IORING_ADVISE_LEN64	(1U << 0)
+
 /*
  * IO completion data structure (Completion Queue Entry)
  */
diff --git a/io_uring/advise.c b/io_uring/advise.c
index 7085804c513c..f229c751adbc 100644
--- a/io_uring/advise.c
+++ b/io_uring/advise.c
@@ -17,14 +17,14 @@
 struct io_fadvise {
 	struct file			*file;
 	u64				offset;
-	u32				len;
+	u64				len;
 	u32				advice;
 };
 
 struct io_madvise {
 	struct file			*file;
 	u64				addr;
-	u32				len;
+	u64				len;
 	u32				advice;
 };
 
@@ -32,12 +32,26 @@ int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU)
 	struct io_madvise *ma = io_kiocb_to_cmd(req, struct io_madvise);
+	u32 flags;
 
-	if (sqe->buf_index || sqe->off || sqe->splice_fd_in)
+	if (sqe->buf_index || sqe->off)
 		return -EINVAL;
 
+	flags = READ_ONCE(sqe->optlen);
+
+	if (flags & ~IORING_ADVISE_LEN64)
+		return -EINVAL;
+
+	if (flags & IORING_ADVISE_LEN64) {
+		if (sqe->len)
+			return -EINVAL;
+
+		ma->len = READ_ONCE(sqe->addr3);
+	} else {
+		ma->len = READ_ONCE(sqe->len);
+	}
+
 	ma->addr = READ_ONCE(sqe->addr);
-	ma->len = READ_ONCE(sqe->len);
 	ma->advice = READ_ONCE(sqe->fadvise_advice);
 	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
@@ -77,12 +91,26 @@ static bool io_fadvise_force_async(struct io_fadvise *fa)
 int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_fadvise *fa = io_kiocb_to_cmd(req, struct io_fadvise);
+	u32 flags;
 
-	if (sqe->buf_index || sqe->addr || sqe->splice_fd_in)
+	if (sqe->buf_index || sqe->addr)
 		return -EINVAL;
 
+	flags = READ_ONCE(sqe->optlen);
+
+	if (flags & ~IORING_ADVISE_LEN64)
+		return -EINVAL;
+
+	if (flags & IORING_ADVISE_LEN64) {
+		if (sqe->len)
+			return -EINVAL;
+
+		fa->len = READ_ONCE(sqe->addr3);
+	} else {
+		fa->len = READ_ONCE(sqe->len);
+	}
+
 	fa->offset = READ_ONCE(sqe->off);
-	fa->len = READ_ONCE(sqe->len);
 	fa->advice = READ_ONCE(sqe->fadvise_advice);
 	if (io_fadvise_force_async(fa))
 		req->flags |= REQ_F_FORCE_ASYNC;
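[A sketch of how a raw (non-liburing) submitter would drive this flag-based
variant if it were merged. IORING_ADVISE_LEN64, sqe->optlen and sqe->addr3 are
taken from the patch above; nothing here exists in released kernels.]

```c
/* Sketch: fill an FADVISE SQE by hand for the proposed IORING_ADVISE_LEN64
 * variant. The length goes in addr3; len must stay 0 or the kernel would
 * return -EINVAL per the patch above. */
#include <liburing.h>
#include <string.h>

static void prep_fadvise_len64(struct io_uring_sqe *sqe, int fd, __u64 offset,
			       __u64 len, int advice)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_FADVISE;
	sqe->fd = fd;
	sqe->off = offset;
	sqe->addr3 = len;			/* full 64-bit length */
	sqe->optlen = IORING_ADVISE_LEN64;	/* use addr3 instead of len */
	sqe->fadvise_advice = (__u32) advice;
}
```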
* Re: madvise/fadvise 32-bit length

From: Jens Axboe @ 2024-06-01 15:35 UTC
To: Stefan, io-uring

[-- Attachment #1: Type: text/plain, Size: 4098 bytes --]

On 6/1/24 9:22 AM, Stefan wrote:
> On 1/6/2024 17:05, Jens Axboe wrote:
>> [...]
>
> If we want to have the length in the same field in both *ADVISE
> operations, we can put a flag in splice_fd_in/optlen.

I don't think that part matters that much.

> Maybe the explicit flag is a bit clearer for users of the API
> compared to the implicit flag when setting sqe->len to zero?

We could go either way. The unused fields returning -EINVAL if set right
now can serve as the flag field - if you have it set, then that is your
length. If not, then the old style is the length.
That's the approach I
took, rather than add an explicit flag to it. Existing users that would
set the 64-bit length fields would get -EINVAL already. And since the
normal flags field is already used for advice flags, I'd prefer just
using the existing 64-bit zero fields for it rather than add a flag in
an odd location. Would also make for an easier backport to stable.

But don't feel that strongly about that part.

Attached kernel patch with FEAT added, and liburing patch with 64
versions added.

-- 
Jens Axboe

[-- Attachment #2: kpatch.txt --]
[-- Type: text/plain, Size: 2561 bytes --]

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 994bf7af0efe..73a8baba3428 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -540,6 +540,7 @@ struct io_uring_params {
 #define IORING_FEAT_LINKED_FILE		(1U << 12)
 #define IORING_FEAT_REG_REG_RING	(1U << 13)
 #define IORING_FEAT_RECVSEND_BUNDLE	(1U << 14)
+#define IORING_FEAT_64BIT_ADVISE	(1U << 15)
 
 /*
  * io_uring_register(2) opcodes and arguments
diff --git a/io_uring/advise.c b/io_uring/advise.c
index 7085804c513c..cb7b881665e5 100644
--- a/io_uring/advise.c
+++ b/io_uring/advise.c
@@ -17,14 +17,14 @@
 struct io_fadvise {
 	struct file			*file;
 	u64				offset;
-	u32				len;
+	u64				len;
 	u32				advice;
 };
 
 struct io_madvise {
 	struct file			*file;
 	u64				addr;
-	u32				len;
+	u64				len;
 	u32				advice;
 };
 
@@ -33,11 +33,13 @@ int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU)
 	struct io_madvise *ma = io_kiocb_to_cmd(req, struct io_madvise);
 
-	if (sqe->buf_index || sqe->off || sqe->splice_fd_in)
+	if (sqe->buf_index || sqe->splice_fd_in)
 		return -EINVAL;
 
 	ma->addr = READ_ONCE(sqe->addr);
-	ma->len = READ_ONCE(sqe->len);
+	ma->len = READ_ONCE(sqe->off);
+	if (!ma->len)
+		ma->len = READ_ONCE(sqe->len);
 	ma->advice = READ_ONCE(sqe->fadvise_advice);
 	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
@@ -78,11 +80,13 @@ int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_fadvise *fa = io_kiocb_to_cmd(req, struct io_fadvise);
 
-	if (sqe->buf_index || sqe->addr || sqe->splice_fd_in)
+	if (sqe->buf_index || sqe->splice_fd_in)
 		return -EINVAL;
 
 	fa->offset = READ_ONCE(sqe->off);
-	fa->len = READ_ONCE(sqe->len);
+	fa->len = READ_ONCE(sqe->addr);
+	if (!fa->len)
+		fa->len = READ_ONCE(sqe->len);
 	fa->advice = READ_ONCE(sqe->fadvise_advice);
 	if (io_fadvise_force_async(fa))
 		req->flags |= REQ_F_FORCE_ASYNC;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 816e93e7f949..76896db7390d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3569,7 +3569,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 			IORING_FEAT_EXT_ARG | IORING_FEAT_NATIVE_WORKERS |
 			IORING_FEAT_RSRC_TAGS | IORING_FEAT_CQE_SKIP |
 			IORING_FEAT_LINKED_FILE | IORING_FEAT_REG_REG_RING |
-			IORING_FEAT_RECVSEND_BUNDLE;
+			IORING_FEAT_RECVSEND_BUNDLE | IORING_FEAT_64BIT_ADVISE;
 
 	if (copy_to_user(params, p, sizeof(*p))) {
 		ret = -EFAULT;

[-- Attachment #3: liburing.txt --]
[-- Type: text/plain, Size: 1687 bytes --]

diff --git a/src/include/liburing.h b/src/include/liburing.h
index 0a02364540a8..6f6d9db2f08a 100644
--- a/src/include/liburing.h
+++ b/src/include/liburing.h
@@ -764,6 +764,21 @@ IOURINGINLINE void io_uring_prep_madvise(struct io_uring_sqe *sqe, void *addr,
 	sqe->fadvise_advice = (__u32) advice;
 }
 
+IOURINGINLINE void io_uring_prep_fadvise64(struct io_uring_sqe *sqe, int fd,
+					   __u64 offset, off_t len, int advice)
+{
+	io_uring_prep_rw(IORING_OP_FADVISE, sqe, fd, NULL, 0, offset);
+	sqe->addr = len;
+	sqe->fadvise_advice = (__u32) advice;
+}
+
+IOURINGINLINE void io_uring_prep_madvise64(struct io_uring_sqe *sqe, void *addr,
+					   off_t length, int advice)
+{
+	io_uring_prep_rw(IORING_OP_MADVISE, sqe, -1, addr, 0, length);
+	sqe->fadvise_advice = (__u32) advice;
+}
+
 IOURINGINLINE void io_uring_prep_send(struct io_uring_sqe *sqe, int sockfd,
 				      const void *buf, size_t len, int flags)
 {
diff --git a/src/include/liburing/io_uring.h b/src/include/liburing/io_uring.h
index 9330733efd66..12fbfcf3e689 100644
--- a/src/include/liburing/io_uring.h
+++ b/src/include/liburing/io_uring.h
@@ -537,6 +537,7 @@ struct io_uring_params {
 #define IORING_FEAT_LINKED_FILE		(1U << 12)
 #define IORING_FEAT_REG_REG_RING	(1U << 13)
 #define IORING_FEAT_RECVSEND_BUNDLE	(1U << 14)
+#define IORING_FEAT_64BIT_ADVISE	(1U << 15)
 
 /*
  * io_uring_register(2) opcodes and arguments
diff --git a/src/liburing-ffi.map b/src/liburing-ffi.map
index 3be48d02ac86..0e4bd9d1d78d 100644
--- a/src/liburing-ffi.map
+++ b/src/liburing-ffi.map
@@ -199,4 +199,6 @@ LIBURING_2.6 {
 } LIBURING_2.5;
 
 LIBURING_2.7 {
+	io_uring_prep_fadvise64;
+	io_uring_prep_madvise64;
 } LIBURING_2.6;
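[Since IORING_FEAT_64BIT_ADVISE and io_uring_prep_fadvise64() exist only in
the patches attached above, the following is just a sketch of how an
application might detect and use them if they land as posted. The `features`
argument is assumed to be the io_uring_params.features value returned by
io_uring_queue_init_params().]

```c
/* Sketch: prefer the 64-bit variant when the (proposed) feature bit is set,
 * and refuse to silently truncate on older kernels. */
#include <liburing.h>
#include <errno.h>
#include <stdint.h>

static int fadvise_big(struct io_uring *ring, int fd, __u64 off, off_t len,
		       int advice, unsigned features)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -EBUSY;
	if (features & IORING_FEAT_64BIT_ADVISE) {
		/* kernel reads the full 64-bit length (from sqe->addr here) */
		io_uring_prep_fadvise64(sqe, fd, off, len, advice);
	} else {
		/* old kernels honour only 32 bits; error out rather than truncate */
		if ((unsigned long long) len > UINT32_MAX)
			return -EOVERFLOW;
		io_uring_prep_fadvise(sqe, fd, off, len, advice);
	}
	return io_uring_submit(ring);
}
```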
* Re: madvise/fadvise 32-bit length

From: Stefan @ 2024-06-01 15:51 UTC
To: Jens Axboe, io-uring

On 1/6/2024 17:35, Jens Axboe wrote:
> On 6/1/24 9:22 AM, Stefan wrote:
>> [...]
>>
>> Maybe the explicit flag is a bit clearer for users of the API
>> compared to the implicit flag when setting sqe->len to zero?
>
> We could go either way. The unused fields returning -EINVAL if set right
> now can serve as the flag field - if you have it set, then that is your
> length. If not, then the old style is the length. That's the approach I
> took, rather than add an explicit flag to it. Existing users that would
> set the 64-bit length fields would get -EINVAL already. And since the
> normal flags field is already used for advice flags, I'd prefer just
> using the existing 64-bit zero fields for it rather than add a flag in
> an odd location. Would also make for an easier backport to stable.
>
> But don't feel that strongly about that part.
>
> Attached kernel patch with FEAT added, and liburing patch with 64
> versions added.

Sounds good!

Do we want to do anything about the current (32-bit) functions in
liburing? They silently truncate the user's values, so either marking
them deprecated or changing the type of length in the arguments to a
__u32 could help.
* Re: madvise/fadvise 32-bit length

From: Jens Axboe @ 2024-06-01 18:33 UTC
To: Stefan, io-uring

On 6/1/24 9:51 AM, Stefan wrote:
> On 1/6/2024 17:35, Jens Axboe wrote:
>> [...]
>>
>> Attached kernel patch with FEAT added, and liburing patch with 64
>> versions added.
>
> Sounds good!
> Do we want to do anything about the current (32-bit) functions in
> liburing? They silently truncate the user's values, so either marking
> them deprecated or changing the type of length in the arguments to a
> __u32 could help.

I like changing it to an __u32, and then we'll add a note to the man
page for them as well (with references to the 64-bit variants).

I still need to write a test and actually test the patches, but I'll get
to that Monday. If you want to write a test case that checks the 64-bit
range, then please do!

-- 
Jens Axboe
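[A hypothetical sketch of the signature change discussed here; this is not
taken from an actual liburing commit, only the parameter type differs from
the current helper.]

```c
/* Sketch: the existing helper advertises the width the kernel actually uses
 * (__u32), while the 64-bit variants from the attached patch keep off_t. */
IOURINGINLINE void io_uring_prep_fadvise(struct io_uring_sqe *sqe, int fd,
					 __u64 offset, __u32 len, int advice)
{
	io_uring_prep_rw(IORING_OP_FADVISE, sqe, fd, NULL, len, offset);
	sqe->fadvise_advice = (__u32) advice;
}
```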
* Re: madvise/fadvise 32-bit length

From: Stefan @ 2024-06-02  8:58 UTC
To: Jens Axboe, io-uring

On 1/6/2024 20:33, Jens Axboe wrote:
> On 6/1/24 9:51 AM, Stefan wrote:
>> [...]
>
> I like changing it to an __u32, and then we'll add a note to the man
> page for them as well (with references to the 64-bit variants).
>
> I still need to write a test and actually test the patches, but I'll get
> to that Monday. If you want to write a test case that checks the 64-bit
> range, then please do!

Maybe something like the following for madvise?

Create an 8GB file initialized with 0xaa, punch a (8GB - page_size)
hole using MADV_REMOVE, and check the contents. It requires support
for FALLOC_FL_PUNCH_HOLE in the filesystem.

diff --git a/test/helpers.c b/test/helpers.c
index 779347f..acf1c7d 100644
--- a/test/helpers.c
+++ b/test/helpers.c
@@ -76,8 +76,9 @@ void *t_calloc(size_t nmemb, size_t size)
  */
 static void __t_create_file(const char *file, size_t size, char pattern)
 {
-	ssize_t ret;
-	char *buf;
+	ssize_t ret = 0;
+	size_t size_remaining;
+	char *buf, *buf_loc;
 	int fd;
 
 	buf = t_malloc(size);
@@ -86,11 +87,19 @@ static void __t_create_file(const char *file, size_t size, char pattern)
 	fd = open(file, O_WRONLY | O_CREAT, 0644);
 	assert(fd >= 0);
 
-	ret = write(fd, buf, size);
+	size_remaining = size;
+	buf_loc = buf;
+	while (size_remaining > 0) {
+		ret = write(fd, buf_loc, size_remaining);
+		if (ret <= 0)
+			break;
+		size_remaining -= ret;
+		buf_loc += ret;
+	}
 	fsync(fd);
 	close(fd);
 	free(buf);
-	assert(ret == size);
+	assert(size_remaining == 0);
 }
 
 void t_create_file(const char *file, size_t size)
diff --git a/test/madvise.c b/test/madvise.c
index 7938ec4..b5b0cbe 100644
--- a/test/madvise.c
+++ b/test/madvise.c
@@ -15,34 +15,7 @@
 #include "helpers.h"
 #include "liburing.h"
 
-#define FILE_SIZE	(128 * 1024)
-
-#define LOOPS		100
-#define MIN_LOOPS	10
-
-static unsigned long long utime_since(const struct timeval *s,
-				      const struct timeval *e)
-{
-	long long sec, usec;
-
-	sec = e->tv_sec - s->tv_sec;
-	usec = (e->tv_usec - s->tv_usec);
-	if (sec > 0 && usec < 0) {
-		sec--;
-		usec += 1000000;
-	}
-
-	sec *= 1000000;
-	return sec + usec;
-}
-
-static unsigned long long utime_since_now(struct timeval *tv)
-{
-	struct timeval end;
-
-	gettimeofday(&end, NULL);
-	return utime_since(tv, &end);
-}
+#define FILE_SIZE	(8ULL * 1024ULL * 1024ULL * 1024ULL)
 
 static int do_madvise(struct io_uring *ring, void *addr, off_t len, int advice)
 {
@@ -76,83 +49,62 @@ static int do_madvise(struct io_uring *ring, void *addr, off_t len, int advice)
 		unlink(".madvise.tmp");
 		exit(0);
 	} else if (ret) {
-		fprintf(stderr, "cqe->res=%d\n", cqe->res);
+		fprintf(stderr, "cqe->res=%d (%s)\n", cqe->res,
+			strerror(-cqe->res));
 	}
 	io_uring_cqe_seen(ring, cqe);
 	return ret;
 }
 
-static long do_copy(int fd, char *buf, void *ptr)
-{
-	struct timeval tv;
-
-	gettimeofday(&tv, NULL);
-	memcpy(buf, ptr, FILE_SIZE);
-	return utime_since_now(&tv);
-}
-
 static int test_madvise(struct io_uring *ring, const char *filename)
 {
-	unsigned long cached_read, uncached_read, cached_read2;
+	size_t page_size;
+	unsigned char contents;
 	int fd, ret;
-	char *buf;
-	void *ptr;
+	unsigned char *ptr;
+
+	page_size = sysconf(_SC_PAGE_SIZE);
 
-	fd = open(filename, O_RDONLY);
+	fd = open(filename, O_RDWR);
 	if (fd < 0) {
 		perror("open");
 		return 1;
 	}
 
-	buf = t_malloc(FILE_SIZE);
-
-	ptr = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
+	ptr = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
 	if (ptr == MAP_FAILED) {
 		perror("mmap");
 		return 1;
 	}
 
-	cached_read = do_copy(fd, buf, ptr);
-	if (cached_read == -1)
-		return 1;
-
-	cached_read = do_copy(fd, buf, ptr);
-	if (cached_read == -1)
-		return 1;
-
-	ret = do_madvise(ring, ptr, FILE_SIZE, MADV_DONTNEED);
-	if (ret)
-		return 1;
-
-	uncached_read = do_copy(fd, buf, ptr);
-	if (uncached_read == -1)
-		return 1;
-
-	ret = do_madvise(ring, ptr, FILE_SIZE, MADV_DONTNEED);
-	if (ret)
-		return 1;
-
-	ret = do_madvise(ring, ptr, FILE_SIZE, MADV_WILLNEED);
+	ret =
+	    do_madvise(ring, ptr + page_size, FILE_SIZE - page_size,
+		       MADV_REMOVE);
 	if (ret)
 		return 1;
 
-	msync(ptr, FILE_SIZE, MS_SYNC);
-
-	cached_read2 = do_copy(fd, buf, ptr);
-	if (cached_read2 == -1)
-		return 1;
-
-	if (cached_read < uncached_read &&
-	    cached_read2 < uncached_read)
-		return 0;
+	for (size_t i = 0; i < FILE_SIZE; i++) {
+		contents = ptr[i];
+		if (contents && i > page_size) {
+			fprintf(stderr,
+				"In removed page at %lu but contents=%x\n", i,
+				contents);
+			return 2;
+		} else if (contents != 0xaa && i < page_size) {
+			fprintf(stderr,
+				"In non-removed page at %lu but contents=%x\n",
+				i, contents);
+			return 2;
+		}
+	}
 
-	return 2;
+	return 0;
 }
 
 int main(int argc, char *argv[])
 {
 	struct io_uring ring;
-	int ret, i, good, bad;
+	int ret;
 	char *fname;
 
 	if (argc > 1) {
@@ -167,23 +119,12 @@ int main(int argc, char *argv[])
 		goto err;
 	}
 
-	good = bad = 0;
-	for (i = 0; i < LOOPS; i++) {
-		ret = test_madvise(&ring, fname);
-		if (ret == 1) {
-			fprintf(stderr, "test_madvise failed\n");
-			goto err;
-		} else if (!ret)
-			good++;
-		else if (ret == 2)
-			bad++;
-		if (i >= MIN_LOOPS && !bad)
-			break;
+	ret = test_madvise(&ring, fname);
+	if (ret) {
+		fprintf(stderr, "test_madvise failed\n");
+		goto err;
 	}
 
-	/* too hard to reliably test, just ignore */
-	if ((0) && bad > good)
-		fprintf(stderr, "Suspicious timings (%u > %u)\n", bad, good);
 	if (fname != argv[1])
 		unlink(fname);
 	io_uring_queue_exit(&ring);
* Re: madvise/fadvise 32-bit length

From: Jens Axboe @ 2024-06-02 14:49 UTC
To: Stefan, io-uring

On 6/2/24 2:58 AM, Stefan wrote:
> On 1/6/2024 20:33, Jens Axboe wrote:
>> [...]
>>
>> I still need to write a test and actually test the patches, but I'll get
>> to that Monday. If you want to write a test case that checks the 64-bit
>> range, then please do!
>
> Maybe something like the following for madvise?
> Create an 8GB file initialized with 0xaa, punch a (8GB - page_size)
> hole using MADV_REMOVE, and check the contents. It requires support
> for FALLOC_FL_PUNCH_HOLE in the filesystem.

I think that looks very reasonable, and it's better than the DONTNEED
and timings, it was always a pretty shitty test. We just need to ensure
that we return T_EXIT_SKIP if the fs it's being run on doesn't support
punching holes.

FWIW, I did put the liburing changes in an 'advise' branch, so you could
generate a patch against that. Once we're happy with it, it can get
pulled into master.

-- 
Jens Axboe
* Re: madvise/fadvise 32-bit length

From: Stefan @ 2024-06-05  5:25 UTC
To: Jens Axboe, io-uring

[-- Attachment #1: Type: text/plain, Size: 6150 bytes --]

On 2/6/2024 16:49, Jens Axboe wrote:
> On 6/2/24 2:58 AM, Stefan wrote:
>> [...]
>>
>> Maybe something like the following for madvise?
>> Create an 8GB file initialized with 0xaa, punch a (8GB - page_size)
>> hole using MADV_REMOVE, and check the contents. It requires support
>> for FALLOC_FL_PUNCH_HOLE in the filesystem.
>
> I think that looks very reasonable, and it's better than the DONTNEED
> and timings, it was always a pretty shitty test. We just need to ensure
> that we return T_EXIT_SKIP if the fs it's being run on doesn't support
> punching holes.
>
> FWIW, I did put the liburing changes in an 'advise' branch, so you could
> generate a patch against that. Once we're happy with it, it can get
> pulled into master.

Here's the patch against your branch.

[-- Attachment #2: 0001-Change-madvise-test-to-use-MADV_REMOVE.patch --]
[-- Type: text/x-patch, Size: 7485 bytes --]

From 1570cfdc494c9f38bac84f7fc837f69f99888ac0 Mon Sep 17 00:00:00 2001
From: Stefan Muenzel <[email protected]>
Date: Mon, 3 Jun 2024 18:40:27 +0200
Subject: [PATCH] Change madvise test to use MADV_REMOVE

Create an 8GB file initialized with 0xaa, punch a (8GB - page_size)
hole using MADV_REMOVE, and check the contents.
Requires support for FALLOC_FL_PUNCH_HOLE in the filesystem, which is
checked for.

Signed-off-by: Stefan Muenzel <[email protected]>
---
 test/helpers.c |  39 +++++++++------
 test/madvise.c | 131 +++++++++++++++----------------------------------
 2 files changed, 64 insertions(+), 106 deletions(-)

diff --git a/test/helpers.c b/test/helpers.c
index 779347f..ff5273d 100644
--- a/test/helpers.c
+++ b/test/helpers.c
@@ -76,8 +76,9 @@ void *t_calloc(size_t nmemb, size_t size)
  */
 static void __t_create_file(const char *file, size_t size, char pattern)
 {
-	ssize_t ret;
-	char *buf;
+	ssize_t ret = 0;
+	size_t size_remaining;
+	char *buf, *buf_loc;
 	int fd;
 
 	buf = t_malloc(size);
@@ -86,11 +87,19 @@ static void __t_create_file(const char *file, size_t size, char pattern)
 	fd = open(file, O_WRONLY | O_CREAT, 0644);
 	assert(fd >= 0);
 
-	ret = write(fd, buf, size);
+	size_remaining = size;
+	buf_loc = buf;
+	while (size_remaining > 0) {
+		ret = write(fd, buf_loc, size_remaining);
+		if (ret <= 0)
+			break;
+		size_remaining -= ret;
+		buf_loc += ret;
+	}
 	fsync(fd);
 	close(fd);
 	free(buf);
-	assert(ret == size);
+	assert(size_remaining == 0);
 }
 
 void t_create_file(const char *file, size_t size)
@@ -264,7 +273,7 @@ bool t_probe_defer_taskrun(void)
 	int ret;
 
 	ret = io_uring_queue_init(1, &ring, IORING_SETUP_SINGLE_ISSUER |
-					    IORING_SETUP_DEFER_TASKRUN);
+				  IORING_SETUP_DEFER_TASKRUN);
 	if (ret < 0)
 		return false;
 	io_uring_queue_exit(&ring);
@@ -291,12 +300,12 @@ unsigned __io_uring_flush_sq(struct io_uring *ring)
 		io_uring_smp_store_release(sq->ktail, tail);
 	}
 	/*
-	 * This load needs to be atomic, since sq->khead is written concurrently
-	 * by the kernel, but it doesn't need to be load_acquire, since the
-	 * kernel doesn't store to the submission queue; it advances khead just
-	 * to indicate that it's finished reading the submission queue entries
-	 * so they're available for us to write to.
-	 */
+	 * This load needs to be atomic, since sq->khead is written concurrently
+	 * by the kernel, but it doesn't need to be load_acquire, since the
+	 * kernel doesn't store to the submission queue; it advances khead just
+	 * to indicate that it's finished reading the submission queue entries
+	 * so they're available for us to write to.
+	 */
 	return tail - IO_URING_READ_ONCE(*sq->khead);
 }
 
@@ -306,13 +315,13 @@ unsigned __io_uring_flush_sq(struct io_uring *ring)
 void t_error(int status, int errnum, const char *format, ...)
 {
 	va_list args;
-	va_start(args, format);
+	va_start(args, format);
 
 	vfprintf(stderr, format, args);
-	if (errnum)
-		fprintf(stderr, ": %s", strerror(errnum));
+	if (errnum)
+		fprintf(stderr, ": %s", strerror(errnum));
 
 	fprintf(stderr, "\n");
 	va_end(args);
-	exit(status);
+	exit(status);
 }
diff --git a/test/madvise.c b/test/madvise.c
index 7938ec4..cf5cf4c 100644
--- a/test/madvise.c
+++ b/test/madvise.c
@@ -15,34 +15,7 @@
 #include "helpers.h"
 #include "liburing.h"
 
-#define FILE_SIZE	(128 * 1024)
-
-#define LOOPS		100
-#define MIN_LOOPS	10
-
-static unsigned long long utime_since(const struct timeval *s,
-				      const struct timeval *e)
-{
-	long long sec, usec;
-
-	sec = e->tv_sec - s->tv_sec;
-	usec = (e->tv_usec - s->tv_usec);
-	if (sec > 0 && usec < 0) {
-		sec--;
-		usec += 1000000;
-	}
-
-	sec *= 1000000;
-	return sec + usec;
-}
-
-static unsigned long long utime_since_now(struct timeval *tv)
-{
-	struct timeval end;
-
-	gettimeofday(&end, NULL);
-	return utime_since(tv, &end);
-}
+#define FILE_SIZE	(8ULL * 1024ULL * 1024ULL * 1024ULL)
 
 static int do_madvise(struct io_uring *ring, void *addr, off_t len, int advice)
 {
@@ -76,83 +49,68 @@ static int do_madvise(struct io_uring *ring, void *addr, off_t len, int advice)
 		unlink(".madvise.tmp");
 		exit(0);
 	} else if (ret) {
-		fprintf(stderr, "cqe->res=%d\n", cqe->res);
+		fprintf(stderr, "cqe->res=%d (%s)\n", cqe->res,
+			strerror(-cqe->res));
 	}
 	io_uring_cqe_seen(ring, cqe);
 	return ret;
 }
 
-static long do_copy(int fd, char *buf, void *ptr)
-{
-	struct timeval tv;
-
-	gettimeofday(&tv, NULL);
-	memcpy(buf, ptr, FILE_SIZE);
-	return utime_since_now(&tv);
-}
-
 static int test_madvise(struct io_uring *ring, const char *filename)
 {
-	unsigned long cached_read, uncached_read, cached_read2;
+	size_t page_size;
+	unsigned char contents;
 	int fd, ret;
-	char *buf;
-	void *ptr;
+	unsigned char *ptr;
 
-	fd = open(filename, O_RDONLY);
+	page_size = sysconf(_SC_PAGE_SIZE);
+
+	fd = open(filename, O_RDWR);
 	if (fd < 0) {
 		perror("open");
 		return 1;
 	}
 
-	buf = t_malloc(FILE_SIZE);
+	ret =
+	    fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, page_size,
+		      page_size);
+	if (ret == -1 && errno == EOPNOTSUPP)
+		return 3;
 
-	ptr = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
+	ptr = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
 	if (ptr == MAP_FAILED) {
 		perror("mmap");
 		return 1;
 	}
 
-	cached_read = do_copy(fd, buf, ptr);
-	if (cached_read == -1)
-		return 1;
-
-	cached_read = do_copy(fd, buf, ptr);
-	if (cached_read == -1)
-		return 1;
-
-	ret = do_madvise(ring, ptr, FILE_SIZE, MADV_DONTNEED);
+	ret =
+	    do_madvise(ring, ptr + 2 * page_size, FILE_SIZE - page_size,
		       MADV_REMOVE);
 	if (ret)
 		return 1;
 
-	uncached_read = do_copy(fd, buf, ptr);
-	if (uncached_read == -1)
-		return 1;
-
-	ret = do_madvise(ring, ptr, FILE_SIZE, MADV_DONTNEED);
-	if (ret)
-		return 1;
-
-	ret = do_madvise(ring, ptr, FILE_SIZE, MADV_WILLNEED);
-	if (ret)
-		return 1;
-
-	msync(ptr, FILE_SIZE, MS_SYNC);
-
-	cached_read2 = do_copy(fd, buf, ptr);
-	if (cached_read2 == -1)
-		return 1;
-
-	if (cached_read < uncached_read &&
-	    cached_read2 < uncached_read)
-		return 0;
+	for (size_t i = 0; i < FILE_SIZE; i++) {
+		contents = ptr[i];
+		if (contents && i > page_size) {
+			fprintf(stderr,
+				"In removed page at %lu but contents=%x\n", i,
+				contents);
+			return 2;
+		} else if (contents != 0xaa && i < page_size) {
+			fprintf(stderr,
+				"In non-removed page at %lu but contents=%x\n",
+				i, contents);
+			return 2;
+		}
+	}
 
-	return 2;
+	return 0;
 }
 
 int main(int argc, char *argv[])
 {
 	struct io_uring ring;
-	int ret, i, good, bad;
+	int ret = 0;
 	char *fname;
 
 	if (argc > 1) {
@@ -167,23 +125,12 @@ int main(int argc, char *argv[])
 		goto err;
 	}
 
-	good = bad = 0;
-	for (i = 0; i < LOOPS; i++) {
-		ret = test_madvise(&ring, fname);
-		if (ret == 1) {
-			fprintf(stderr, "test_madvise failed\n");
-			goto err;
-		} else if (!ret)
-			good++;
-		else if (ret == 2)
-			bad++;
-		if (i >= MIN_LOOPS && !bad)
-			break;
+	ret = test_madvise(&ring, fname);
+	if (ret) {
+		fprintf(stderr, "test_madvise failed\n");
+		goto err;
 	}
 
-	/* too hard to reliably test, just ignore */
-	if ((0) && bad > good)
-		fprintf(stderr, "Suspicious timings (%u > %u)\n", bad, good);
 	if (fname != argv[1])
 		unlink(fname);
 	io_uring_queue_exit(&ring);
@@ -191,5 +138,7 @@ int main(int argc, char *argv[])
 err:
 	if (fname != argv[1])
 		unlink(fname);
+	if (ret == 3)
+		return T_EXIT_SKIP;
 	return T_EXIT_FAIL;
 }
-- 
2.45.1