public inbox for [email protected]
 help / color / mirror / Atom feed
From: Jens Axboe <[email protected]>
To: Stefan <[email protected]>, [email protected]
Subject: Re: madvise/fadvise 32-bit length
Date: Sat, 1 Jun 2024 09:35:49 -0600	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

[-- Attachment #1: Type: text/plain, Size: 4098 bytes --]

On 6/1/24 9:22 AM, Stefan wrote:
> On 1/6/2024 17:05, Jens Axboe wrote:
>> On 6/1/24 8:19 AM, Jens Axboe wrote:
>>> On 6/1/24 3:43 AM, Stefan wrote:
>>>> io_uring uses the __u32 len field in order to pass the length to
>>>> madvise and fadvise, but these calls use an off_t, which is 64bit on
>>>> 64bit platforms.
>>>>
>>>> When using liburing, the length is silently truncated to 32bits (so
>>>> 8GB length would become zero, which has a different meaning of "until
>>>> the end of the file" for fadvise).
>>>>
>>>> If my understanding is correct, we could fix this by introducing new
>>>> operations MADVISE64 and FADVISE64, which use the addr3 field instead
>>>> of the length field for length.
>>>
>>> We probably just want to introduce a flag and ensure that older stable
>>> kernels check it, and then use a 64-bit field for it when the flag is
>>> set.
>>
>> I think this should do it on the kernel side, as we already check these
>> fields and return -EINVAL as needed. Should also be trivial to backport.
>> Totally untested... Might want a FEAT flag for this, or something where
>> it's detectable, to make the liburing change straight forward.
>>
>>
>> diff --git a/io_uring/advise.c b/io_uring/advise.c
>> index 7085804c513c..cb7b881665e5 100644
>> --- a/io_uring/advise.c
>> +++ b/io_uring/advise.c
>> @@ -17,14 +17,14 @@
>>   struct io_fadvise {
>>       struct file            *file;
>>       u64                offset;
>> -    u32                len;
>> +    u64                len;
>>       u32                advice;
>>   };
>>     struct io_madvise {
>>       struct file            *file;
>>       u64                addr;
>> -    u32                len;
>> +    u64                len;
>>       u32                advice;
>>   };
>>   @@ -33,11 +33,13 @@ int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>   #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU)
>>       struct io_madvise *ma = io_kiocb_to_cmd(req, struct io_madvise);
>>   -    if (sqe->buf_index || sqe->off || sqe->splice_fd_in)
>> +    if (sqe->buf_index || sqe->splice_fd_in)
>>           return -EINVAL;
>>         ma->addr = READ_ONCE(sqe->addr);
>> -    ma->len = READ_ONCE(sqe->len);
>> +    ma->len = READ_ONCE(sqe->off);
>> +    if (!ma->len)
>> +        ma->len = READ_ONCE(sqe->len);
>>       ma->advice = READ_ONCE(sqe->fadvise_advice);
>>       req->flags |= REQ_F_FORCE_ASYNC;
>>       return 0;
>> @@ -78,11 +80,13 @@ int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>   {
>>       struct io_fadvise *fa = io_kiocb_to_cmd(req, struct io_fadvise);
>>   -    if (sqe->buf_index || sqe->addr || sqe->splice_fd_in)
>> +    if (sqe->buf_index || sqe->splice_fd_in)
>>           return -EINVAL;
>>         fa->offset = READ_ONCE(sqe->off);
>> -    fa->len = READ_ONCE(sqe->len);
>> +    fa->len = READ_ONCE(sqe->addr);
>> +    if (!fa->len)
>> +        fa->len = READ_ONCE(sqe->len);
>>       fa->advice = READ_ONCE(sqe->fadvise_advice);
>>       if (io_fadvise_force_async(fa))
>>           req->flags |= REQ_F_FORCE_ASYNC;
>>
> 
> 
> If we want to have the length in the same field in both *ADVISE
> operations, we can put a flag in splice_fd_in/optlen.

I don't think that part matters that much.

> Maybe the explicit flag is a bit clearer for users of the API
> compared to the implicit flag when setting sqe->len to zero?

We could go either way. The unused fields returning -EINVAL if set right
now can serve as the flag field - if you have it set, then that is your
length. If not, then the old style is the length. That's the approach I
took, rather than add an explicit flag to it. Existing users that would
set the 64-bit length fields would get -EINVAL already. And since the
normal flags field is already used for advice flags, I'd prefer just
using the existing 64-bit zero fields for it rather than add a flag in
an odd location. Would also make for an easier backport to stable.

But don't feel that strongly about that part.

Attached kernel patch with FEAT added, and liburing patch with 64
versions added.

-- 
Jens Axboe

[-- Attachment #2: kpatch.txt --]
[-- Type: text/plain, Size: 2561 bytes --]

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 994bf7af0efe..73a8baba3428 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -540,6 +540,7 @@ struct io_uring_params {
 #define IORING_FEAT_LINKED_FILE		(1U << 12)
 #define IORING_FEAT_REG_REG_RING	(1U << 13)
 #define IORING_FEAT_RECVSEND_BUNDLE	(1U << 14)
+#define IORING_FEAT_64BIT_ADVISE	(1U << 15)
 
 /*
  * io_uring_register(2) opcodes and arguments
diff --git a/io_uring/advise.c b/io_uring/advise.c
index 7085804c513c..cb7b881665e5 100644
--- a/io_uring/advise.c
+++ b/io_uring/advise.c
@@ -17,14 +17,14 @@
 struct io_fadvise {
 	struct file			*file;
 	u64				offset;
-	u32				len;
+	u64				len;
 	u32				advice;
 };
 
 struct io_madvise {
 	struct file			*file;
 	u64				addr;
-	u32				len;
+	u64				len;
 	u32				advice;
 };
 
@@ -33,11 +33,13 @@ int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU)
 	struct io_madvise *ma = io_kiocb_to_cmd(req, struct io_madvise);
 
-	if (sqe->buf_index || sqe->off || sqe->splice_fd_in)
+	if (sqe->buf_index || sqe->splice_fd_in)
 		return -EINVAL;
 
 	ma->addr = READ_ONCE(sqe->addr);
-	ma->len = READ_ONCE(sqe->len);
+	ma->len = READ_ONCE(sqe->off);
+	if (!ma->len)
+		ma->len = READ_ONCE(sqe->len);
 	ma->advice = READ_ONCE(sqe->fadvise_advice);
 	req->flags |= REQ_F_FORCE_ASYNC;
 	return 0;
@@ -78,11 +80,13 @@ int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_fadvise *fa = io_kiocb_to_cmd(req, struct io_fadvise);
 
-	if (sqe->buf_index || sqe->addr || sqe->splice_fd_in)
+	if (sqe->buf_index || sqe->splice_fd_in)
 		return -EINVAL;
 
 	fa->offset = READ_ONCE(sqe->off);
-	fa->len = READ_ONCE(sqe->len);
+	fa->len = READ_ONCE(sqe->addr);
+	if (!fa->len)
+		fa->len = READ_ONCE(sqe->len);
 	fa->advice = READ_ONCE(sqe->fadvise_advice);
 	if (io_fadvise_force_async(fa))
 		req->flags |= REQ_F_FORCE_ASYNC;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 816e93e7f949..76896db7390d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3569,7 +3569,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 			IORING_FEAT_EXT_ARG | IORING_FEAT_NATIVE_WORKERS |
 			IORING_FEAT_RSRC_TAGS | IORING_FEAT_CQE_SKIP |
 			IORING_FEAT_LINKED_FILE | IORING_FEAT_REG_REG_RING |
-			IORING_FEAT_RECVSEND_BUNDLE;
+			IORING_FEAT_RECVSEND_BUNDLE | IORING_FEAT_64BIT_ADVISE;
 
 	if (copy_to_user(params, p, sizeof(*p))) {
 		ret = -EFAULT;

[-- Attachment #3: liburing.txt --]
[-- Type: text/plain, Size: 1687 bytes --]

diff --git a/src/include/liburing.h b/src/include/liburing.h
index 0a02364540a8..6f6d9db2f08a 100644
--- a/src/include/liburing.h
+++ b/src/include/liburing.h
@@ -764,6 +764,21 @@ IOURINGINLINE void io_uring_prep_madvise(struct io_uring_sqe *sqe, void *addr,
 	sqe->fadvise_advice = (__u32) advice;
 }
 
+IOURINGINLINE void io_uring_prep_fadvise64(struct io_uring_sqe *sqe, int fd,
+					 __u64 offset, off_t len, int advice)
+{
+	io_uring_prep_rw(IORING_OP_FADVISE, sqe, fd, NULL, 0, offset);
+	sqe->addr = len;
+	sqe->fadvise_advice = (__u32) advice;
+}
+
+IOURINGINLINE void io_uring_prep_madvise64(struct io_uring_sqe *sqe, void *addr,
+					 off_t length, int advice)
+{
+	io_uring_prep_rw(IORING_OP_MADVISE, sqe, -1, addr, 0, length);
+	sqe->fadvise_advice = (__u32) advice;
+}
+
 IOURINGINLINE void io_uring_prep_send(struct io_uring_sqe *sqe, int sockfd,
 				      const void *buf, size_t len, int flags)
 {
diff --git a/src/include/liburing/io_uring.h b/src/include/liburing/io_uring.h
index 9330733efd66..12fbfcf3e689 100644
--- a/src/include/liburing/io_uring.h
+++ b/src/include/liburing/io_uring.h
@@ -537,6 +537,7 @@ struct io_uring_params {
 #define IORING_FEAT_LINKED_FILE		(1U << 12)
 #define IORING_FEAT_REG_REG_RING	(1U << 13)
 #define IORING_FEAT_RECVSEND_BUNDLE	(1U << 14)
+#define IORING_FEAT_64BIT_ADVISE	(1U << 15)
 
 /*
  * io_uring_register(2) opcodes and arguments
diff --git a/src/liburing-ffi.map b/src/liburing-ffi.map
index 3be48d02ac86..0e4bd9d1d78d 100644
--- a/src/liburing-ffi.map
+++ b/src/liburing-ffi.map
@@ -199,4 +199,6 @@ LIBURING_2.6 {
 } LIBURING_2.5;
 
 LIBURING_2.7 {
+		io_uring_prep_fadvise64;
+		io_uring_prep_madvise64;
 } LIBURING_2.6;

  reply	other threads:[~2024-06-01 15:35 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-01  9:43 madvise/fadvise 32-bit length Stefan
2024-06-01 14:19 ` Jens Axboe
2024-06-01 15:05   ` Jens Axboe
2024-06-01 15:22     ` Stefan
2024-06-01 15:35       ` Jens Axboe [this message]
2024-06-01 15:51         ` Stefan
2024-06-01 18:33           ` Jens Axboe
2024-06-02  8:58             ` Stefan
2024-06-02 14:49               ` Jens Axboe
2024-06-05  5:25                 ` Stefan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox