public inbox for [email protected]
 help / color / mirror / Atom feed
From: Stefan <[email protected]>
To: Jens Axboe <[email protected]>, [email protected]
Subject: Re: madvise/fadvise 32-bit length
Date: Sun, 2 Jun 2024 10:58:30 +0200	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

On 1/6/2024 20:33, Jens Axboe wrote:
> On 6/1/24 9:51 AM, Stefan wrote:
>> On 1/6/2024 17:35, Jens Axboe wrote:
>>> On 6/1/24 9:22 AM, Stefan wrote:
>>>> On 1/6/2024 17:05, Jens Axboe wrote:
>>>>> On 6/1/24 8:19 AM, Jens Axboe wrote:
>>>>>> On 6/1/24 3:43 AM, Stefan wrote:
>>>>>>> io_uring uses the __u32 len field in order to pass the length to
>>>>>>> madvise and fadvise, but these calls use an off_t, which is 64bit on
>>>>>>> 64bit platforms.
>>>>>>>
>>>>>>> When using liburing, the length is silently truncated to 32bits (so
>>>>>>> 8GB length would become zero, which has a different meaning of "until
>>>>>>> the end of the file" for fadvise).
>>>>>>>
>>>>>>> If my understanding is correct, we could fix this by introducing new
>>>>>>> operations MADVISE64 and FADVISE64, which use the addr3 field instead
>>>>>>> of the length field for length.
>>>>>>
>>>>>> We probably just want to introduce a flag and ensure that older stable
>>>>>> kernels check it, and then use a 64-bit field for it when the flag is
>>>>>> set.
>>>>>
>>>>> I think this should do it on the kernel side, as we already check these
>>>>> fields and return -EINVAL as needed. Should also be trivial to backport.
>>>>> Totally untested... Might want a FEAT flag for this, or something where
>>>>> it's detectable, to make the liburing change straight forward.
>>>>>
>>>>>
>>>>> diff --git a/io_uring/advise.c b/io_uring/advise.c
>>>>> index 7085804c513c..cb7b881665e5 100644
>>>>> --- a/io_uring/advise.c
>>>>> +++ b/io_uring/advise.c
>>>>> @@ -17,14 +17,14 @@
>>>>>     struct io_fadvise {
>>>>>         struct file            *file;
>>>>>         u64                offset;
>>>>> -    u32                len;
>>>>> +    u64                len;
>>>>>         u32                advice;
>>>>>     };
>>>>>       struct io_madvise {
>>>>>         struct file            *file;
>>>>>         u64                addr;
>>>>> -    u32                len;
>>>>> +    u64                len;
>>>>>         u32                advice;
>>>>>     };
>>>>>     @@ -33,11 +33,13 @@ int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>     #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU)
>>>>>         struct io_madvise *ma = io_kiocb_to_cmd(req, struct io_madvise);
>>>>>     -    if (sqe->buf_index || sqe->off || sqe->splice_fd_in)
>>>>> +    if (sqe->buf_index || sqe->splice_fd_in)
>>>>>             return -EINVAL;
>>>>>           ma->addr = READ_ONCE(sqe->addr);
>>>>> -    ma->len = READ_ONCE(sqe->len);
>>>>> +    ma->len = READ_ONCE(sqe->off);
>>>>> +    if (!ma->len)
>>>>> +        ma->len = READ_ONCE(sqe->len);
>>>>>         ma->advice = READ_ONCE(sqe->fadvise_advice);
>>>>>         req->flags |= REQ_F_FORCE_ASYNC;
>>>>>         return 0;
>>>>> @@ -78,11 +80,13 @@ int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>     {
>>>>>         struct io_fadvise *fa = io_kiocb_to_cmd(req, struct io_fadvise);
>>>>>     -    if (sqe->buf_index || sqe->addr || sqe->splice_fd_in)
>>>>> +    if (sqe->buf_index || sqe->splice_fd_in)
>>>>>             return -EINVAL;
>>>>>           fa->offset = READ_ONCE(sqe->off);
>>>>> -    fa->len = READ_ONCE(sqe->len);
>>>>> +    fa->len = READ_ONCE(sqe->addr);
>>>>> +    if (!fa->len)
>>>>> +        fa->len = READ_ONCE(sqe->len);
>>>>>         fa->advice = READ_ONCE(sqe->fadvise_advice);
>>>>>         if (io_fadvise_force_async(fa))
>>>>>             req->flags |= REQ_F_FORCE_ASYNC;
>>>>>
>>>>
>>>>
>>>> If we want to have the length in the same field in both *ADVISE
>>>> operations, we can put a flag in splice_fd_in/optlen.
>>>
>>> I don't think that part matters that much.
>>>
>>>> Maybe the explicit flag is a bit clearer for users of the API
>>>> compared to the implicit flag when setting sqe->len to zero?
>>>
>>> We could go either way. The unused fields returning -EINVAL if set right
>>> now can serve as the flag field - if you have it set, then that is your
>>> length. If not, then the old style is the length. That's the approach I
>>> took, rather than add an explicit flag to it. Existing users that would
>>> set the 64-bit length fields would get -EINVAL already. And since the
>>> normal flags field is already used for advice flags, I'd prefer just
>>> using the existing 64-bit zero fields for it rather than add a flag in
>>> an odd location. Would also make for an easier backport to stable.
>>>
>>> But don't feel that strongly about that part.
>>>
>>> Attached kernel patch with FEAT added, and liburing patch with 64
>>> versions added.
>>>
>>
>> Sounds good!
>> Do we want to do anything about the current (32-bit) functions in
>> liburing? They silently truncate the user's values, so either marking
>> them deprecated or changing the type of length in the arguments to a
>> __u32 could help.
> 
> I like changing it to an __u32, and then we'll add a note to the man
> page for them as well (with references to the 64-bit variants).
> 
> I still need to write a test and actually test the patches, but I'll get
> to that Monday. If you want to write a test case that checks the 64-bit
> range, then please do!
> 

Maybe something like the following for madvise?
Create an 8GB file initialized with 0xaa, punch a (8GB - page_size) hole 
using MADV_REMOVE, and check the contents.
It requires support for FALLOC_FL_PUNCH_HOLE in the filesystem.


diff --git a/test/helpers.c b/test/helpers.c
index 779347f..acf1c7d 100644
--- a/test/helpers.c
+++ b/test/helpers.c
@@ -76,8 +76,9 @@ void *t_calloc(size_t nmemb, size_t size)
   */
  static void __t_create_file(const char *file, size_t size, char pattern)
  {
-	ssize_t ret;
-	char *buf;
+	ssize_t ret = 0;
+	size_t size_remaining;
+	char *buf, *buf_loc;
  	int fd;

  	buf = t_malloc(size);
@@ -86,11 +87,19 @@ static void __t_create_file(const char *file, size_t 
size, char pattern)
  	fd = open(file, O_WRONLY | O_CREAT, 0644);
  	assert(fd >= 0);

-	ret = write(fd, buf, size);
+	size_remaining = size;
+	buf_loc = buf;
+	while (size_remaining > 0) {
+		ret = write(fd, buf_loc, size_remaining);
+		if (ret <= 0)
+			break;
+		size_remaining -= ret;
+		buf_loc += ret;
+	}
  	fsync(fd);
  	close(fd);
  	free(buf);
-	assert(ret == size);
+	assert(size_remaining == 0);
  }

  void t_create_file(const char *file, size_t size)
diff --git a/test/madvise.c b/test/madvise.c
index 7938ec4..b5b0cbe 100644
--- a/test/madvise.c
+++ b/test/madvise.c
@@ -15,34 +15,7 @@
  #include "helpers.h"
  #include "liburing.h"

-#define FILE_SIZE	(128 * 1024)
-
-#define LOOPS		100
-#define MIN_LOOPS	10
-
-static unsigned long long utime_since(const struct timeval *s,
-				      const struct timeval *e)
-{
-	long long sec, usec;
-
-	sec = e->tv_sec - s->tv_sec;
-	usec = (e->tv_usec - s->tv_usec);
-	if (sec > 0 && usec < 0) {
-		sec--;
-		usec += 1000000;
-	}
-
-	sec *= 1000000;
-	return sec + usec;
-}
-
-static unsigned long long utime_since_now(struct timeval *tv)
-{
-	struct timeval end;
-
-	gettimeofday(&end, NULL);
-	return utime_since(tv, &end);
-}
+#define FILE_SIZE	(8ULL * 1024ULL * 1024ULL * 1024ULL)

  static int do_madvise(struct io_uring *ring, void *addr, off_t len, 
int advice)
  {
@@ -76,83 +49,62 @@ static int do_madvise(struct io_uring *ring, void 
*addr, off_t len, int advice)
  		unlink(".madvise.tmp");
  		exit(0);
  	} else if (ret) {
-		fprintf(stderr, "cqe->res=%d\n", cqe->res);
+		fprintf(stderr, "cqe->res=%d (%s)\n", cqe->res,
+			strerror(-cqe->res));
  	}
  	io_uring_cqe_seen(ring, cqe);
  	return ret;
  }

-static long do_copy(int fd, char *buf, void *ptr)
-{
-	struct timeval tv;
-
-	gettimeofday(&tv, NULL);
-	memcpy(buf, ptr, FILE_SIZE);
-	return utime_since_now(&tv);
-}
-
  static int test_madvise(struct io_uring *ring, const char *filename)
  {
-	unsigned long cached_read, uncached_read, cached_read2;
+	size_t page_size;
+	unsigned char contents;
  	int fd, ret;
-	char *buf;
-	void *ptr;
+	unsigned char *ptr;
+
+	page_size = sysconf(_SC_PAGE_SIZE);

-	fd = open(filename, O_RDONLY);
+	fd = open(filename, O_RDWR);
  	if (fd < 0) {
  		perror("open");
  		return 1;
  	}

-	buf = t_malloc(FILE_SIZE);
-
-	ptr = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
+	ptr = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  	if (ptr == MAP_FAILED) {
  		perror("mmap");
  		return 1;
  	}

-	cached_read = do_copy(fd, buf, ptr);
-	if (cached_read == -1)
-		return 1;
-
-	cached_read = do_copy(fd, buf, ptr);
-	if (cached_read == -1)
-		return 1;
-
-	ret = do_madvise(ring, ptr, FILE_SIZE, MADV_DONTNEED);
-	if (ret)
-		return 1;
-
-	uncached_read = do_copy(fd, buf, ptr);
-	if (uncached_read == -1)
-		return 1;
-
-	ret = do_madvise(ring, ptr, FILE_SIZE, MADV_DONTNEED);
-	if (ret)
-		return 1;
-
-	ret = do_madvise(ring, ptr, FILE_SIZE, MADV_WILLNEED);
+	ret =
+	    do_madvise(ring, ptr + page_size, FILE_SIZE - page_size,
+		       MADV_REMOVE);
  	if (ret)
  		return 1;

-	msync(ptr, FILE_SIZE, MS_SYNC);
-
-	cached_read2 = do_copy(fd, buf, ptr);
-	if (cached_read2 == -1)
-		return 1;
-
-	if (cached_read < uncached_read &&
-	    cached_read2 < uncached_read)
-		return 0;
+	for (size_t i = 0; i < FILE_SIZE; i++) {
+		contents = ptr[i];
+		if (contents && i > page_size) {
+			fprintf(stderr,
+				"In removed page at %lu but contents=%x\n", i,
+				contents);
+			return 2;
+		} else if (contents != 0xaa && i < page_size) {
+			fprintf(stderr,
+				"In non-removed page at %lu but contents=%x\n",
+				i, contents);
+			return 2;
+		}
+	}

-	return 2;
+	return 0;
  }

  int main(int argc, char *argv[])
  {
  	struct io_uring ring;
-	int ret, i, good, bad;
+	int ret;
  	char *fname;

  	if (argc > 1) {
@@ -167,23 +119,12 @@ int main(int argc, char *argv[])
  		goto err;
  	}

-	good = bad = 0;
-	for (i = 0; i < LOOPS; i++) {
-		ret = test_madvise(&ring, fname);
-		if (ret == 1) {
-			fprintf(stderr, "test_madvise failed\n");
-			goto err;
-		} else if (!ret)
-			good++;
-		else if (ret == 2)
-			bad++;
-		if (i >= MIN_LOOPS && !bad)
-			break;
+	ret = test_madvise(&ring, fname);
+	if (ret) {
+		fprintf(stderr, "test_madvise failed\n");
+		goto err;
  	}

-	/* too hard to reliably test, just ignore */
-	if ((0) && bad > good)
-		fprintf(stderr, "Suspicious timings (%u > %u)\n", bad, good);
  	if (fname != argv[1])
  		unlink(fname);
  	io_uring_queue_exit(&ring);


  reply	other threads:[~2024-06-02  8:58 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-01  9:43 madvise/fadvise 32-bit length Stefan
2024-06-01 14:19 ` Jens Axboe
2024-06-01 15:05   ` Jens Axboe
2024-06-01 15:22     ` Stefan
2024-06-01 15:35       ` Jens Axboe
2024-06-01 15:51         ` Stefan
2024-06-01 18:33           ` Jens Axboe
2024-06-02  8:58             ` Stefan [this message]
2024-06-02 14:49               ` Jens Axboe
2024-06-05  5:25                 ` Stefan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox