public inbox for [email protected]
 help / color / mirror / Atom feed
From: Stefan <[email protected]>
To: Jens Axboe <[email protected]>, [email protected]
Subject: Re: madvise/fadvise 32-bit length
Date: Wed, 5 Jun 2024 07:25:16 +0200	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

[-- Attachment #1: Type: text/plain, Size: 6150 bytes --]

On 2/6/2024 16:49, Jens Axboe wrote:
> On 6/2/24 2:58 AM, Stefan wrote:
>> On 1/6/2024 20:33, Jens Axboe wrote:
>>> On 6/1/24 9:51 AM, Stefan wrote:
>>>> On 1/6/2024 17:35, Jens Axboe wrote:
>>>>> On 6/1/24 9:22 AM, Stefan wrote:
>>>>>> On 1/6/2024 17:05, Jens Axboe wrote:
>>>>>>> On 6/1/24 8:19 AM, Jens Axboe wrote:
>>>>>>>> On 6/1/24 3:43 AM, Stefan wrote:
>>>>>>>>> io_uring uses the __u32 len field in order to pass the length to
>>>>>>>>> madvise and fadvise, but these calls use an off_t, which is 64bit on
>>>>>>>>> 64bit platforms.
>>>>>>>>>
>>>>>>>>> When using liburing, the length is silently truncated to 32bits (so
>>>>>>>>> 8GB length would become zero, which has a different meaning of "until
>>>>>>>>> the end of the file" for fadvise).
>>>>>>>>>
>>>>>>>>> If my understanding is correct, we could fix this by introducing new
>>>>>>>>> operations MADVISE64 and FADVISE64, which use the addr3 field instead
>>>>>>>>> of the length field for length.
>>>>>>>>
>>>>>>>> We probably just want to introduce a flag and ensure that older stable
>>>>>>>> kernels check it, and then use a 64-bit field for it when the flag is
>>>>>>>> set.
>>>>>>>
>>>>>>> I think this should do it on the kernel side, as we already check these
>>>>>>> fields and return -EINVAL as needed. Should also be trivial to backport.
>>>>>>> Totally untested... Might want a FEAT flag for this, or something where
>>>>>>> it's detectable, to make the liburing change straight forward.
>>>>>>>
>>>>>>>
>>>>>>> diff --git a/io_uring/advise.c b/io_uring/advise.c
>>>>>>> index 7085804c513c..cb7b881665e5 100644
>>>>>>> --- a/io_uring/advise.c
>>>>>>> +++ b/io_uring/advise.c
>>>>>>> @@ -17,14 +17,14 @@
>>>>>>>      struct io_fadvise {
>>>>>>>          struct file            *file;
>>>>>>>          u64                offset;
>>>>>>> -    u32                len;
>>>>>>> +    u64                len;
>>>>>>>          u32                advice;
>>>>>>>      };
>>>>>>>        struct io_madvise {
>>>>>>>          struct file            *file;
>>>>>>>          u64                addr;
>>>>>>> -    u32                len;
>>>>>>> +    u64                len;
>>>>>>>          u32                advice;
>>>>>>>      };
>>>>>>>      @@ -33,11 +33,13 @@ int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>>>      #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU)
>>>>>>>          struct io_madvise *ma = io_kiocb_to_cmd(req, struct io_madvise);
>>>>>>>      -    if (sqe->buf_index || sqe->off || sqe->splice_fd_in)
>>>>>>> +    if (sqe->buf_index || sqe->splice_fd_in)
>>>>>>>              return -EINVAL;
>>>>>>>            ma->addr = READ_ONCE(sqe->addr);
>>>>>>> -    ma->len = READ_ONCE(sqe->len);
>>>>>>> +    ma->len = READ_ONCE(sqe->off);
>>>>>>> +    if (!ma->len)
>>>>>>> +        ma->len = READ_ONCE(sqe->len);
>>>>>>>          ma->advice = READ_ONCE(sqe->fadvise_advice);
>>>>>>>          req->flags |= REQ_F_FORCE_ASYNC;
>>>>>>>          return 0;
>>>>>>> @@ -78,11 +80,13 @@ int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>>>      {
>>>>>>>          struct io_fadvise *fa = io_kiocb_to_cmd(req, struct io_fadvise);
>>>>>>>      -    if (sqe->buf_index || sqe->addr || sqe->splice_fd_in)
>>>>>>> +    if (sqe->buf_index || sqe->splice_fd_in)
>>>>>>>              return -EINVAL;
>>>>>>>            fa->offset = READ_ONCE(sqe->off);
>>>>>>> -    fa->len = READ_ONCE(sqe->len);
>>>>>>> +    fa->len = READ_ONCE(sqe->addr);
>>>>>>> +    if (!fa->len)
>>>>>>> +        fa->len = READ_ONCE(sqe->len);
>>>>>>>          fa->advice = READ_ONCE(sqe->fadvise_advice);
>>>>>>>          if (io_fadvise_force_async(fa))
>>>>>>>              req->flags |= REQ_F_FORCE_ASYNC;
>>>>>>>
>>>>>>
>>>>>>
>>>>>> If we want to have the length in the same field in both *ADVISE
>>>>>> operations, we can put a flag in splice_fd_in/optlen.
>>>>>
>>>>> I don't think that part matters that much.
>>>>>
>>>>>> Maybe the explicit flag is a bit clearer for users of the API
>>>>>> compared to the implicit flag when setting sqe->len to zero?
>>>>>
>>>>> We could go either way. The unused fields returning -EINVAL if set right
>>>>> now can serve as the flag field - if you have it set, then that is your
>>>>> length. If not, then the old style is the length. That's the approach I
>>>>> took, rather than add an explicit flag to it. Existing users that would
>>>>> set the 64-bit length fields would get -EINVAL already. And since the
>>>>> normal flags field is already used for advice flags, I'd prefer just
>>>>> using the existing 64-bit zero fields for it rather than add a flag in
>>>>> an odd location. Would also make for an easier backport to stable.
>>>>>
>>>>> But don't feel that strongly about that part.
>>>>>
>>>>> Attached kernel patch with FEAT added, and liburing patch with 64
>>>>> versions added.
>>>>>
>>>>
>>>> Sounds good!
>>>> Do we want to do anything about the current (32-bit) functions in
>>>> liburing? They silently truncate the user's values, so either marking
>>>> them deprecated or changing the type of length in the arguments to a
>>>> __u32 could help.
>>>
>>> I like changing it to an __u32, and then we'll add a note to the man
>>> page for them as well (with references to the 64-bit variants).
>>>
>>> I still need to write a test and actually test the patches, but I'll get
>>> to that Monday. If you want to write a test case that checks the 64-bit
>>> range, then please do!
>>>
>>
>> Maybe something like the following for madvise?
>> Create an 8GB file initialized with 0xaa, punch a (8GB - page_size)
>> hole using MADV_REMOVE, and check the contents. It requires support
>> for FALLOC_FL_PUNCH_HOLE in the filesystem.
> 
> I think that looks very reasonable, and it's better than the DONTNEED
> and timings, it was always a pretty shitty test. We just need to ensure
> that we return T_EXIT_SKIP if the fs it's being run on doesn't support
> punching holes.
> 
> FWIW, I did put the liburing changes in an 'advise' branch, so you could
> generate a patch against that. Once we're happy with it, it can get
> pulled into master.
> 

Here's the patch against your branch.

[-- Attachment #2: 0001-Change-madvise-test-to-use-MADV_REMOVE.patch --]
[-- Type: text/x-patch, Size: 7485 bytes --]

From 1570cfdc494c9f38bac84f7fc837f69f99888ac0 Mon Sep 17 00:00:00 2001
From: Stefan Muenzel <[email protected]>
Date: Mon, 3 Jun 2024 18:40:27 +0200
Subject: [PATCH] Change madvise test to use MADV_REMOVE

Create an 8GB file initialized with 0xaa, punch a (8GB - page_size)
hole using MADV_REMOVE, and check the contents.
Requires support for FALLOC_FL_PUNCH_HOLE in the filesystem, which
is checked for

Signed-off-by: Stefan Muenzel <[email protected]>
---
 test/helpers.c |  39 +++++++++------
 test/madvise.c | 131 +++++++++++++++----------------------------------
 2 files changed, 64 insertions(+), 106 deletions(-)

diff --git a/test/helpers.c b/test/helpers.c
index 779347f..ff5273d 100644
--- a/test/helpers.c
+++ b/test/helpers.c
@@ -76,8 +76,9 @@ void *t_calloc(size_t nmemb, size_t size)
  */
 static void __t_create_file(const char *file, size_t size, char pattern)
 {
-	ssize_t ret;
-	char *buf;
+	ssize_t ret = 0;
+	size_t size_remaining;
+	char *buf, *buf_loc;
 	int fd;
 
 	buf = t_malloc(size);
@@ -86,11 +87,19 @@ static void __t_create_file(const char *file, size_t size, char pattern)
 	fd = open(file, O_WRONLY | O_CREAT, 0644);
 	assert(fd >= 0);
 
-	ret = write(fd, buf, size);
+	size_remaining = size;
+	buf_loc = buf;
+	while (size_remaining > 0) {
+		ret = write(fd, buf_loc, size_remaining);
+		if (ret <= 0)
+			break;
+		size_remaining -= ret;
+		buf_loc += ret;
+	}
 	fsync(fd);
 	close(fd);
 	free(buf);
-	assert(ret == size);
+	assert(size_remaining == 0);
 }
 
 void t_create_file(const char *file, size_t size)
@@ -264,7 +273,7 @@ bool t_probe_defer_taskrun(void)
 	int ret;
 
 	ret = io_uring_queue_init(1, &ring, IORING_SETUP_SINGLE_ISSUER |
-					    IORING_SETUP_DEFER_TASKRUN);
+				  IORING_SETUP_DEFER_TASKRUN);
 	if (ret < 0)
 		return false;
 	io_uring_queue_exit(&ring);
@@ -291,12 +300,12 @@ unsigned __io_uring_flush_sq(struct io_uring *ring)
 			io_uring_smp_store_release(sq->ktail, tail);
 	}
 	/*
-	* This load needs to be atomic, since sq->khead is written concurrently
-	* by the kernel, but it doesn't need to be load_acquire, since the
-	* kernel doesn't store to the submission queue; it advances khead just
-	* to indicate that it's finished reading the submission queue entries
-	* so they're available for us to write to.
-	*/
+	 * This load needs to be atomic, since sq->khead is written concurrently
+	 * by the kernel, but it doesn't need to be load_acquire, since the
+	 * kernel doesn't store to the submission queue; it advances khead just
+	 * to indicate that it's finished reading the submission queue entries
+	 * so they're available for us to write to.
+	 */
 	return tail - IO_URING_READ_ONCE(*sq->khead);
 }
 
@@ -306,13 +315,13 @@ unsigned __io_uring_flush_sq(struct io_uring *ring)
 void t_error(int status, int errnum, const char *format, ...)
 {
 	va_list args;
-    	va_start(args, format);
+	va_start(args, format);
 
 	vfprintf(stderr, format, args);
-    	if (errnum)
-        	fprintf(stderr, ": %s", strerror(errnum));
+	if (errnum)
+		fprintf(stderr, ": %s", strerror(errnum));
 
 	fprintf(stderr, "\n");
 	va_end(args);
-    	exit(status);
+	exit(status);
 }
diff --git a/test/madvise.c b/test/madvise.c
index 7938ec4..cf5cf4c 100644
--- a/test/madvise.c
+++ b/test/madvise.c
@@ -15,34 +15,7 @@
 #include "helpers.h"
 #include "liburing.h"
 
-#define FILE_SIZE	(128 * 1024)
-
-#define LOOPS		100
-#define MIN_LOOPS	10
-
-static unsigned long long utime_since(const struct timeval *s,
-				      const struct timeval *e)
-{
-	long long sec, usec;
-
-	sec = e->tv_sec - s->tv_sec;
-	usec = (e->tv_usec - s->tv_usec);
-	if (sec > 0 && usec < 0) {
-		sec--;
-		usec += 1000000;
-	}
-
-	sec *= 1000000;
-	return sec + usec;
-}
-
-static unsigned long long utime_since_now(struct timeval *tv)
-{
-	struct timeval end;
-
-	gettimeofday(&end, NULL);
-	return utime_since(tv, &end);
-}
+#define FILE_SIZE    (8ULL * 1024ULL * 1024ULL * 1024ULL)
 
 static int do_madvise(struct io_uring *ring, void *addr, off_t len, int advice)
 {
@@ -76,83 +49,68 @@ static int do_madvise(struct io_uring *ring, void *addr, off_t len, int advice)
 		unlink(".madvise.tmp");
 		exit(0);
 	} else if (ret) {
-		fprintf(stderr, "cqe->res=%d\n", cqe->res);
+		fprintf(stderr, "cqe->res=%d (%s)\n", cqe->res,
+			strerror(-cqe->res));
 	}
 	io_uring_cqe_seen(ring, cqe);
 	return ret;
 }
 
-static long do_copy(int fd, char *buf, void *ptr)
-{
-	struct timeval tv;
-
-	gettimeofday(&tv, NULL);
-	memcpy(buf, ptr, FILE_SIZE);
-	return utime_since_now(&tv);
-}
-
 static int test_madvise(struct io_uring *ring, const char *filename)
 {
-	unsigned long cached_read, uncached_read, cached_read2;
+	size_t page_size;
+	unsigned char contents;
 	int fd, ret;
-	char *buf;
-	void *ptr;
+	unsigned char *ptr;
 
-	fd = open(filename, O_RDONLY);
+	page_size = sysconf(_SC_PAGE_SIZE);
+
+	fd = open(filename, O_RDWR);
 	if (fd < 0) {
 		perror("open");
 		return 1;
 	}
 
-	buf = t_malloc(FILE_SIZE);
+	ret =
+	    fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, page_size,
+		      page_size);
+	if (ret == -1 && errno == EOPNOTSUPP)
+		return 3;
 
-	ptr = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
+	ptr = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
 	if (ptr == MAP_FAILED) {
 		perror("mmap");
 		return 1;
 	}
 
-	cached_read = do_copy(fd, buf, ptr);
-	if (cached_read == -1)
-		return 1;
-
-	cached_read = do_copy(fd, buf, ptr);
-	if (cached_read == -1)
-		return 1;
-
-	ret = do_madvise(ring, ptr, FILE_SIZE, MADV_DONTNEED);
+	ret =
+	    do_madvise(ring, ptr + 2 * page_size, FILE_SIZE - page_size,
+		       MADV_REMOVE);
 	if (ret)
 		return 1;
 
-	uncached_read = do_copy(fd, buf, ptr);
-	if (uncached_read == -1)
-		return 1;
-
-	ret = do_madvise(ring, ptr, FILE_SIZE, MADV_DONTNEED);
-	if (ret)
-		return 1;
-
-	ret = do_madvise(ring, ptr, FILE_SIZE, MADV_WILLNEED);
-	if (ret)
-		return 1;
-
-	msync(ptr, FILE_SIZE, MS_SYNC);
-
-	cached_read2 = do_copy(fd, buf, ptr);
-	if (cached_read2 == -1)
-		return 1;
-
-	if (cached_read < uncached_read &&
-	    cached_read2 < uncached_read)
-		return 0;
+	for (size_t i = 0; i < FILE_SIZE; i++) {
+		contents = ptr[i];
+		if (contents && i > page_size) {
+			fprintf(stderr,
+				"In removed page at %lu but contents=%x\n", i,
+				contents);
+			return 2;
+		} else if (contents != 0xaa && i < page_size) {
+			fprintf(stderr,
+				"In non-removed page at %lu but contents=%x\n",
+				i, contents);
+			return 2;
+		}
+	}
 
-	return 2;
+	return 0;
 }
 
 int main(int argc, char *argv[])
 {
 	struct io_uring ring;
-	int ret, i, good, bad;
+	int ret = 0;
 	char *fname;
 
 	if (argc > 1) {
@@ -167,23 +125,12 @@ int main(int argc, char *argv[])
 		goto err;
 	}
 
-	good = bad = 0;
-	for (i = 0; i < LOOPS; i++) {
-		ret = test_madvise(&ring, fname);
-		if (ret == 1) {
-			fprintf(stderr, "test_madvise failed\n");
-			goto err;
-		} else if (!ret)
-			good++;
-		else if (ret == 2)
-			bad++;
-		if (i >= MIN_LOOPS && !bad)
-			break;
+	ret = test_madvise(&ring, fname);
+	if (ret) {
+		fprintf(stderr, "test_madvise failed\n");
+		goto err;
 	}
 
-	/* too hard to reliably test, just ignore */
-	if ((0) && bad > good)
-		fprintf(stderr, "Suspicious timings (%u > %u)\n", bad, good);
 	if (fname != argv[1])
 		unlink(fname);
 	io_uring_queue_exit(&ring);
@@ -191,5 +138,7 @@ int main(int argc, char *argv[])
 err:
 	if (fname != argv[1])
 		unlink(fname);
+	if (ret == 3)
+		return T_EXIT_SKIP;
 	return T_EXIT_FAIL;
 }
-- 
2.45.1


      reply	other threads:[~2024-06-05  5:27 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-01  9:43 madvise/fadvise 32-bit length Stefan
2024-06-01 14:19 ` Jens Axboe
2024-06-01 15:05   ` Jens Axboe
2024-06-01 15:22     ` Stefan
2024-06-01 15:35       ` Jens Axboe
2024-06-01 15:51         ` Stefan
2024-06-01 18:33           ` Jens Axboe
2024-06-02  8:58             ` Stefan
2024-06-02 14:49               ` Jens Axboe
2024-06-05  5:25                 ` Stefan [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox