From: Stefan <[email protected]>
To: Jens Axboe <[email protected]>, [email protected]
Subject: Re: madvise/fadvise 32-bit length
Date: Wed, 5 Jun 2024 07:25:16 +0200 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
[-- Attachment #1: Type: text/plain, Size: 6150 bytes --]
On 2/6/2024 16:49, Jens Axboe wrote:
> On 6/2/24 2:58 AM, Stefan wrote:
>> On 1/6/2024 20:33, Jens Axboe wrote:
>>> On 6/1/24 9:51 AM, Stefan wrote:
>>>> On 1/6/2024 17:35, Jens Axboe wrote:
>>>>> On 6/1/24 9:22 AM, Stefan wrote:
>>>>>> On 1/6/2024 17:05, Jens Axboe wrote:
>>>>>>> On 6/1/24 8:19 AM, Jens Axboe wrote:
>>>>>>>> On 6/1/24 3:43 AM, Stefan wrote:
>>>>>>>>> io_uring uses the __u32 len field in order to pass the length to
>>>>>>>>> madvise and fadvise, but these calls use an off_t, which is 64bit on
>>>>>>>>> 64bit platforms.
>>>>>>>>>
>>>>>>>>> When using liburing, the length is silently truncated to 32bits (so
>>>>>>>>> 8GB length would become zero, which has a different meaning of "until
>>>>>>>>> the end of the file" for fadvise).
>>>>>>>>>
>>>>>>>>> If my understanding is correct, we could fix this by introducing new
>>>>>>>>> operations MADVISE64 and FADVISE64, which use the addr3 field instead
>>>>>>>>> of the length field for length.
>>>>>>>>
>>>>>>>> We probably just want to introduce a flag and ensure that older stable
>>>>>>>> kernels check it, and then use a 64-bit field for it when the flag is
>>>>>>>> set.
>>>>>>>
>>>>>>> I think this should do it on the kernel side, as we already check these
>>>>>>> fields and return -EINVAL as needed. Should also be trivial to backport.
>>>>>>> Totally untested... Might want a FEAT flag for this, or something where
>>>>>>> it's detectable, to make the liburing change straight forward.
>>>>>>>
>>>>>>>
>>>>>>> diff --git a/io_uring/advise.c b/io_uring/advise.c
>>>>>>> index 7085804c513c..cb7b881665e5 100644
>>>>>>> --- a/io_uring/advise.c
>>>>>>> +++ b/io_uring/advise.c
>>>>>>> @@ -17,14 +17,14 @@
>>>>>>> struct io_fadvise {
>>>>>>> struct file *file;
>>>>>>> u64 offset;
>>>>>>> - u32 len;
>>>>>>> + u64 len;
>>>>>>> u32 advice;
>>>>>>> };
>>>>>>> struct io_madvise {
>>>>>>> struct file *file;
>>>>>>> u64 addr;
>>>>>>> - u32 len;
>>>>>>> + u64 len;
>>>>>>> u32 advice;
>>>>>>> };
>>>>>>> @@ -33,11 +33,13 @@ int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>>> #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU)
>>>>>>> struct io_madvise *ma = io_kiocb_to_cmd(req, struct io_madvise);
>>>>>>> - if (sqe->buf_index || sqe->off || sqe->splice_fd_in)
>>>>>>> + if (sqe->buf_index || sqe->splice_fd_in)
>>>>>>> return -EINVAL;
>>>>>>> ma->addr = READ_ONCE(sqe->addr);
>>>>>>> - ma->len = READ_ONCE(sqe->len);
>>>>>>> + ma->len = READ_ONCE(sqe->off);
>>>>>>> + if (!ma->len)
>>>>>>> + ma->len = READ_ONCE(sqe->len);
>>>>>>> ma->advice = READ_ONCE(sqe->fadvise_advice);
>>>>>>> req->flags |= REQ_F_FORCE_ASYNC;
>>>>>>> return 0;
>>>>>>> @@ -78,11 +80,13 @@ int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>>> {
>>>>>>> struct io_fadvise *fa = io_kiocb_to_cmd(req, struct io_fadvise);
>>>>>>> - if (sqe->buf_index || sqe->addr || sqe->splice_fd_in)
>>>>>>> + if (sqe->buf_index || sqe->splice_fd_in)
>>>>>>> return -EINVAL;
>>>>>>> fa->offset = READ_ONCE(sqe->off);
>>>>>>> - fa->len = READ_ONCE(sqe->len);
>>>>>>> + fa->len = READ_ONCE(sqe->addr);
>>>>>>> + if (!fa->len)
>>>>>>> + fa->len = READ_ONCE(sqe->len);
>>>>>>> fa->advice = READ_ONCE(sqe->fadvise_advice);
>>>>>>> if (io_fadvise_force_async(fa))
>>>>>>> req->flags |= REQ_F_FORCE_ASYNC;
>>>>>>>
>>>>>>
>>>>>>
>>>>>> If we want to have the length in the same field in both *ADVISE
>>>>>> operations, we can put a flag in splice_fd_in/optlen.
>>>>>
>>>>> I don't think that part matters that much.
>>>>>
>>>>>> Maybe the explicit flag is a bit clearer for users of the API
>>>>>> compared to the implicit flag when setting sqe->len to zero?
>>>>>
>>>>> We could go either way. The unused fields returning -EINVAL if set right
>>>>> now can serve as the flag field - if you have it set, then that is your
>>>>> length. If not, then the old style is the length. That's the approach I
>>>>> took, rather than add an explicit flag to it. Existing users that would
>>>>> set the 64-bit length fields would get -EINVAL already. And since the
>>>>> normal flags field is already used for advice flags, I'd prefer just
>>>>> using the existing 64-bit zero fields for it rather than add a flag in
>>>>> an odd location. Would also make for an easier backport to stable.
>>>>>
>>>>> But don't feel that strongly about that part.
>>>>>
>>>>> Attached kernel patch with FEAT added, and liburing patch with 64
>>>>> versions added.
>>>>>
>>>>
>>>> Sounds good!
>>>> Do we want to do anything about the current (32-bit) functions in
>>>> liburing? They silently truncate the user's values, so either marking
>>>> them deprecated or changing the type of length in the arguments to a
>>>> __u32 could help.
>>>
>>> I like changing it to an __u32, and then we'll add a note to the man
>>> page for them as well (with references to the 64-bit variants).
>>>
>>> I still need to write a test and actually test the patches, but I'll get
>>> to that Monday. If you want to write a test case that checks the 64-bit
>>> range, then please do!
>>>
>>
>> Maybe something like the following for madvise?
>> Create an 8GB file initialized with 0xaa, punch a (8GB - page_size)
>> hole using MADV_REMOVE, and check the contents. It requires support
>> for FALLOC_FL_PUNCH_HOLE in the filesystem.
>
> I think that looks very reasonable, and it's better than the DONTNEED
> and timings, it was always a pretty shitty test. We just need to ensure
> that we return T_EXIT_SKIP if the fs it's being run on doesn't support
> punching holes.
>
> FWIW, I did put the liburing changes in an 'advise' branch, so you could
> generate a patch against that. Once we're happy with it, it can get
> pulled into master.
>
Here's the patch against your branch.
[-- Attachment #2: 0001-Change-madvise-test-to-use-MADV_REMOVE.patch --]
[-- Type: text/x-patch, Size: 7485 bytes --]
From 1570cfdc494c9f38bac84f7fc837f69f99888ac0 Mon Sep 17 00:00:00 2001
From: Stefan Muenzel <[email protected]>
Date: Mon, 3 Jun 2024 18:40:27 +0200
Subject: [PATCH] Change madvise test to use MADV_REMOVE
Create an 8GB file initialized with 0xaa, punch a (8GB - page_size)
hole using MADV_REMOVE, and check the contents.
Requires support for FALLOC_FL_PUNCH_HOLE in the filesystem, which
is checked for
Signed-off-by: Stefan Muenzel <[email protected]>
---
test/helpers.c | 39 +++++++++------
test/madvise.c | 131 +++++++++++++++----------------------------------
2 files changed, 64 insertions(+), 106 deletions(-)
diff --git a/test/helpers.c b/test/helpers.c
index 779347f..ff5273d 100644
--- a/test/helpers.c
+++ b/test/helpers.c
@@ -76,8 +76,9 @@ void *t_calloc(size_t nmemb, size_t size)
*/
static void __t_create_file(const char *file, size_t size, char pattern)
{
- ssize_t ret;
- char *buf;
+ ssize_t ret = 0;
+ size_t size_remaining;
+ char *buf, *buf_loc;
int fd;
buf = t_malloc(size);
@@ -86,11 +87,19 @@ static void __t_create_file(const char *file, size_t size, char pattern)
fd = open(file, O_WRONLY | O_CREAT, 0644);
assert(fd >= 0);
- ret = write(fd, buf, size);
+ size_remaining = size;
+ buf_loc = buf;
+ while (size_remaining > 0) {
+ ret = write(fd, buf_loc, size_remaining);
+ if (ret <= 0)
+ break;
+ size_remaining -= ret;
+ buf_loc += ret;
+ }
fsync(fd);
close(fd);
free(buf);
- assert(ret == size);
+ assert(size_remaining == 0);
}
void t_create_file(const char *file, size_t size)
@@ -264,7 +273,7 @@ bool t_probe_defer_taskrun(void)
int ret;
ret = io_uring_queue_init(1, &ring, IORING_SETUP_SINGLE_ISSUER |
- IORING_SETUP_DEFER_TASKRUN);
+ IORING_SETUP_DEFER_TASKRUN);
if (ret < 0)
return false;
io_uring_queue_exit(&ring);
@@ -291,12 +300,12 @@ unsigned __io_uring_flush_sq(struct io_uring *ring)
io_uring_smp_store_release(sq->ktail, tail);
}
/*
- * This load needs to be atomic, since sq->khead is written concurrently
- * by the kernel, but it doesn't need to be load_acquire, since the
- * kernel doesn't store to the submission queue; it advances khead just
- * to indicate that it's finished reading the submission queue entries
- * so they're available for us to write to.
- */
+ * This load needs to be atomic, since sq->khead is written concurrently
+ * by the kernel, but it doesn't need to be load_acquire, since the
+ * kernel doesn't store to the submission queue; it advances khead just
+ * to indicate that it's finished reading the submission queue entries
+ * so they're available for us to write to.
+ */
return tail - IO_URING_READ_ONCE(*sq->khead);
}
@@ -306,13 +315,13 @@ unsigned __io_uring_flush_sq(struct io_uring *ring)
void t_error(int status, int errnum, const char *format, ...)
{
va_list args;
- va_start(args, format);
+ va_start(args, format);
vfprintf(stderr, format, args);
- if (errnum)
- fprintf(stderr, ": %s", strerror(errnum));
+ if (errnum)
+ fprintf(stderr, ": %s", strerror(errnum));
fprintf(stderr, "\n");
va_end(args);
- exit(status);
+ exit(status);
}
diff --git a/test/madvise.c b/test/madvise.c
index 7938ec4..cf5cf4c 100644
--- a/test/madvise.c
+++ b/test/madvise.c
@@ -15,34 +15,7 @@
#include "helpers.h"
#include "liburing.h"
-#define FILE_SIZE (128 * 1024)
-
-#define LOOPS 100
-#define MIN_LOOPS 10
-
-static unsigned long long utime_since(const struct timeval *s,
- const struct timeval *e)
-{
- long long sec, usec;
-
- sec = e->tv_sec - s->tv_sec;
- usec = (e->tv_usec - s->tv_usec);
- if (sec > 0 && usec < 0) {
- sec--;
- usec += 1000000;
- }
-
- sec *= 1000000;
- return sec + usec;
-}
-
-static unsigned long long utime_since_now(struct timeval *tv)
-{
- struct timeval end;
-
- gettimeofday(&end, NULL);
- return utime_since(tv, &end);
-}
+#define FILE_SIZE (8ULL * 1024ULL * 1024ULL * 1024ULL)
static int do_madvise(struct io_uring *ring, void *addr, off_t len, int advice)
{
@@ -76,83 +49,68 @@ static int do_madvise(struct io_uring *ring, void *addr, off_t len, int advice)
unlink(".madvise.tmp");
exit(0);
} else if (ret) {
- fprintf(stderr, "cqe->res=%d\n", cqe->res);
+ fprintf(stderr, "cqe->res=%d (%s)\n", cqe->res,
+ strerror(-cqe->res));
}
io_uring_cqe_seen(ring, cqe);
return ret;
}
-static long do_copy(int fd, char *buf, void *ptr)
-{
- struct timeval tv;
-
- gettimeofday(&tv, NULL);
- memcpy(buf, ptr, FILE_SIZE);
- return utime_since_now(&tv);
-}
-
static int test_madvise(struct io_uring *ring, const char *filename)
{
- unsigned long cached_read, uncached_read, cached_read2;
+ size_t page_size;
+ unsigned char contents;
int fd, ret;
- char *buf;
- void *ptr;
+ unsigned char *ptr;
- fd = open(filename, O_RDONLY);
+ page_size = sysconf(_SC_PAGE_SIZE);
+
+ fd = open(filename, O_RDWR);
if (fd < 0) {
perror("open");
return 1;
}
- buf = t_malloc(FILE_SIZE);
+ ret =
+ fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, page_size,
+ page_size);
+ if (ret == -1 && errno == EOPNOTSUPP)
+ return 3;
- ptr = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
+ ptr = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (ptr == MAP_FAILED) {
perror("mmap");
return 1;
}
- cached_read = do_copy(fd, buf, ptr);
- if (cached_read == -1)
- return 1;
-
- cached_read = do_copy(fd, buf, ptr);
- if (cached_read == -1)
- return 1;
-
- ret = do_madvise(ring, ptr, FILE_SIZE, MADV_DONTNEED);
+ ret =
+ do_madvise(ring, ptr + 2 * page_size, FILE_SIZE - page_size,
+ MADV_REMOVE);
if (ret)
return 1;
- uncached_read = do_copy(fd, buf, ptr);
- if (uncached_read == -1)
- return 1;
-
- ret = do_madvise(ring, ptr, FILE_SIZE, MADV_DONTNEED);
- if (ret)
- return 1;
-
- ret = do_madvise(ring, ptr, FILE_SIZE, MADV_WILLNEED);
- if (ret)
- return 1;
-
- msync(ptr, FILE_SIZE, MS_SYNC);
-
- cached_read2 = do_copy(fd, buf, ptr);
- if (cached_read2 == -1)
- return 1;
-
- if (cached_read < uncached_read &&
- cached_read2 < uncached_read)
- return 0;
+ for (size_t i = 0; i < FILE_SIZE; i++) {
+ contents = ptr[i];
+ if (contents && i > page_size) {
+ fprintf(stderr,
+ "In removed page at %lu but contents=%x\n", i,
+ contents);
+ return 2;
+ } else if (contents != 0xaa && i < page_size) {
+ fprintf(stderr,
+ "In non-removed page at %lu but contents=%x\n",
+ i, contents);
+ return 2;
+ }
+ }
- return 2;
+ return 0;
}
int main(int argc, char *argv[])
{
struct io_uring ring;
- int ret, i, good, bad;
+ int ret = 0;
char *fname;
if (argc > 1) {
@@ -167,23 +125,12 @@ int main(int argc, char *argv[])
goto err;
}
- good = bad = 0;
- for (i = 0; i < LOOPS; i++) {
- ret = test_madvise(&ring, fname);
- if (ret == 1) {
- fprintf(stderr, "test_madvise failed\n");
- goto err;
- } else if (!ret)
- good++;
- else if (ret == 2)
- bad++;
- if (i >= MIN_LOOPS && !bad)
- break;
+ ret = test_madvise(&ring, fname);
+ if (ret) {
+ fprintf(stderr, "test_madvise failed\n");
+ goto err;
}
- /* too hard to reliably test, just ignore */
- if ((0) && bad > good)
- fprintf(stderr, "Suspicious timings (%u > %u)\n", bad, good);
if (fname != argv[1])
unlink(fname);
io_uring_queue_exit(&ring);
@@ -191,5 +138,7 @@ int main(int argc, char *argv[])
err:
if (fname != argv[1])
unlink(fname);
+ if (ret == 3)
+ return T_EXIT_SKIP;
return T_EXIT_FAIL;
}
--
2.45.1
prev parent reply other threads:[~2024-06-05 5:27 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-06-01 9:43 madvise/fadvise 32-bit length Stefan
2024-06-01 14:19 ` Jens Axboe
2024-06-01 15:05 ` Jens Axboe
2024-06-01 15:22 ` Stefan
2024-06-01 15:35 ` Jens Axboe
2024-06-01 15:51 ` Stefan
2024-06-01 18:33 ` Jens Axboe
2024-06-02 8:58 ` Stefan
2024-06-02 14:49 ` Jens Axboe
2024-06-05 5:25 ` Stefan [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox