From: Stefan <[email protected]>
To: Jens Axboe <[email protected]>, [email protected]
Subject: Re: madvise/fadvise 32-bit length
Date: Sun, 2 Jun 2024 10:58:30 +0200 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On 1/6/2024 20:33, Jens Axboe wrote:
> On 6/1/24 9:51 AM, Stefan wrote:
>> On 1/6/2024 17:35, Jens Axboe wrote:
>>> On 6/1/24 9:22 AM, Stefan wrote:
>>>> On 1/6/2024 17:05, Jens Axboe wrote:
>>>>> On 6/1/24 8:19 AM, Jens Axboe wrote:
>>>>>> On 6/1/24 3:43 AM, Stefan wrote:
>>>>>>> io_uring uses the __u32 len field in order to pass the length to
>>>>>>> madvise and fadvise, but these calls use an off_t, which is 64bit on
>>>>>>> 64bit platforms.
>>>>>>>
>>>>>>> When using liburing, the length is silently truncated to 32bits (so
>>>>>>> 8GB length would become zero, which has a different meaning of "until
>>>>>>> the end of the file" for fadvise).
>>>>>>>
>>>>>>> If my understanding is correct, we could fix this by introducing new
>>>>>>> operations MADVISE64 and FADVISE64, which use the addr3 field instead
>>>>>>> of the length field for length.
>>>>>>
>>>>>> We probably just want to introduce a flag and ensure that older stable
>>>>>> kernels check it, and then use a 64-bit field for it when the flag is
>>>>>> set.
>>>>>
>>>>> I think this should do it on the kernel side, as we already check these
>>>>> fields and return -EINVAL as needed. Should also be trivial to backport.
>>>>> Totally untested... Might want a FEAT flag for this, or something where
>>>>> it's detectable, to make the liburing change straight forward.
>>>>>
>>>>>
>>>>> diff --git a/io_uring/advise.c b/io_uring/advise.c
>>>>> index 7085804c513c..cb7b881665e5 100644
>>>>> --- a/io_uring/advise.c
>>>>> +++ b/io_uring/advise.c
>>>>> @@ -17,14 +17,14 @@
>>>>> struct io_fadvise {
>>>>> struct file *file;
>>>>> u64 offset;
>>>>> - u32 len;
>>>>> + u64 len;
>>>>> u32 advice;
>>>>> };
>>>>> struct io_madvise {
>>>>> struct file *file;
>>>>> u64 addr;
>>>>> - u32 len;
>>>>> + u64 len;
>>>>> u32 advice;
>>>>> };
>>>>> @@ -33,11 +33,13 @@ int io_madvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>> #if defined(CONFIG_ADVISE_SYSCALLS) && defined(CONFIG_MMU)
>>>>> struct io_madvise *ma = io_kiocb_to_cmd(req, struct io_madvise);
>>>>> - if (sqe->buf_index || sqe->off || sqe->splice_fd_in)
>>>>> + if (sqe->buf_index || sqe->splice_fd_in)
>>>>> return -EINVAL;
>>>>> ma->addr = READ_ONCE(sqe->addr);
>>>>> - ma->len = READ_ONCE(sqe->len);
>>>>> + ma->len = READ_ONCE(sqe->off);
>>>>> + if (!ma->len)
>>>>> + ma->len = READ_ONCE(sqe->len);
>>>>> ma->advice = READ_ONCE(sqe->fadvise_advice);
>>>>> req->flags |= REQ_F_FORCE_ASYNC;
>>>>> return 0;
>>>>> @@ -78,11 +80,13 @@ int io_fadvise_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>> {
>>>>> struct io_fadvise *fa = io_kiocb_to_cmd(req, struct io_fadvise);
>>>>> - if (sqe->buf_index || sqe->addr || sqe->splice_fd_in)
>>>>> + if (sqe->buf_index || sqe->splice_fd_in)
>>>>> return -EINVAL;
>>>>> fa->offset = READ_ONCE(sqe->off);
>>>>> - fa->len = READ_ONCE(sqe->len);
>>>>> + fa->len = READ_ONCE(sqe->addr);
>>>>> + if (!fa->len)
>>>>> + fa->len = READ_ONCE(sqe->len);
>>>>> fa->advice = READ_ONCE(sqe->fadvise_advice);
>>>>> if (io_fadvise_force_async(fa))
>>>>> req->flags |= REQ_F_FORCE_ASYNC;
>>>>>
>>>>
>>>>
>>>> If we want to have the length in the same field in both *ADVISE
>>>> operations, we can put a flag in splice_fd_in/optlen.
>>>
>>> I don't think that part matters that much.
>>>
>>>> Maybe the explicit flag is a bit clearer for users of the API
>>>> compared to the implicit flag when setting sqe->len to zero?
>>>
>>> We could go either way. The unused fields returning -EINVAL if set right
>>> now can serve as the flag field - if you have it set, then that is your
>>> length. If not, then the old style is the length. That's the approach I
>>> took, rather than add an explicit flag to it. Existing users that would
>>> set the 64-bit length fields would get -EINVAL already. And since the
>>> normal flags field is already used for advice flags, I'd prefer just
>>> using the existing 64-bit zero fields for it rather than add a flag in
>>> an odd location. Would also make for an easier backport to stable.
>>>
>>> But don't feel that strongly about that part.
>>>
>>> Attached kernel patch with FEAT added, and liburing patch with 64
>>> versions added.
>>>
>>
>> Sounds good!
>> Do we want to do anything about the current (32-bit) functions in
>> liburing? They silently truncate the user's values, so either marking
>> them deprecated or changing the type of length in the arguments to a
>> __u32 could help.
>
> I like changing it to an __u32, and then we'll add a note to the man
> page for them as well (with references to the 64-bit variants).
>
> I still need to write a test and actually test the patches, but I'll get
> to that Monday. If you want to write a test case that checks the 64-bit
> range, then please do!
>
Maybe something like the following for madvise?
Create an 8GB file initialized with 0xaa, punch a (8GB - page_size) hole
using MADV_REMOVE, and check the contents.
It requires support for FALLOC_FL_PUNCH_HOLE in the filesystem.
diff --git a/test/helpers.c b/test/helpers.c
index 779347f..acf1c7d 100644
--- a/test/helpers.c
+++ b/test/helpers.c
@@ -76,8 +76,9 @@ void *t_calloc(size_t nmemb, size_t size)
*/
static void __t_create_file(const char *file, size_t size, char pattern)
{
- ssize_t ret;
- char *buf;
+ ssize_t ret = 0;
+ size_t size_remaining;
+ char *buf, *buf_loc;
int fd;
buf = t_malloc(size);
@@ -86,11 +87,19 @@ static void __t_create_file(const char *file, size_t
size, char pattern)
fd = open(file, O_WRONLY | O_CREAT, 0644);
assert(fd >= 0);
- ret = write(fd, buf, size);
+ size_remaining = size;
+ buf_loc = buf;
+ while (size_remaining > 0) {
+ ret = write(fd, buf_loc, size_remaining);
+ if (ret <= 0)
+ break;
+ size_remaining -= ret;
+ buf_loc += ret;
+ }
fsync(fd);
close(fd);
free(buf);
- assert(ret == size);
+ assert(size_remaining == 0);
}
void t_create_file(const char *file, size_t size)
diff --git a/test/madvise.c b/test/madvise.c
index 7938ec4..b5b0cbe 100644
--- a/test/madvise.c
+++ b/test/madvise.c
@@ -15,34 +15,7 @@
#include "helpers.h"
#include "liburing.h"
-#define FILE_SIZE (128 * 1024)
-
-#define LOOPS 100
-#define MIN_LOOPS 10
-
-static unsigned long long utime_since(const struct timeval *s,
- const struct timeval *e)
-{
- long long sec, usec;
-
- sec = e->tv_sec - s->tv_sec;
- usec = (e->tv_usec - s->tv_usec);
- if (sec > 0 && usec < 0) {
- sec--;
- usec += 1000000;
- }
-
- sec *= 1000000;
- return sec + usec;
-}
-
-static unsigned long long utime_since_now(struct timeval *tv)
-{
- struct timeval end;
-
- gettimeofday(&end, NULL);
- return utime_since(tv, &end);
-}
+#define FILE_SIZE (8ULL * 1024ULL * 1024ULL * 1024ULL)
static int do_madvise(struct io_uring *ring, void *addr, off_t len,
int advice)
{
@@ -76,83 +49,62 @@ static int do_madvise(struct io_uring *ring, void
*addr, off_t len, int advice)
unlink(".madvise.tmp");
exit(0);
} else if (ret) {
- fprintf(stderr, "cqe->res=%d\n", cqe->res);
+ fprintf(stderr, "cqe->res=%d (%s)\n", cqe->res,
+ strerror(-cqe->res));
}
io_uring_cqe_seen(ring, cqe);
return ret;
}
-static long do_copy(int fd, char *buf, void *ptr)
-{
- struct timeval tv;
-
- gettimeofday(&tv, NULL);
- memcpy(buf, ptr, FILE_SIZE);
- return utime_since_now(&tv);
-}
-
static int test_madvise(struct io_uring *ring, const char *filename)
{
- unsigned long cached_read, uncached_read, cached_read2;
+ size_t page_size;
+ unsigned char contents;
int fd, ret;
- char *buf;
- void *ptr;
+ unsigned char *ptr;
+
+ page_size = sysconf(_SC_PAGE_SIZE);
- fd = open(filename, O_RDONLY);
+ fd = open(filename, O_RDWR);
if (fd < 0) {
perror("open");
return 1;
}
- buf = t_malloc(FILE_SIZE);
-
- ptr = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
+ ptr = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (ptr == MAP_FAILED) {
perror("mmap");
return 1;
}
- cached_read = do_copy(fd, buf, ptr);
- if (cached_read == -1)
- return 1;
-
- cached_read = do_copy(fd, buf, ptr);
- if (cached_read == -1)
- return 1;
-
- ret = do_madvise(ring, ptr, FILE_SIZE, MADV_DONTNEED);
- if (ret)
- return 1;
-
- uncached_read = do_copy(fd, buf, ptr);
- if (uncached_read == -1)
- return 1;
-
- ret = do_madvise(ring, ptr, FILE_SIZE, MADV_DONTNEED);
- if (ret)
- return 1;
-
- ret = do_madvise(ring, ptr, FILE_SIZE, MADV_WILLNEED);
+ ret =
+ do_madvise(ring, ptr + page_size, FILE_SIZE - page_size,
+ MADV_REMOVE);
if (ret)
return 1;
- msync(ptr, FILE_SIZE, MS_SYNC);
-
- cached_read2 = do_copy(fd, buf, ptr);
- if (cached_read2 == -1)
- return 1;
-
- if (cached_read < uncached_read &&
- cached_read2 < uncached_read)
- return 0;
+ for (size_t i = 0; i < FILE_SIZE; i++) {
+ contents = ptr[i];
+ if (contents && i > page_size) {
+ fprintf(stderr,
+ "In removed page at %lu but contents=%x\n", i,
+ contents);
+ return 2;
+ } else if (contents != 0xaa && i < page_size) {
+ fprintf(stderr,
+ "In non-removed page at %lu but contents=%x\n",
+ i, contents);
+ return 2;
+ }
+ }
- return 2;
+ return 0;
}
int main(int argc, char *argv[])
{
struct io_uring ring;
- int ret, i, good, bad;
+ int ret;
char *fname;
if (argc > 1) {
@@ -167,23 +119,12 @@ int main(int argc, char *argv[])
goto err;
}
- good = bad = 0;
- for (i = 0; i < LOOPS; i++) {
- ret = test_madvise(&ring, fname);
- if (ret == 1) {
- fprintf(stderr, "test_madvise failed\n");
- goto err;
- } else if (!ret)
- good++;
- else if (ret == 2)
- bad++;
- if (i >= MIN_LOOPS && !bad)
- break;
+ ret = test_madvise(&ring, fname);
+ if (ret) {
+ fprintf(stderr, "test_madvise failed\n");
+ goto err;
}
- /* too hard to reliably test, just ignore */
- if ((0) && bad > good)
- fprintf(stderr, "Suspicious timings (%u > %u)\n", bad, good);
if (fname != argv[1])
unlink(fname);
io_uring_queue_exit(&ring);
next prev parent reply other threads:[~2024-06-02 8:58 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-06-01 9:43 madvise/fadvise 32-bit length Stefan
2024-06-01 14:19 ` Jens Axboe
2024-06-01 15:05 ` Jens Axboe
2024-06-01 15:22 ` Stefan
2024-06-01 15:35 ` Jens Axboe
2024-06-01 15:51 ` Stefan
2024-06-01 18:33 ` Jens Axboe
2024-06-02 8:58 ` Stefan [this message]
2024-06-02 14:49 ` Jens Axboe
2024-06-05 5:25 ` Stefan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox