From: David Hildenbrand <[email protected]>
To: Jens Axboe <[email protected]>,
	Andrew Dona-Couch <[email protected]>,
	Andrew Morton <[email protected]>,
	Drew DeVault <[email protected]>
Cc: Ammar Faizi <[email protected]>,
	[email protected], [email protected],
	io_uring Mailing List <[email protected]>,
	Pavel Begunkov <[email protected]>,
	[email protected]
Subject: Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
Date: Tue, 23 Nov 2021 13:02:03 +0100	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

>>>>
>>>> We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or something
>>>> like what was suggested, if that will help move things forward. IMHO the
>>>> 32MB machine is mostly a theoretical case, but whatever .
>>>
>>> 1) I'm deeply concerned about large ZONE_MOVABLE and MIGRATE_CMA ranges
>>> where FOLL_LONGTERM cannot be used, as that memory is not available.
>>>
>>> 2) With 0.1% RAM it's sufficient to start 1000 processes to break any
>>> system completely and deeply mess up the MM. Oh my.
>>
>> We're talking per-user limits here. But if you want to talk hyperbole,
>> then 64K multiplied by some other random number will also allow
>> everything to be pinned, potentially.
>>
> 
> Right, it's per-user. 0.1% per user FOLL_LONGTERM locked into memory in
> the worst case.
> 

To make it clear why I keep complaining about FOLL_LONGTERM for
unprivileged users even if we're talking about "only" 0.1% of RAM ...

On x86-64, a 2 MiB THP (IOW, one pageblock) consists of 512 sub-pages.
If we manage to FOLL_LONGTERM-pin a single sub-page, we make the whole
2 MiB area unusable for a THP: we cannot form a THP at that physical
memory area via compaction/swapping/migration/whatever until we unpin
that single page. We essentially "block" a THP from forming there.

So with a single pinned 4 KiB page we can block one 2 MiB THP, a 512x
amplification. With 0.1% of RAM pinned as individual 4 KiB pages we
can therefore block 0.1% * 512 = 51.2% of all THP. Theoretically, of
course, if the stars align.


... or if we're malicious or unlucky. I wrote a reproducer this morning
that tries blocking as many THP as it can:

https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/io_uring_thp.c
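
To illustrate the principle (this is *not* the linked reproducer, which
sizes everything from the actual limits and works much harder at
maximizing the number of blocked pageblocks), a minimal sketch could
look like the following. It assumes liburing is available (link with
-luring); the constants and the error handling are deliberately
simplified:

/*
 * Sketch only: pin one 4 KiB sub-page per 2 MiB chunk via
 * IORING_REGISTER_BUFFERS, then free the rest of each chunk. The
 * pinned sub-pages keep those pageblocks from forming THP again.
 */
#include <liburing.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

#define THP_SIZE	(2UL << 20)	/* 2 MiB THP/pageblock on x86-64 */
#define PAGE_SZ		4096UL
#define NR_THP		16	/* 16 * 4 KiB = 64 KiB, the old default RLIMIT_MEMLOCK */

int main(void)
{
	struct io_uring ring;
	struct iovec iov[NR_THP];
	unsigned char *map, *area;
	size_t len = (NR_THP + 1) * THP_SIZE;
	unsigned i;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	/* Anonymous area, aligned up to a 2 MiB boundary so that each
	 * chunk can (hopefully) be backed by a single THP. */
	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return 1;
	area = (unsigned char *)(((uintptr_t)map + THP_SIZE - 1) &
				 ~(THP_SIZE - 1));
	madvise(area, NR_THP * THP_SIZE, MADV_HUGEPAGE);
	memset(area, 0, NR_THP * THP_SIZE);	/* fault in, let THPs form */

	/* One 4 KiB buffer per 2 MiB chunk: registering them
	 * FOLL_LONGTERM-pins one sub-page of each (hopefully different)
	 * THP, charged against RLIMIT_MEMLOCK. */
	for (i = 0; i < NR_THP; i++) {
		iov[i].iov_base = area + i * THP_SIZE;
		iov[i].iov_len = PAGE_SZ;
	}
	if (io_uring_register_buffers(&ring, iov, NR_THP) < 0)
		return 1;

	/* Drop the rest of each chunk: the THPs get split, but the
	 * pinned sub-pages stay behind and block those pageblocks from
	 * becoming THP again until the buffers are unregistered. */
	for (i = 0; i < NR_THP; i++)
		madvise(area + i * THP_SIZE + PAGE_SZ, THP_SIZE - PAGE_SZ,
			MADV_DONTNEED);

	pause();	/* keep the pins, and thus the blocked THP, alive */
	return 0;
}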

------------------------------------------------------------------------

Example on my 16 GiB notebook (8192 THP "in theory") with some
applications running in the background.

$ uname -a
Linux t480s 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021
x86_64 x86_64 x86_64 GNU/Linux

$ ./io_uring_thp
PAGE size: 4096 bytes (sensed)
THP size: 2097152 bytes (sensed)
RLIMIT_MEMLOCK: 16777216 bytes (sensed)
IORING_MAX_REG_BUFFERS: 16384 (guess)
Pages per THP: 512
User can block 4096 THP (8589934592 bytes)
Process can block 4096 THP (8589934592 bytes)
Blocking 1 THP
Blocking 2 THP
...
Blocking 3438 THP
Blocking 3439 THP
Blocking 3440 THP
Blocking 3441 THP
Blocking 3442 THP
... and after a while
Blocking 4093 THP
Blocking 4094 THP
Blocking 4095 THP
Blocking 4096 THP

$ cat /proc/`pgrep io_uring_thp`/status
Name:   io_uring_thp
Umask:  0002
State:  S (sleeping)
[...]
VmPeak:     6496 kB
VmSize:     6496 kB
VmLck:         0 kB
VmPin:     16384 kB
VmHWM:      3628 kB
VmRSS:      1580 kB
RssAnon:             160 kB
RssFile:            1420 kB
RssShmem:              0 kB
VmData:     4304 kB
VmStk:       136 kB
VmExe:         8 kB
VmLib:      1488 kB
VmPTE:        48 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
THP_enabled:    1

$ cat /proc/meminfo
MemTotal:       16250920 kB
MemFree:        11648016 kB
MemAvailable:   11972196 kB
Buffers:           50480 kB
Cached:          1156768 kB
SwapCached:        54680 kB
Active:           704788 kB
Inactive:        3477576 kB
Active(anon):     427716 kB
Inactive(anon):  3207604 kB
Active(file):     277072 kB
Inactive(file):   269972 kB
...
Mlocked:            5692 kB
SwapTotal:       8200188 kB
SwapFree:        7742716 kB
...
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB


Let's see how many contiguous 2M pages we can still get as root:
$ echo 1 > /proc/sys/vm/compact_memory
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
0
$ echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
537
... keep retrying a couple of times
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
583

Let's kill the io_uring process and try again:

$ echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
4766
... keep retrying a couple of times
$ echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
4823

------------------------------------------------------------------------

I'm going to leave the judgment of how bad this is or isn't to the
educated reader, and I'll stop spending time on this, as I have more
important things to work on.


To summarize my humble opinion:

1) I am not against raising the default memlock limit if it's for a sane
use case. While mlock itself can be somewhat bad for swap, the real
issue here is FOLL_LONGTERM, which happens to be accounted against the
same memlock limit. This patch explicitly states the
"IORING_REGISTER_BUFFERS" use case, though, and that makes me nervous.

2) Exposing FOLL_LONGTERM to unprivileged users should be avoided as
best we can; in an ideal world, we wouldn't have it at all; in a
sub-optimal world, we'd have it only for use cases that really require
it due to HW limitations. Ideally we'd even have yet another limit for
this, because mlock != FOLL_LONGTERM.

3) IORING_REGISTER_BUFFERS shouldn't use FOLL_LONGTERM for unprivileged
users. We should provide a variant that doesn't rely on FOLL_LONGTERM,
or even on the memlock limit.


Sorry to the patch author for bringing this up in response to the
patch. After all, this patch just does what some distros already do
(many distros even ship higher limits than 8 MiB!). I would be curious
why some distros already have such high values ... and whether it's
already because of IORING_REGISTER_BUFFERS after all.

-- 
Thanks,

David / dhildenb



Thread overview: 58+ messages
2021-10-28  8:08 [PATCH] Increase default MLOCK_LIMIT to 8 MiB Drew DeVault
2021-10-28 18:22 ` Jens Axboe
2021-11-04 14:27   ` Cyril Hrubis
2021-11-04 14:44     ` Jens Axboe
2021-11-06  2:33 ` Ammar Faizi
2021-11-06  7:05   ` Drew DeVault
2021-11-06  7:12     ` Ammar Faizi
2021-11-16  4:35       ` Andrew Morton
2021-11-16  6:32         ` Drew DeVault
2021-11-16 19:47           ` Andrew Morton
2021-11-16 19:48             ` Drew DeVault
2021-11-16 21:37               ` Andrew Morton
2021-11-17  8:23                 ` Drew DeVault
2021-11-22 17:11                 ` David Hildenbrand
2021-11-22 17:55                   ` Andrew Dona-Couch
2021-11-22 18:26                     ` David Hildenbrand
2021-11-22 19:53                       ` Jens Axboe
2021-11-22 20:03                         ` Matthew Wilcox
2021-11-22 20:04                           ` Jens Axboe
2021-11-22 20:08                         ` David Hildenbrand
2021-11-22 20:44                           ` Jens Axboe
2021-11-22 21:56                             ` David Hildenbrand
2021-11-23 12:02                               ` David Hildenbrand [this message]
2021-11-23 13:25                           ` Jason Gunthorpe
2021-11-23 13:39                             ` David Hildenbrand
2021-11-23 14:07                               ` Jason Gunthorpe
2021-11-23 14:44                                 ` David Hildenbrand
2021-11-23 17:00                                   ` Jason Gunthorpe
2021-11-23 17:04                                     ` David Hildenbrand
2021-11-23 22:04                                     ` Vlastimil Babka
2021-11-23 23:59                                       ` Jason Gunthorpe
2021-11-24  8:57                                         ` David Hildenbrand
2021-11-24 13:23                                           ` Jason Gunthorpe
2021-11-24 13:25                                             ` David Hildenbrand
2021-11-24 13:28                                               ` Jason Gunthorpe
2021-11-24 13:29                                                 ` David Hildenbrand
2021-11-24 13:48                                                   ` Jason Gunthorpe
2021-11-24 14:14                                                     ` David Hildenbrand
2021-11-24 15:34                                                       ` Jason Gunthorpe
2021-11-24 16:43                                                         ` David Hildenbrand
2021-11-24 18:35                                                           ` Jason Gunthorpe
2021-11-24 19:09                                                             ` David Hildenbrand
2021-11-24 23:11                                                               ` Jason Gunthorpe
2021-11-30 15:52                                                                 ` David Hildenbrand
2021-11-24 18:37                                                           ` David Hildenbrand
2021-11-24 14:37                                           ` Vlastimil Babka
2021-11-24 14:41                                             ` David Hildenbrand
2021-11-16 18:36         ` Matthew Wilcox
2021-11-16 18:44           ` Drew DeVault
2021-11-16 18:55           ` Jens Axboe
2021-11-16 19:21             ` Vito Caputo
2021-11-16 19:25               ` Drew DeVault
2021-11-16 19:46                 ` Vito Caputo
2021-11-16 19:41               ` Jens Axboe
2021-11-17 22:26         ` Johannes Weiner
2021-11-17 23:17           ` Jens Axboe
2021-11-18 21:58             ` Andrew Morton
2021-11-19  7:41               ` Drew DeVault
