public inbox for [email protected]
From: David Hildenbrand <[email protected]>
To: Jason Gunthorpe <[email protected]>
Cc: Vlastimil Babka <[email protected]>, Jens Axboe <[email protected]>,
	Andrew Dona-Couch <[email protected]>,
	Andrew Morton <[email protected]>,
	Drew DeVault <[email protected]>,
	Ammar Faizi <[email protected]>,
	[email protected], [email protected],
	io_uring Mailing List <[email protected]>,
	Pavel Begunkov <[email protected]>,
	[email protected]
Subject: Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
Date: Wed, 24 Nov 2021 17:43:58 +0100
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>

On 24.11.21 16:34, Jason Gunthorpe wrote:
> On Wed, Nov 24, 2021 at 03:14:00PM +0100, David Hildenbrand wrote:
> 
>> I'm not aware of any where you can fragment 50% of all pageblocks in the
>> system as an unprivileged user essentially consuming almost no memory
>> and essentially staying inside well-defined memlock limits. But sure if
>> there are "many" people will be able to come up with at least one
>> comparable thing. I'll be happy to learn.
> 
> If the concern is that THP's can be DOS'd then any avenue that renders
> the system out of THPs is a DOS attack vector. Including all the
> normal workloads that people run and already complain that THPs get
> exhausted.
> 
> A hostile userspace can only quicken this process.

We can fragment not only THP but, just as easily, smaller compound
pages, though with less impact (well, as long as people want more than
0.1% per user ...).

We want to make more extensive use of THP; the whole folio work is about
using THP. Some people are even working on increasing MAX_ORDER and
introducing gigantic THP.

And here we are having mechanisms available to unprivileged users to
just sabotage the very thing at its core extremely easily. Personally, I
think this is very bad, but that's just my humble opinion.

> 
>> My position that FOLL_LONGTERM for unprivileged users is a strong no-go
>> stands as it is.
> 
> As this basically excludes long standing pre-existing things like
> RDMA, XDP, io_uring, and more I don't think this can be the general
> answer for mm, sorry.

Let's think about options to restrict FOLL_LONGTERM usage:

One option would be to add toggle(s) (e.g., kernel cmdline options) to
make relevant mechanisms (or even FOLL_LONGTERM itself) privileged. The
admin can opt in if unprivileged users should have that capability. A
distro might override the default and set it to "on". I'm not
completely happy about that.

Another option would be not accounting FOLL_LONGTERM as RLIMIT_MEMLOCK,
but instead as something that explicitly matches the differing
semantics. We could have one limit for privileged and one for
unprivileged users. The default in the kernel could be 0, but an
admin/system can override it to opt in and a distro might apply
different rules. Yes, we're back to the original question about limits,
but now with the thought that FOLL_LONGTERM really is different from
mlock and potentially more dangerous.
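
To make the second option a bit more concrete, here is a minimal
userspace sketch of what such separate accounting could look like. All
names in it (longterm_pin_policy, pin_account, longterm_pin_charge,
longterm_pin_uncharge) are hypothetical and do not exist in the kernel;
it only models the policy described above, with a default of 0 for
unprivileged users and a counter kept apart from RLIMIT_MEMLOCK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy knobs; none of these names exist in the kernel. */
struct longterm_pin_policy {
	size_t unpriv_limit_pages;	/* 0 by default: unprivileged pinning is off */
	size_t priv_limit_pages;	/* separate limit for privileged users */
};

/* Hypothetical per-user counter, kept apart from mlock accounting. */
struct pin_account {
	size_t pinned_pages;
	bool privileged;
};

/* Charge nr_pages against the user's long-term pin limit; false means refuse. */
static bool longterm_pin_charge(const struct longterm_pin_policy *policy,
				struct pin_account *acct, size_t nr_pages)
{
	size_t limit = acct->privileged ? policy->priv_limit_pages
					: policy->unpriv_limit_pages;

	/* The first check also covers a limit that was lowered after pinning. */
	if (acct->pinned_pages > limit || nr_pages > limit - acct->pinned_pages)
		return false;

	acct->pinned_pages += nr_pages;
	return true;
}

/* Undo a previous successful charge when the pin is released. */
static void longterm_pin_uncharge(struct pin_account *acct, size_t nr_pages)
{
	assert(nr_pages <= acct->pinned_pages);
	acct->pinned_pages -= nr_pages;
}
```

With unpriv_limit_pages left at 0, every charge from an unprivileged
account is refused until an admin or distro raises the limit, which is
the opt-in behavior described above.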

At the same time, we could eventually work on proper alternatives based
on mmu notifiers (possibly without any such limits) where possible and
required. (I assume that's hardly possible for RDMA because of the way
the hardware works.)

Just some ideas, open for alternatives. I know that for the cases where
we want it to "just work" for unprivileged users but cannot even have
alternative implementations, this is bad.

> 
> Sure, lets stop now since I don't think we can agree.

Don't get me wrong, I really should be working on other stuff, so I have
limited brain capacity and time :) OTOH I'm willing to help at least
discuss alternatives.


Let's think about realistic alternatives to keep FOLL_LONGTERM for any
user working (that would tackle the extreme fragmentation issue at
least, ignoring e.g., other fragmentation we can trigger with
FOLL_LONGTERM or ZONE_MOVABLE/MIGRATE_CMA):

The nasty thing really is splitting a compound page and then pinning
some pages, even if it's pinning the complete compound range. Ideally,
we'd defer any action to the time we actually FOLL_LONGTERM pin a page.
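
As a rough back-of-the-envelope illustration of why this matters, the
sketch below computes how much memory worth of pageblocks a given pin
budget could touch in the worst case. It assumes, purely for
illustration, that pins are accounted per base page and that every
pinned base page ends up in a different pageblock; the function name
is made up for this example.

```c
#include <assert.h>

/*
 * Worst-case bytes of pageblocks that each contain one long-term pinned
 * base page, for a pin budget given in bytes. Assumes (illustration
 * only) that every pinned base page lands in a different pageblock.
 */
static unsigned long long worst_case_blocked_bytes(unsigned long long budget_bytes,
						   unsigned long long page_size,
						   unsigned long long pageblock_size)
{
	assert(page_size > 0 && pageblock_size >= page_size);

	/* One pinned base page per pageblock is enough to affect that block. */
	return (budget_bytes / page_size) * pageblock_size;
}
```

With an 8 MiB budget, 4 KiB base pages and 2 MiB pageblocks (typical
x86-64 values, assumed here), this gives 2048 pinned pages and 4 GiB of
affected pageblocks.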


a) I think we cannot migrate pages when splitting the PMD (e.g., unmap,
MADV_DONTNEED, swap?, page compaction?). User space can just pin the
compound page to block migration.

b) We might migrate pages when splitting the compound page. In
split_huge_page_to_list() we know that we have nobody pinning the page.
I did not check if it's possible. There might be cases where it's not
immediately clear if it's possible (e.g., inside shrink_page_list()).

It would mean that we would migrate pages essentially any time we split
a compound page because there could be someone FOLL_LONGTERM pinning the
page later. Usually we'd expect page compaction to fix this up on actual
demand. I'd call this sub-optimal.

c) We migrate any time someone FOLL_LONGTERM pins a page and the page is
not pinned yet -- because it might have been a split compound page. I
think we can agree that that's not an option :)

d) We remember if a page was part of a compound page and was not freed
yet. If we FOLL_LONGTERM pin such a page, we migrate it first.
Unfortunately, we're short on pageflags for anon pages I think.
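
Option d) can be sketched as a toy userspace model. This is not kernel
code: struct toy_page, its from_split_compound flag and the toy_*
helpers are hypothetical names, and "migration" is only modeled as
clearing the flag, on the assumption that the migration target is a
page that never belonged to the split compound page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum pin_action { PIN_IN_PLACE, MIGRATE_THEN_PIN };

/* Toy stand-in for a page; the flag is the extra state option d) needs. */
struct toy_page {
	bool from_split_compound;	/* set on split, cleared on free */
	unsigned int longterm_pins;
};

/* Splitting a compound page marks every subpage. */
static void toy_split(struct toy_page *pages, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		pages[i].from_split_compound = true;
}

/* Freeing a page forgets that it once belonged to a compound page. */
static void toy_free(struct toy_page *page)
{
	assert(page->longterm_pins == 0);
	page->from_split_compound = false;
}

/* Decide what a FOLL_LONGTERM pin does with this page, then pin it. */
static enum pin_action toy_longterm_pin(struct toy_page *page)
{
	enum pin_action action = PIN_IN_PLACE;

	/* An already pinned page cannot be migrated, so it stays in place. */
	if (page->from_split_compound && page->longterm_pins == 0) {
		/* Models moving the data to a page outside the old compound range. */
		page->from_split_compound = false;
		action = MIGRATE_THEN_PIN;
	}
	page->longterm_pins++;
	return action;
}
```

In this model only the first long-term pin of a not-yet-freed subpage
pays the migration cost; pages that were freed and reallocated after the
split are pinned in place.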

Hm, alternatives?

-- 
Thanks,

David / dhildenb

