* [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Drew DeVault @ 2021-10-28 8:08 UTC
To: linux-kernel, linux-api, io-uring
Cc: Drew DeVault, Jens Axboe, Pavel Begunkov

This limit has not been updated since 2008, when it was increased to
64 KiB at the request of GnuPG. Until recently, the main use-cases for
this feature were (1) preventing sensitive memory from being swapped,
as in GnuPG's use-case; and (2) real-time use-cases. In the first case,
little memory is called for, and in the second case, the user is
generally in a position to increase it if they need more.

The introduction of IORING_REGISTER_BUFFERS adds a third use-case:
preparing fixed buffers for high-performance I/O. This use-case will
take as much of this memory as it can get, but is still limited to
64 KiB by default, which is very little. This increases the limit to
8 MiB, which was chosen fairly arbitrarily as a more generous, but
still conservative, default value.
---
It is also possible to raise this limit in userspace. This is easily
done, for example, in the use-case of a network daemon: systemd, for
instance, provides for this via LimitMEMLOCK in the service file;
OpenRC via the rc_ulimit variables. However, there is no established
userspace facility for configuring this outside of daemons: end-user
applications do not presently have access to a convenient means of
raising their limits.

The buck, as it were, stops with the kernel. It's much easier to
address it here than it is to bring it to hundreds of distributions,
and it can only realistically be relied upon to be high enough by
end-user software if it is more-or-less ubiquitous. Most distros don't
change this particular rlimit from the kernel-supplied default value,
so a change here will easily provide that ubiquity.

 include/uapi/linux/resource.h | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/resource.h b/include/uapi/linux/resource.h
index 74ef57b38f9f..c858c3c85fae 100644
--- a/include/uapi/linux/resource.h
+++ b/include/uapi/linux/resource.h
@@ -66,10 +66,17 @@ struct rlimit64 {
 #define _STK_LIM	(8*1024*1024)
 
 /*
- * GPG2 wants 64kB of mlocked memory, to make sure pass phrases
- * and other sensitive information are never written to disk.
+ * Limit the amount of locked memory by some sane default:
+ * root can always increase this limit if needed.
+ *
+ * The main use-cases are (1) preventing sensitive memory
+ * from being swapped; (2) real-time operations; (3) via
+ * IOURING_REGISTER_BUFFERS.
+ *
+ * The first two don't need much. The latter will take as
+ * much as it can get. 8MB is a reasonably sane default.
  */
-#define MLOCK_LIMIT	((PAGE_SIZE > 64*1024) ? PAGE_SIZE : 64*1024)
+#define MLOCK_LIMIT	((PAGE_SIZE > 8*1024*1024) ? PAGE_SIZE : 8*1024*1024)
 
 /*
  * Due to binary compatibility, the actual resource numbers
-- 
2.33.1

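[For context on the userspace side discussed in the changelog: this is a
minimal sketch, assuming only the standard getrlimit()/setrlimit()
interfaces, of what an application can do on its own today. The soft
RLIMIT_MEMLOCK can be raised to the hard limit without privilege, but
going past the hard limit still requires root, a limits.conf/pam_limits
entry, or a service-manager override such as systemd's LimitMEMLOCK=,
which is exactly why the kernel-supplied default matters.]

/* Minimal sketch (not part of the patch): raise the soft RLIMIT_MEMLOCK
 * to the hard limit before registering io_uring buffers. An unprivileged
 * process cannot go beyond the hard limit.
 */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
		perror("getrlimit");
		return 1;
	}

	rl.rlim_cur = rl.rlim_max;	/* best an unprivileged task can do */
	if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		perror("setrlimit");

	printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu\n",
	       (unsigned long long)rl.rlim_cur,
	       (unsigned long long)rl.rlim_max);
	return 0;
}
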
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Jens Axboe @ 2021-10-28 18:22 UTC
To: Drew DeVault, linux-kernel, linux-api, io-uring
Cc: Pavel Begunkov

On 10/28/21 2:08 AM, Drew DeVault wrote:
> This limit has not been updated since 2008, when it was increased to
> 64 KiB at the request of GnuPG. [...]
>
> Most distros don't change this particular rlimit from the
> kernel-supplied default value, so a change here will easily provide
> that ubiquity.

Agree with raising this limit, it is ridiculously low and we often get
reports from people who can't even do basic rings with it, particularly
when bpf is involved as well, as it also dips into this pool.

On the production side at Facebook, we do raise this limit as well.

Acked-by: Jens Axboe <[email protected]>

-- 
Jens Axboe

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Cyril Hrubis @ 2021-11-04 14:27 UTC
To: Jens Axboe
Cc: Drew DeVault, linux-kernel, linux-api, io-uring, Pavel Begunkov

Hi!
> > This limit has not been updated since 2008, when it was increased to
> > 64 KiB at the request of GnuPG. [...]
>
> Agree with raising this limit, it is ridiculously low and we often get
> reports from people who can't even do basic rings with it, particularly
> when bpf is involved as well, as it also dips into this pool.
>
> On the production side at Facebook, we do raise this limit as well.

We are raising this limit to 2 MB for LTP test cases as well, otherwise
we get failures when we run a few bpf tests in quick succession.

Acked-by: Cyril Hrubis <[email protected]>

-- 
Cyril Hrubis
[email protected]

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Jens Axboe @ 2021-11-04 14:44 UTC
To: Cyril Hrubis
Cc: Drew DeVault, linux-kernel, linux-api, io-uring, Pavel Begunkov, Andrew Morton

On 11/4/21 8:27 AM, Cyril Hrubis wrote:
> [...]
> We are raising this limit to 2 MB for LTP test cases as well, otherwise
> we get failures when we run a few bpf tests in quick succession.
>
> Acked-by: Cyril Hrubis <[email protected]>

Andrew, care to pick this one up for 5.16?

-- 
Jens Axboe

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Ammar Faizi @ 2021-11-06 2:33 UTC
To: Drew DeVault
Cc: linux-kernel, linux-api, io_uring Mailing List, Jens Axboe, Pavel Begunkov

On Thu, Oct 28, 2021 at 3:08 PM Drew DeVault wrote:
>
> This limit has not been updated since 2008, when it was increased to
> 64 KiB at the request of GnuPG. [...]
>
>  include/uapi/linux/resource.h | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)

Forgot to add a Signed-off-by tag from the author?

-- 
Ammar Faizi

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Drew DeVault @ 2021-11-06 7:05 UTC
To: Ammar Faizi
Cc: linux-kernel, linux-api, io_uring Mailing List, Jens Axboe, Pavel Begunkov

Should I send a v2 or is this email sufficient:

Signed-off-by: Drew DeVault <[email protected]>

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Ammar Faizi @ 2021-11-06 7:12 UTC
To: Drew DeVault
Cc: linux-kernel, linux-api, io_uring Mailing List, Jens Axboe, Pavel Begunkov, Andrew Morton

On 11/6/21 2:05 PM, Drew DeVault wrote:
> Should I send a v2 or is this email sufficient:
>
> Signed-off-by: Drew DeVault <[email protected]>

Oops, I missed akpm from the CC list. Added Andrew.

Cc: Andrew Morton <[email protected]>
Ref: https://lore.kernel.org/io-uring/CFII8LNSW5XH.3OTIVFYX8P65Y@taiga/

-- 
Ammar Faizi

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Andrew Morton @ 2021-11-16 4:35 UTC
To: Ammar Faizi
Cc: Drew DeVault, linux-kernel, linux-api, io_uring Mailing List, Jens Axboe, Pavel Begunkov, linux-mm

On Sat, 6 Nov 2021 14:12:45 +0700 Ammar Faizi <[email protected]> wrote:

> On 11/6/21 2:05 PM, Drew DeVault wrote:
> > Should I send a v2 or is this email sufficient:
> >
> > Signed-off-by: Drew DeVault <[email protected]>
>
> Oops, I missed akpm from the CC list. Added Andrew.
>
> Cc: Andrew Morton <[email protected]>
> Ref: https://lore.kernel.org/io-uring/CFII8LNSW5XH.3OTIVFYX8P65Y@taiga/

Let's cc linux-mm as well.

Unfortunately I didn't know about this until Nov 4, which was formally
too late for 5.16. I guess I could try to sneak it past Linus if someone
were to send me some sufficiently convincing words explaining the
urgency.

I'd also be interested in seeing feedback from the MM developers.

And a question: rather than messing around with a constant which will
need to be increased again in a couple of years, can we solve this once
and for all? For example, permit root to set the system-wide
per-process max mlock size and depend upon initscripts to do this
appropriately.


From: Drew DeVault <[email protected]>
Subject: Increase default MLOCK_LIMIT to 8 MiB

This limit has not been updated since 2008, when it was increased to
64 KiB at the request of GnuPG. [...] Most distros don't change this
particular rlimit from the kernel-supplied default value, so a change
here will easily provide that ubiquity.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Drew DeVault <[email protected]>
Acked-by: Jens Axboe <[email protected]>
Acked-by: Cyril Hrubis <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

 include/uapi/linux/resource.h | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

--- a/include/uapi/linux/resource.h~increase-default-mlock_limit-to-8-mib
+++ a/include/uapi/linux/resource.h
@@ -66,10 +66,17 @@ struct rlimit64 {
 #define _STK_LIM	(8*1024*1024)
 
 /*
- * GPG2 wants 64kB of mlocked memory, to make sure pass phrases
- * and other sensitive information are never written to disk.
+ * Limit the amount of locked memory by some sane default:
+ * root can always increase this limit if needed.
+ *
+ * The main use-cases are (1) preventing sensitive memory
+ * from being swapped; (2) real-time operations; (3) via
+ * IOURING_REGISTER_BUFFERS.
+ *
+ * The first two don't need much. The latter will take as
+ * much as it can get. 8MB is a reasonably sane default.
  */
-#define MLOCK_LIMIT	((PAGE_SIZE > 64*1024) ? PAGE_SIZE : 64*1024)
+#define MLOCK_LIMIT	((PAGE_SIZE > 8*1024*1024) ? PAGE_SIZE : 8*1024*1024)
 
 /*
  * Due to binary compatibility, the actual resource numbers
_

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Drew DeVault @ 2021-11-16 6:32 UTC
To: Andrew Morton, Ammar Faizi
Cc: linux-kernel, linux-api, io_uring Mailing List, Jens Axboe, Pavel Begunkov, linux-mm

On Tue Nov 16, 2021 at 5:35 AM CET, Andrew Morton wrote:
> Unfortunately I didn't know about this until Nov 4, which was formally
> too late for 5.16. I guess I could try to sneak it past Linus if
> someone were to send me some sufficiently convincing words explaining
> the urgency.

I don't think it's that urgent, but I also wouldn't protest if someone
wants to usher it in sooner rather than later.

> And a question: rather than messing around with a constant which will
> need to be increased again in a couple of years, can we solve this
> once and for all? For example, permit root to set the system-wide
> per-process max mlock size and depend upon initscripts to do this
> appropriately.

Not sure I understand. Root and init scripts can already manage this
number - the goal of this patch is just to provide a saner default.

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Andrew Morton @ 2021-11-16 19:47 UTC
To: Drew DeVault
Cc: Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Jens Axboe, Pavel Begunkov, linux-mm

On Tue, 16 Nov 2021 07:32:33 +0100 "Drew DeVault" <[email protected]> wrote:

> > And a question: rather than messing around with a constant which
> > will need to be increased again in a couple of years, can we solve
> > this once and for all? For example, permit root to set the
> > system-wide per-process max mlock size and depend upon initscripts
> > to do this appropriately.
>
> Not sure I understand. Root and init scripts can already manage this
> number - the goal of this patch is just to provide a saner default.

Well, why change the default? Surely anyone who cares is altering it at
runtime anyway. And if they are not, we should encourage them to do so?

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Drew DeVault @ 2021-11-16 19:48 UTC
To: Andrew Morton
Cc: Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Jens Axboe, Pavel Begunkov, linux-mm

On Tue Nov 16, 2021 at 8:47 PM CET, Andrew Morton wrote:
> Well, why change the default? Surely anyone who cares is altering it
> at runtime anyway. And if they are not, we should encourage them to
> do so?

I addressed this question in the original patch's commit message.

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Andrew Morton @ 2021-11-16 21:37 UTC
To: Drew DeVault
Cc: Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Jens Axboe, Pavel Begunkov, linux-mm

On Tue, 16 Nov 2021 20:48:48 +0100 "Drew DeVault" <[email protected]> wrote:

> On Tue Nov 16, 2021 at 8:47 PM CET, Andrew Morton wrote:
> > Well, why change the default? Surely anyone who cares is altering it
> > at runtime anyway. And if they are not, we should encourage them to
> > do so?
>
> I addressed this question in the original patch's commit message.

Kinda.

We're never going to get this right, are we? The only person who can
decide on a system's appropriate setting is the operator of that
system. Haphazardly increasing the limit every few years mainly reduces
the incentive for people to get this right.

And people who test their software on 5.17 kernels will later find that
it doesn't work on 5.16 and earlier, so they still need to tell their
users to configure their systems appropriately. Until 5.16 is obsolete,
by which time we're looking at increasing the default again.

I don't see how this change gets us closer to the desired state:
getting distros and their users to configure their systems
appropriately.

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Drew DeVault @ 2021-11-17 8:23 UTC
To: Andrew Morton
Cc: Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Jens Axboe, Pavel Begunkov, linux-mm

On Tue Nov 16, 2021 at 10:37 PM CET, Andrew Morton wrote:
> We're never going to get this right, are we? The only person who can
> decide on a system's appropriate setting is the operator of that
> system. [...]
>
> I don't see how this change gets us closer to the desired state:
> getting distros and their users to configure their systems
> appropriately.

Perfect is the enemy of good. This is a very simple change we can make
to improve the status quo, and I think that's worth doing. I do not
have time to develop a more sophisticated solution which steers the
defaults based on memory present, and I definitely don't have the time
to petition every distro to configure a better default for their
particular needs. This is the easiest way to get broad adoption of a
better default.

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: David Hildenbrand @ 2021-11-22 17:11 UTC
To: Andrew Morton, Drew DeVault
Cc: Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Jens Axboe, Pavel Begunkov, linux-mm

On 16.11.21 22:37, Andrew Morton wrote:
> We're never going to get this right, are we? The only person who can
> decide on a system's appropriate setting is the operator of that
> system. Haphazardly increasing the limit every few years mainly
> reduces the incentive for people to get this right.
> [...]
> I don't see how this change gets us closer to the desired state:
> getting distros and their users to configure their systems
> appropriately.

My 2 cents: while we should actually try to avoid new FOLL_LONGTERM
users where possible, we introduce more (IORING_REGISTER_BUFFERS) to be
consumed by ordinary, unprivileged users. These new features, *when
used*, require us to raise the MLOCK_LIMIT.

Secretmem is similar, but for now it rather "replaces" old mlock usage
and IIRC has similarly small memory demands; that might change in the
future, though.

Why is FOLL_LONGTERM bad? Not only does it prevent swapping like mlock
does, the pages are also unmovable in memory, such that they cannot be
moved around, for example, for memory compaction.

Well, I'm not too mad about IORING_REGISTER_BUFFERS; it actually helped
me write a simple reproducer for the COW issues we have in upstream mm,
and it can be quite beneficial in some setups. Still, I think it should
be used with care depending on the actual environment.

So, just because a new feature is around that could be used, does it
mean that we should adjust our kernel default? I'd say in this case,
rather not. Distributions, or much better, the responsible admin,
should make such decisions, knowing the environment and the effect this
could have.

(I know that we can similarly trigger allocation of a lot of unmovable
memory using other means by malicious user space; but that is rather
something to limit or handle in the future IMHO)

-- 
Thanks,

David / dhildenb

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Andrew Dona-Couch @ 2021-11-22 17:55 UTC
To: David Hildenbrand, Andrew Morton, Drew DeVault
Cc: Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Jens Axboe, Pavel Begunkov, linux-mm

Forgive me for jumping in to an already overburdened thread. But can
someone pushing back on this clearly explain the issue with applying
this patch?

The only concerns I've heard are that it doesn't go far enough, and
that another strategy (one that everyone seems to agree would be a fair
bit more effort) could potentially achieve the same goal and then some.
Isn't that exactly what's meant by "don't let perfection be the enemy
of the good"? The saying is not talking about literal perfection: the
idea is that you make progress where you can, and that incremental
progress and broader changes are not necessarily in conflict.

This tiny patch could be a step in the right direction. Why does this
thread need dozens of replies?

Thanks,
Andrew

-- 
We all do better when we all do better. -Paul Wellstone

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: David Hildenbrand @ 2021-11-22 18:26 UTC
To: Andrew Dona-Couch, Andrew Morton, Drew DeVault
Cc: Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Jens Axboe, Pavel Begunkov, linux-mm

On 22.11.21 18:55, Andrew Dona-Couch wrote:
> Forgive me for jumping in to an already overburdened thread. But can
> someone pushing back on this clearly explain the issue with applying
> this patch?

It will allow unprivileged users to easily and even "accidentally"
allocate more unmovable memory than they should in some environments.
Such limits exist for a reason. And there are ways for admins/distros
to tweak these limits if they know what they are doing.

> The only concerns I've heard are that it doesn't go far enough, and
> that another strategy (one that everyone seems to agree would be a
> fair bit more effort) could potentially achieve the same goal and then
> some. [...]
>
> This tiny patch could be a step in the right direction. Why does this
> thread need dozens of replies?

Because it does something controversial. Send controversial patches,
receive many opinions, it's that simple.

This is not a step in the right direction. This is all just trying to
hide the fact that we're exposing FOLL_LONGTERM usage to random
unprivileged users.

Maybe we could instead try getting rid of FOLL_LONGTERM usage and the
memlock limit in io_uring altogether, for example, by using mmu
notifiers. But I'm no expert on the io_uring code.

-- 
Thanks,

David / dhildenb

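[For readers unfamiliar with the interface being suggested here, this
is roughly the shape of an mmu-notifier-based approach: a purely
illustrative sketch against the 5.x notifier API, not io_uring code;
the reg_buf_* names are made up. The hard part, as the next reply
points out, is the synchronization this forces onto the I/O fast path.]

#include <linux/mmu_notifier.h>

/* Called when the mm is about to change mappings in [start, end); a
 * non-FOLL_LONGTERM buffer cache would have to drop or refresh any
 * cached page references overlapping that range before the change
 * proceeds.
 */
static int reg_buf_invalidate_range_start(struct mmu_notifier *mn,
					  const struct mmu_notifier_range *range)
{
	/* ... unpin/invalidate cached pages in [range->start, range->end) ... */
	return 0;
}

static const struct mmu_notifier_ops reg_buf_mn_ops = {
	.invalidate_range_start	= reg_buf_invalidate_range_start,
};

/* Registration would tie the notifier to the task's mm, roughly:
 *
 *	notifier.ops = &reg_buf_mn_ops;
 *	ret = mmu_notifier_register(&notifier, current->mm);
 */
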
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Jens Axboe @ 2021-11-22 19:53 UTC
To: David Hildenbrand, Andrew Dona-Couch, Andrew Morton, Drew DeVault
Cc: Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm

On 11/22/21 11:26 AM, David Hildenbrand wrote:
> On 22.11.21 18:55, Andrew Dona-Couch wrote:
>> Forgive me for jumping in to an already overburdened thread. But can
>> someone pushing back on this clearly explain the issue with applying
>> this patch?
>
> It will allow unprivileged users to easily and even "accidentally"
> allocate more unmovable memory than they should in some environments.
> Such limits exist for a reason. And there are ways for admins/distros
> to tweak these limits if they know what they are doing.

But that's entirely the point: the cases where this change is needed
are already screwed by a distro, and the user is the administrator.
This is _exactly_ the case where things should just work out of the
box. If you're managing farms of servers, yeah you have competent
administration and you can be expected to tweak settings to get the
best experience and performance, but the kernel should provide a sane
default. 64K isn't a sane default.

> This is not a step in the right direction. This is all just trying to
> hide the fact that we're exposing FOLL_LONGTERM usage to random
> unprivileged users.
>
> Maybe we could instead try getting rid of FOLL_LONGTERM usage and the
> memlock limit in io_uring altogether, for example, by using mmu
> notifiers. But I'm no expert on the io_uring code.

You can't use mmu notifiers without impacting the fast path. This isn't
just about io_uring; there are other users of memlock right now (like
bpf), which just makes it even worse.

We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or something
like what was suggested, if that will help move things forward. IMHO
the 32MB machine is mostly a theoretical case, but whatever.

-- 
Jens Axboe

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Matthew Wilcox @ 2021-11-22 20:03 UTC
To: Jens Axboe
Cc: David Hildenbrand, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm

On Mon, Nov 22, 2021 at 12:53:31PM -0700, Jens Axboe wrote:
> We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or
> something like what was suggested, if that will help move things
> forward. IMHO the 32MB machine is mostly a theoretical case, but
> whatever.

I think you mean max(0.1% ram, 64KB). With that change, I agree.

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Jens Axboe @ 2021-11-22 20:04 UTC
To: Matthew Wilcox
Cc: David Hildenbrand, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm

On 11/22/21 1:03 PM, Matthew Wilcox wrote:
> On Mon, Nov 22, 2021 at 12:53:31PM -0700, Jens Axboe wrote:
>> We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or
>> something like what was suggested, if that will help move things
>> forward. IMHO the 32MB machine is mostly a theoretical case, but
>> whatever.
>
> I think you mean max(0.1% ram, 64KB). With that change, I agree.

Heh yes, that is indeed what I meant :-)

-- 
Jens Axboe

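[For concreteness, the suggestion above amounts to something along
these lines; this is a hypothetical sketch only, the function name
mlock_default_limit is made up, and no such patch exists in this
thread. It derives the default from total RAM instead of a hard-coded
constant, with 64 KiB as the floor.]

/* Hypothetical sketch of a RAM-scaled default (not a real patch):
 * 0.1% of total memory, but never less than the historical 64 KiB.
 */
static unsigned long mlock_default_limit(void)
{
	unsigned long bytes = (totalram_pages() << PAGE_SHIFT) / 1000;

	return max(bytes, 64UL * 1024);
}

[On a 16 GiB machine this works out to roughly 16 MiB, close to the
Fedora default mentioned later in the thread.]
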
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: David Hildenbrand @ 2021-11-22 20:08 UTC
To: Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault
Cc: Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm

On 22.11.21 20:53, Jens Axboe wrote:
> On 11/22/21 11:26 AM, David Hildenbrand wrote:
>> It will allow unprivileged users to easily and even "accidentally"
>> allocate more unmovable memory than they should in some environments.
>> Such limits exist for a reason. And there are ways for admins/distros
>> to tweak these limits if they know what they are doing.
>
> But that's entirely the point: the cases where this change is needed
> are already screwed by a distro, and the user is the administrator.
> This is _exactly_ the case where things should just work out of the
> box. If you're managing farms of servers, yeah you have competent
> administration and you can be expected to tweak settings to get the
> best experience and performance, but the kernel should provide a sane
> default. 64K isn't a sane default.

0.1% of RAM isn't either.

>> This is not a step in the right direction. This is all just trying to
>> hide the fact that we're exposing FOLL_LONGTERM usage to random
>> unprivileged users.
>>
>> Maybe we could instead try getting rid of FOLL_LONGTERM usage and the
>> memlock limit in io_uring altogether, for example, by using mmu
>> notifiers. But I'm no expert on the io_uring code.
>
> You can't use mmu notifiers without impacting the fast path. This
> isn't just about io_uring; there are other users of memlock right now
> (like bpf), which just makes it even worse.

1) Do we have a performance evaluation? Did someone try and come up
   with a conclusion on how bad it would be?

2) Could we provide an mmu variant to ordinary users that's just good
   enough but maybe not as fast as what we have today? And limit
   FOLL_LONGTERM to special, privileged users?

3) Just because there are other memlock users is not an excuse. For
   example, VFIO/VDPA have to use it for a reason, because there is no
   way not to use FOLL_LONGTERM.

> We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or
> something like what was suggested, if that will help move things
> forward. IMHO the 32MB machine is mostly a theoretical case, but
> whatever.

1) I'm deeply concerned about large ZONE_MOVABLE and MIGRATE_CMA
   ranges, where FOLL_LONGTERM cannot be used, as that memory is not
   available.

2) With 0.1% RAM it's sufficient to start 1000 processes to break any
   system completely and deeply mess up the MM. Oh my.

No, I don't like this, absolutely not. I neither like raising the
default memlock limit to such high values nor using FOLL_LONGTERM in
cases where it could be avoided for random, unprivileged users.

But I assume this is mostly for the records, because I assume nobody
cares about my opinion here.

-- 
Thanks,

David / dhildenb

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Jens Axboe @ 2021-11-22 20:44 UTC
To: David Hildenbrand, Andrew Dona-Couch, Andrew Morton, Drew DeVault
Cc: Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm

On 11/22/21 1:08 PM, David Hildenbrand wrote:
> On 22.11.21 20:53, Jens Axboe wrote:
>> [...] If you're managing farms of servers, yeah you have competent
>> administration and you can be expected to tweak settings to get the
>> best experience and performance, but the kernel should provide a sane
>> default. 64K isn't a sane default.
>
> 0.1% of RAM isn't either.

No default is perfect, but 0.1% will solve 99% of the problem. And most
likely solve 100% of the problems for the important case, which is
where you want things to Just Work on your distro without doing any
administration. If you're aiming for perfection, it doesn't exist.

>> You can't use mmu notifiers without impacting the fast path. This
>> isn't just about io_uring; there are other users of memlock right now
>> (like bpf), which just makes it even worse.
>
> 1) Do we have a performance evaluation? Did someone try and come up
>    with a conclusion on how bad it would be?

I honestly don't remember the details, I took a look at it about a year
ago due to some unrelated reasons. These days it just pertains to
registered buffers, so it's less of an issue than back then when it
dealt with the rings as well. Hence it might be feasible, I'm certainly
not against anyone looking into it. Easy enough to review and test for
performance concerns.

> 2) Could we provide an mmu variant to ordinary users that's just good
>    enough but maybe not as fast as what we have today? And limit
>    FOLL_LONGTERM to special, privileged users?

If it's not as fast, then it's most likely not good enough though...

> 3) Just because there are other memlock users is not an excuse. For
>    example, VFIO/VDPA have to use it for a reason, because there is no
>    way not to use FOLL_LONGTERM.

It's not an excuse, the statement merely means that the problem is
_worse_ as there are other memlock users.

> 1) I'm deeply concerned about large ZONE_MOVABLE and MIGRATE_CMA
>    ranges, where FOLL_LONGTERM cannot be used, as that memory is not
>    available.
>
> 2) With 0.1% RAM it's sufficient to start 1000 processes to break any
>    system completely and deeply mess up the MM. Oh my.

We're talking per-user limits here. But if you want to talk hyperbole,
then 64K multiplied by some other random number will also allow
everything to be pinned, potentially.

-- 
Jens Axboe

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: David Hildenbrand @ 2021-11-22 21:56 UTC
To: Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault
Cc: Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm

On 22.11.21 21:44, Jens Axboe wrote:
> On 11/22/21 1:08 PM, David Hildenbrand wrote:
>> 0.1% of RAM isn't either.
>
> No default is perfect, but 0.1% will solve 99% of the problem. And
> most likely solve 100% of the problems for the important case, which
> is where you want things to Just Work on your distro without doing any
> administration. If you're aiming for perfection, it doesn't exist.

... and my Fedora is already at 16 MiB *sigh*. And I'm not aiming for
perfection, I'm aiming for as few FOLL_LONGTERM users as possible ;)

> I honestly don't remember the details, I took a look at it about a
> year ago due to some unrelated reasons. These days it just pertains to
> registered buffers, so it's less of an issue than back then when it
> dealt with the rings as well. Hence it might be feasible, I'm
> certainly not against anyone looking into it. Easy enough to review
> and test for performance concerns.

That at least sounds promising.

>> 2) Could we provide an mmu variant to ordinary users that's just good
>>    enough but maybe not as fast as what we have today? And limit
>>    FOLL_LONGTERM to special, privileged users?
>
> If it's not as fast, then it's most likely not good enough though...

There is always a compromise, of course. See, FOLL_LONGTERM is *the
worst* kind of memory allocation thingy you could possibly do to your
MM subsystem. It's absolutely the worst thing you can do to swap and
compaction. I really don't want random feature X to be next and say
"well, io_uring uses it, so I can just use it for max performance and
we'll adjust the memlock limit, who cares!".

>> 3) Just because there are other memlock users is not an excuse. For
>>    example, VFIO/VDPA have to use it for a reason, because there is
>>    no way not to use FOLL_LONGTERM.
>
> It's not an excuse, the statement merely means that the problem is
> _worse_ as there are other memlock users.

Yes, and it will keep getting worse every time we introduce more
FOLL_LONGTERM users that really shouldn't be FOLL_LONGTERM users unless
really required. Again, VFIO/VDPA/RDMA are prime examples, because the
HW forces us to do it. And these are privileged features either way.

>> 2) With 0.1% RAM it's sufficient to start 1000 processes to break any
>>    system completely and deeply mess up the MM. Oh my.
>
> We're talking per-user limits here. But if you want to talk hyperbole,
> then 64K multiplied by some other random number will also allow
> everything to be pinned, potentially.

Right, it's per-user. 0.1% per user FOLL_LONGTERM locked into memory in
the worst case.

-- 
Thanks,

David / dhildenb

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: David Hildenbrand @ 2021-11-23 12:02 UTC
To: Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault
Cc: Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm

>> 2) With 0.1% RAM it's sufficient to start 1000 processes to break any
>>    system completely and deeply mess up the MM. Oh my.
>
> We're talking per-user limits here. But if you want to talk hyperbole,
> then 64K multiplied by some other random number will also allow
> everything to be pinned, potentially.

> Right, it's per-user. 0.1% per user FOLL_LONGTERM locked into memory
> in the worst case.

To make it clear why I keep complaining about FOLL_LONGTERM for
unprivileged users even if we're talking about "only" 0.1% of RAM ...

On x86-64 a 2 MiB THP (IOW pageblock) has 512 sub-pages. If we manage
to FOLL_LONGTERM a single sub-page, we can make the THP unavailable to
the system, meaning we cannot form a THP by compaction/swapping/
migration/whatever at that physical memory area until we unpin that
single page. We essentially "block" a THP from forming at that physical
memory area.

So with a single 4k page we can block one 2 MiB THP. With 0.1% we can,
therefore, block 51.2% of all THP. Theoretically, of course, if the
stars align.

... or if we're malicious or unlucky. I wrote a reproducer this morning
that tries blocking as many THP as it can:

https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/io_uring_thp.c

------------------------------------------------------------------------

Example on my 16 GiB (8096 THP "in theory") notebook with some
applications running in the background.

$ uname -a
Linux t480s 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ ./io_uring_thp
PAGE size: 4096 bytes (sensed)
THP size: 2097152 bytes (sensed)
RLIMIT_MEMLOCK: 16777216 bytes (sensed)
IORING_MAX_REG_BUFFERS: 16384 (guess)
Pages per THP: 512
User can block 4096 THP (8589934592 bytes)
Process can block 4096 THP (8589934592 bytes)
Blocking 1 THP
Blocking 2 THP
...
Blocking 3438 THP
Blocking 3439 THP
Blocking 3440 THP
Blocking 3441 THP
Blocking 3442 THP
... and after a while
Blocking 4093 THP
Blocking 4094 THP
Blocking 4095 THP
Blocking 4096 THP

$ cat /proc/`pgrep io_uring_thp`/status
Name:           io_uring_thp
Umask:          0002
State:          S (sleeping)
[...]
VmPeak:             6496 kB
VmSize:             6496 kB
VmLck:                 0 kB
VmPin:             16384 kB
VmHWM:              3628 kB
VmRSS:              1580 kB
RssAnon:             160 kB
RssFile:            1420 kB
RssShmem:              0 kB
VmData:             4304 kB
VmStk:               136 kB
VmExe:                 8 kB
VmLib:              1488 kB
VmPTE:                48 kB
VmSwap:                0 kB
HugetlbPages:          0 kB
CoreDumping:    0
THP_enabled:    1

$ cat /proc/meminfo
MemTotal:       16250920 kB
MemFree:        11648016 kB
MemAvailable:   11972196 kB
Buffers:           50480 kB
Cached:          1156768 kB
SwapCached:        54680 kB
Active:           704788 kB
Inactive:        3477576 kB
Active(anon):     427716 kB
Inactive(anon):  3207604 kB
Active(file):     277072 kB
Inactive(file):   269972 kB
...
Mlocked:            5692 kB
SwapTotal:       8200188 kB
SwapFree:        7742716 kB
...
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB

Let's see how many contiguous 2M pages we can still get as root:

$ echo 1 > /proc/sys/vm/compact_memory
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
0
$ echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
537
... keep retrying a couple of times
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
583

Let's kill the io_uring process and try again:

$ echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
4766
... keep retrying a couple of times
$ echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
4823

------------------------------------------------------------------------

I'm going to leave judgment of how bad this is or isn't to the educated
reader, and I'll stop spending time on this as I have more important
things to work on. To summarize my humble opinion:

1) I am not against raising the default memlock limit if it's for a
   sane use case. While mlock itself can be somewhat bad for swap,
   FOLL_LONGTERM, which also checks the memlock limit here, is the real
   issue. This patch explicitly states the "IORING_REGISTER_BUFFERS"
   use case, though, and that makes me nervous.

2) Exposing FOLL_LONGTERM to unprivileged users should be avoided as
   best we can; in an ideal world, we wouldn't have it at all; in a
   sub-optimal world we'd have it only for use cases that really
   require it due to HW limitations. Ideally we'd even have yet another
   limit for this, because mlock != FOLL_LONGTERM.

3) IORING_REGISTER_BUFFERS shouldn't use FOLL_LONGTERM for use by
   unprivileged users. We should provide a variant that doesn't rely on
   FOLL_LONGTERM or even rely on the memlock limit.

Sorry to the patch author for bringing it up as a response to the
patch. After all, this patch just does what some distros already do
(many distros even provide higher limits than 8 MiB!). I would be
curious why some distros already have such high values ... and whether
it's already because of IORING_REGISTER_BUFFERS after all.

-- 
Thanks,

David / dhildenb

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Jason Gunthorpe @ 2021-11-23 13:25 UTC
To: David Hildenbrand
Cc: Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm

On Mon, Nov 22, 2021 at 09:08:47PM +0100, David Hildenbrand wrote:
> > You can't use mmu notifiers without impacting the fast path. This
> > isn't just about io_uring; there are other users of memlock right
> > now (like bpf), which just makes it even worse.
>
> 1) Do we have a performance evaluation? Did someone try and come up
>    with a conclusion on how bad it would be?

It needs additional locking between page readers and the mmu notifier.
One of the virtio things does this, and they used RCU on the page
readers and a synchronize_rcu() in the mmu notifier - which I think is
pretty bad.

> 2) Could we provide an mmu variant to ordinary users that's just good
>    enough but maybe not as fast as what we have today? And limit
>    FOLL_LONGTERM to special, privileged users?

rdma has never been privileged

Jason

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: David Hildenbrand @ 2021-11-23 13:39 UTC
To: Jason Gunthorpe
Cc: Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm

>> 2) Could we provide an mmu variant to ordinary users that's just good
>>    enough but maybe not as fast as what we have today? And limit
>>    FOLL_LONGTERM to special, privileged users?
>
> rdma has never been privileged

Feel free to correct me if I'm wrong: it requires special networking
hardware, and the admin/kernel has to prepare the system in a way such
that it can be used.

-- 
Thanks,

David / dhildenb

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
From: Jason Gunthorpe @ 2021-11-23 14:07 UTC
To: David Hildenbrand
Cc: Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm

On Tue, Nov 23, 2021 at 02:39:19PM +0100, David Hildenbrand wrote:
> > rdma has never been privileged
>
> Feel free to correct me if I'm wrong: it requires special networking
> hardware, and the admin/kernel has to prepare the system in a way such
> that it can be used.

Not really, plug in the right PCI card and it works.

"special" is a bit of a reach since almost every NIC sold in the
> 100GB segment supports some RDMA.

Jason

* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-23 14:07 ` Jason Gunthorpe @ 2021-11-23 14:44 ` David Hildenbrand 2021-11-23 17:00 ` Jason Gunthorpe 0 siblings, 1 reply; 58+ messages in thread From: David Hildenbrand @ 2021-11-23 14:44 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On 23.11.21 15:07, Jason Gunthorpe wrote: > On Tue, Nov 23, 2021 at 02:39:19PM +0100, David Hildenbrand wrote: >>> >>>> 2) Could be provide a mmu variant to ordinary users that's just good >>>> enough but maybe not as fast as what we have today? And limit >>>> FOLL_LONGTERM to special, privileged users? >>> >>> rdma has never been privileged >> >> Feel free to correct me if I'm wrong: it requires special networking >> hardware and the admin/kernel has to prepare the system in a way such >> that it can be used. > > Not really, plug in the right PCI card and it works Naive me would have assumed that the right modules have to be loaded (and not blacklisted), that there has to be an rdma service installed and running, that the NIC has to be configured in some way, and that there is some kind of access control which user can actually use which NIC. For example, I would have assume from inside a container it usually wouldn't just work. But I am absolutely not a networking and RDMA expert, so I have to believe what you say and I trust your experience :) So could as well be that on such a "special" (or not so special) systems there should be a way to restrict it to privileged users only. > > "special" is a bit of a reach since almost every NIC sold in the > 100GB > segment supports some RDMA. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-23 14:44 ` David Hildenbrand @ 2021-11-23 17:00 ` Jason Gunthorpe 2021-11-23 17:04 ` David Hildenbrand 2021-11-23 22:04 ` Vlastimil Babka 0 siblings, 2 replies; 58+ messages in thread From: Jason Gunthorpe @ 2021-11-23 17:00 UTC (permalink / raw) To: David Hildenbrand Cc: Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On Tue, Nov 23, 2021 at 03:44:03PM +0100, David Hildenbrand wrote: > On 23.11.21 15:07, Jason Gunthorpe wrote: > > On Tue, Nov 23, 2021 at 02:39:19PM +0100, David Hildenbrand wrote: > >>> > >>>> 2) Could be provide a mmu variant to ordinary users that's just good > >>>> enough but maybe not as fast as what we have today? And limit > >>>> FOLL_LONGTERM to special, privileged users? > >>> > >>> rdma has never been privileged > >> > >> Feel free to correct me if I'm wrong: it requires special networking > >> hardware and the admin/kernel has to prepare the system in a way such > >> that it can be used. > > > > Not really, plug in the right PCI card and it works > > Naive me would have assumed that the right modules have to be loaded > (and not blacklisted), that there has to be an rdma service installed > and running, that the NIC has to be configured in some way, and that > there is some kind of access control which user can actually use which > NIC. Not really, we've worked hard that it works as well as any other HW device. Plug it in and it works. There is no systemd service, or special mandatory configuration, for instance. > For example, I would have assume from inside a container it usually > wouldn't just work. Nope, RDMA follows the net namespaces of its ethernet port, so it just works in containers too. > believe what you say and I trust your experience :) So could as well be > that on such a "special" (or not so special) systems there should be a > way to restrict it to privileged users only. At this point RDMA is about as "special" as people running large ZONE_MOVABLE systems, and the two are going to start colliding heavily. The RDMA VFIO migration driver should be merged soon which makes VMs using this stuff finally practical. Jason ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-23 17:00 ` Jason Gunthorpe @ 2021-11-23 17:04 ` David Hildenbrand 2021-11-23 22:04 ` Vlastimil Babka 1 sibling, 0 replies; 58+ messages in thread From: David Hildenbrand @ 2021-11-23 17:04 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On 23.11.21 18:00, Jason Gunthorpe wrote: > On Tue, Nov 23, 2021 at 03:44:03PM +0100, David Hildenbrand wrote: >> On 23.11.21 15:07, Jason Gunthorpe wrote: >>> On Tue, Nov 23, 2021 at 02:39:19PM +0100, David Hildenbrand wrote: >>>>> >>>>>> 2) Could be provide a mmu variant to ordinary users that's just good >>>>>> enough but maybe not as fast as what we have today? And limit >>>>>> FOLL_LONGTERM to special, privileged users? >>>>> >>>>> rdma has never been privileged >>>> >>>> Feel free to correct me if I'm wrong: it requires special networking >>>> hardware and the admin/kernel has to prepare the system in a way such >>>> that it can be used. >>> >>> Not really, plug in the right PCI card and it works >> >> Naive me would have assumed that the right modules have to be loaded >> (and not blacklisted), that there has to be an rdma service installed >> and running, that the NIC has to be configured in some way, and that >> there is some kind of access control which user can actually use which >> NIC. > > Not really, we've worked hard that it works as well as any other HW > device. Plug it in and it works. > > There is no systemd service, or special mandatory configuration, for > instance. > >> For example, I would have assume from inside a container it usually >> wouldn't just work. > > Nope, RDMA follows the net namespaces of its ethernet port, so it just > works in containers too. > >> believe what you say and I trust your experience :) So could as well be >> that on such a "special" (or not so special) systems there should be a >> way to restrict it to privileged users only. > > At this point RDMA is about as "special" as people running large > ZONE_MOVABLE systems, and the two are going to start colliding > heavily. The RDMA VFIO migration driver should be merged soon which > makes VMs using this stuff finally practical. Sounds like fun. At least we documented it already ;) https://www.kernel.org/doc/html/latest/admin-guide/mm/memory-hotplug.html#zone-movable-sizing-considerations -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-23 17:00 ` Jason Gunthorpe 2021-11-23 17:04 ` David Hildenbrand @ 2021-11-23 22:04 ` Vlastimil Babka 2021-11-23 23:59 ` Jason Gunthorpe 1 sibling, 1 reply; 58+ messages in thread From: Vlastimil Babka @ 2021-11-23 22:04 UTC (permalink / raw) To: Jason Gunthorpe, David Hildenbrand Cc: Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On 11/23/21 18:00, Jason Gunthorpe wrote: > >> believe what you say and I trust your experience :) So could as well be >> that on such a "special" (or not so special) systems there should be a >> way to restrict it to privileged users only. > > At this point RDMA is about as "special" as people running large > ZONE_MOVABLE systems, and the two are going to start colliding > heavily. The RDMA VFIO migration driver should be merged soon which > makes VMs using this stuff finally practical. How does that work? I see the word migration, so does it cause pages to be migrated out of ZONE_MOVABLE before they are pinned? Similarly for io-uring we could be migrating pages to be pinned so that they end up consolidated close together, and prevent pathological situations like in David's reproducer. IIRC that was an idea to do for long-term pins in general. > Jason > ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-23 22:04 ` Vlastimil Babka @ 2021-11-23 23:59 ` Jason Gunthorpe 2021-11-24 8:57 ` David Hildenbrand 0 siblings, 1 reply; 58+ messages in thread From: Jason Gunthorpe @ 2021-11-23 23:59 UTC (permalink / raw) To: Vlastimil Babka Cc: David Hildenbrand, Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On Tue, Nov 23, 2021 at 11:04:04PM +0100, Vlastimil Babka wrote: > On 11/23/21 18:00, Jason Gunthorpe wrote: > > > >> believe what you say and I trust your experience :) So could as well be > >> that on such a "special" (or not so special) systems there should be a > >> way to restrict it to privileged users only. > > > > At this point RDMA is about as "special" as people running large > > ZONE_MOVABLE systems, and the two are going to start colliding > > heavily. The RDMA VFIO migration driver should be merged soon which > > makes VMs using this stuff finally practical. > > How does that work, I see the word migration, so does it cause pages to Sorry I mean what is often called "VM live migration". Typically that cannot be done if a PCI device is assigned to the VM as suspending and the migrating a PCI device to another server is complicated. With forthcoming hardware mlx5 can do this and thus the entire RDMA stack becomes practically usable and performant within a VM. > be migrated out of ZONE_MOVABLE before they are pinned? GUP already does this automatically for FOLL_LONGTERM. > Similarly for io-uring we could be migrating pages to be pinned so that > the end up consolidated close together, and prevent pathologic > situations like in David's reproducer. It is an interesting idea to have GUP do some kind of THP preserving migration. Jason ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-23 23:59 ` Jason Gunthorpe @ 2021-11-24 8:57 ` David Hildenbrand 2021-11-24 13:23 ` Jason Gunthorpe 2021-11-24 14:37 ` Vlastimil Babka 0 siblings, 2 replies; 58+ messages in thread From: David Hildenbrand @ 2021-11-24 8:57 UTC (permalink / raw) To: Jason Gunthorpe, Vlastimil Babka Cc: Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On 24.11.21 00:59, Jason Gunthorpe wrote: > On Tue, Nov 23, 2021 at 11:04:04PM +0100, Vlastimil Babka wrote: >> On 11/23/21 18:00, Jason Gunthorpe wrote: >>> >>>> believe what you say and I trust your experience :) So could as well be >>>> that on such a "special" (or not so special) systems there should be a >>>> way to restrict it to privileged users only. >>> >>> At this point RDMA is about as "special" as people running large >>> ZONE_MOVABLE systems, and the two are going to start colliding >>> heavily. The RDMA VFIO migration driver should be merged soon which >>> makes VMs using this stuff finally practical. >> >> How does that work, I see the word migration, so does it cause pages to > > Sorry I mean what is often called "VM live migration". Typically that > cannot be done if a PCI device is assigned to the VM as suspending and > the migrating a PCI device to another server is complicated. With > forthcoming hardware mlx5 can do this and thus the entire RDMA stack > becomes practically usable and performant within a VM. > >> be migrated out of ZONE_MOVABLE before they are pinned? > > GUP already does this automatically for FOLL_LONGTERM. > >> Similarly for io-uring we could be migrating pages to be pinned so that >> the end up consolidated close together, and prevent pathologic >> situations like in David's reproducer. > > It is an interesting idea to have GUP do some kind of THP preserving > migration. Unfortunately it will only be a band aid AFAIU. I can rewrite my reproducer fairly easily to pin the whole 2M range first, pin a second time only a single page, and then unpin the 2M range, resulting in the very same way to block THP. (I can block some THP less because I always need the possibility to memlock 2M first, though). -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-24 8:57 ` David Hildenbrand @ 2021-11-24 13:23 ` Jason Gunthorpe 2021-11-24 13:25 ` David Hildenbrand 2021-11-24 14:37 ` Vlastimil Babka 1 sibling, 1 reply; 58+ messages in thread From: Jason Gunthorpe @ 2021-11-24 13:23 UTC (permalink / raw) To: David Hildenbrand Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On Wed, Nov 24, 2021 at 09:57:32AM +0100, David Hildenbrand wrote: > Unfortunately it will only be a band aid AFAIU. I can rewrite my > reproducer fairly easily to pin the whole 2M range first, pin a second > time only a single page, and then unpin the 2M range, resulting in the > very same way to block THP. (I can block some THP less because I always > need the possibility to memlock 2M first, though). Oh! The issue is GUP always pins an entire compound, no matter how little the user requests. However, when all the GUP callers do mlock accounting they have no idea how much memory GUP actually pinned and only account mlock on 4K chunks. This is the bug your test is showing - using this accounting error the user can significantly blow past their mlock limit by having GUP pin 2M chunks and then mlock accounting for only 4k chunks. It is a super obnoxious bug to fix, but still just a bug and not some inherent defect in FOLL_LONGTERM. It also says the MLOCK_LIMIT really needs to always be > 1 THP otherwise even a single 4K page may be unpinnable with correct accounting. Jason ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-24 13:23 ` Jason Gunthorpe @ 2021-11-24 13:25 ` David Hildenbrand 2021-11-24 13:28 ` Jason Gunthorpe 0 siblings, 1 reply; 58+ messages in thread From: David Hildenbrand @ 2021-11-24 13:25 UTC (permalink / raw) To: Jason Gunthorpe Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On 24.11.21 14:23, Jason Gunthorpe wrote: > On Wed, Nov 24, 2021 at 09:57:32AM +0100, David Hildenbrand wrote: > >> Unfortunately it will only be a band aid AFAIU. I can rewrite my >> reproducer fairly easily to pin the whole 2M range first, pin a second >> time only a single page, and then unpin the 2M range, resulting in the >> very same way to block THP. (I can block some THP less because I always >> need the possibility to memlock 2M first, though). > > Oh! > > The issue is GUP always pins an entire compound, no matter how little > the user requests. That's a different issue. I make sure to split the compound page before pinning anything :) -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-24 13:25 ` David Hildenbrand @ 2021-11-24 13:28 ` Jason Gunthorpe 2021-11-24 13:29 ` David Hildenbrand 0 siblings, 1 reply; 58+ messages in thread From: Jason Gunthorpe @ 2021-11-24 13:28 UTC (permalink / raw) To: David Hildenbrand Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On Wed, Nov 24, 2021 at 02:25:09PM +0100, David Hildenbrand wrote: > On 24.11.21 14:23, Jason Gunthorpe wrote: > > On Wed, Nov 24, 2021 at 09:57:32AM +0100, David Hildenbrand wrote: > > > >> Unfortunately it will only be a band aid AFAIU. I can rewrite my > >> reproducer fairly easily to pin the whole 2M range first, pin a second > >> time only a single page, and then unpin the 2M range, resulting in the > >> very same way to block THP. (I can block some THP less because I always > >> need the possibility to memlock 2M first, though). > > > > Oh! > > > > The issue is GUP always pins an entire compound, no matter how little > > the user requests. > > That's a different issue. I make sure to split the compound page before > pinning anything :) ?? Where is that done in GUP? Jason ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-24 13:28 ` Jason Gunthorpe @ 2021-11-24 13:29 ` David Hildenbrand 2021-11-24 13:48 ` Jason Gunthorpe 0 siblings, 1 reply; 58+ messages in thread From: David Hildenbrand @ 2021-11-24 13:29 UTC (permalink / raw) To: Jason Gunthorpe Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On 24.11.21 14:28, Jason Gunthorpe wrote: > On Wed, Nov 24, 2021 at 02:25:09PM +0100, David Hildenbrand wrote: >> On 24.11.21 14:23, Jason Gunthorpe wrote: >>> On Wed, Nov 24, 2021 at 09:57:32AM +0100, David Hildenbrand wrote: >>> >>>> Unfortunately it will only be a band aid AFAIU. I can rewrite my >>>> reproducer fairly easily to pin the whole 2M range first, pin a second >>>> time only a single page, and then unpin the 2M range, resulting in the >>>> very same way to block THP. (I can block some THP less because I always >>>> need the possibility to memlock 2M first, though). >>> >>> Oh! >>> >>> The issue is GUP always pins an entire compound, no matter how little >>> the user requests. >> >> That's a different issue. I make sure to split the compound page before >> pinning anything :) > > ?? Where is that done in GUP? It's done in my reproducer manually. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-24 13:29 ` David Hildenbrand @ 2021-11-24 13:48 ` Jason Gunthorpe 2021-11-24 14:14 ` David Hildenbrand 0 siblings, 1 reply; 58+ messages in thread From: Jason Gunthorpe @ 2021-11-24 13:48 UTC (permalink / raw) To: David Hildenbrand Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On Wed, Nov 24, 2021 at 02:29:38PM +0100, David Hildenbrand wrote: > On 24.11.21 14:28, Jason Gunthorpe wrote: > > On Wed, Nov 24, 2021 at 02:25:09PM +0100, David Hildenbrand wrote: > >> On 24.11.21 14:23, Jason Gunthorpe wrote: > >>> On Wed, Nov 24, 2021 at 09:57:32AM +0100, David Hildenbrand wrote: > >>> > >>>> Unfortunately it will only be a band aid AFAIU. I can rewrite my > >>>> reproducer fairly easily to pin the whole 2M range first, pin a second > >>>> time only a single page, and then unpin the 2M range, resulting in the > >>>> very same way to block THP. (I can block some THP less because I always > >>>> need the possibility to memlock 2M first, though). > >>> > >>> Oh! > >>> > >>> The issue is GUP always pins an entire compound, no matter how little > >>> the user requests. > >> > >> That's a different issue. I make sure to split the compound page before > >> pinning anything :) > > > > ?? Where is that done in GUP? > > It's done in my reproducer manually. Aren't there many ways for hostile unpriv userspace to cause memory fragmentation? You are picking on pinning here, but any approach that forces the kernel to make a kalloc on a THP subpage would do just as well. Arguably if we want to point to an issue here it is in MADV_FREE/etc that is the direct culprit in allowing userspace to break up THPs and then trigger fragmentation. If the objective is to prevent DOS of THP then MADV_FREE should conserve the THP and migrate the subpages to non-THP memory. FOLL_LONGTERM is not the issue here. Jason ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-24 13:48 ` Jason Gunthorpe @ 2021-11-24 14:14 ` David Hildenbrand 2021-11-24 15:34 ` Jason Gunthorpe 0 siblings, 1 reply; 58+ messages in thread From: David Hildenbrand @ 2021-11-24 14:14 UTC (permalink / raw) To: Jason Gunthorpe Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On 24.11.21 14:48, Jason Gunthorpe wrote: > On Wed, Nov 24, 2021 at 02:29:38PM +0100, David Hildenbrand wrote: >> On 24.11.21 14:28, Jason Gunthorpe wrote: >>> On Wed, Nov 24, 2021 at 02:25:09PM +0100, David Hildenbrand wrote: >>>> On 24.11.21 14:23, Jason Gunthorpe wrote: >>>>> On Wed, Nov 24, 2021 at 09:57:32AM +0100, David Hildenbrand wrote: >>>>> >>>>>> Unfortunately it will only be a band aid AFAIU. I can rewrite my >>>>>> reproducer fairly easily to pin the whole 2M range first, pin a second >>>>>> time only a single page, and then unpin the 2M range, resulting in the >>>>>> very same way to block THP. (I can block some THP less because I always >>>>>> need the possibility to memlock 2M first, though). >>>>> >>>>> Oh! >>>>> >>>>> The issue is GUP always pins an entire compound, no matter how little >>>>> the user requests. >>>> >>>> That's a different issue. I make sure to split the compound page before >>>> pinning anything :) >>> >>> ?? Where is that done in GUP? >> >> It's done in my reproducer manually. > > Aren't there many ways for hostile unpriv userspace to cause memory > fragmentation? You are picking on pinning here, but any approach that > forces the kernel to make a kalloc on a THP subpage would do just as > well. I'm not aware of any where you can fragment 50% of all pageblocks in the system as an unprivileged user essentially consuming almost no memory and essentially staying inside well-defined memlock limits. But sure if there are "many" people will be able to come up with at least one comparable thing. I'll be happy to learn. > > Arguably if we want to point to an issue here it is in MADV_FREE/etc > that is the direct culprit in allowing userspace to break up THPs and > then trigger fragmentation. > > If the objective is to prevent DOS of THP then MADV_FREE should > conserve the THP and migrate the subpages to non-THP > memory. > > FOLL_LONGTERM is not the issue here. Thanks Jason for the discussion but this is where I'll opt out for now because we seem to strongly disagree and as I said: "I'm going to leave judgment how bad this is or isn't to the educated reader, and I'll stop spending time on this as I have more important things to work on." But I'm going to leave one last comment to eventually give you a different perspective: after MADV_DONTNEED the compound page sits on the deferred split queue and will get split either way soon. People are right now discussion upstream to even split synchronously, which would move MADV_FREE out of the picture completely. My position that FOLL_LONGTERM for unprivileged users is a strong no-go stands as it is. Not MADV_FREE speeding up the compound page split in my reproducer. Not MADV_DONTNEED allowing us to zap parts of a THP (I could even have just used munmap or even mmap(MAP_FIXED)). -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-24 14:14 ` David Hildenbrand @ 2021-11-24 15:34 ` Jason Gunthorpe 2021-11-24 16:43 ` David Hildenbrand 0 siblings, 1 reply; 58+ messages in thread From: Jason Gunthorpe @ 2021-11-24 15:34 UTC (permalink / raw) To: David Hildenbrand Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On Wed, Nov 24, 2021 at 03:14:00PM +0100, David Hildenbrand wrote: > I'm not aware of any where you can fragment 50% of all pageblocks in the > system as an unprivileged user essentially consuming almost no memory > and essentially staying inside well-defined memlock limits. But sure if > there are "many" people will be able to come up with at least one > comparable thing. I'll be happy to learn. If the concern is that THP's can be DOS'd then any avenue that renders the system out of THPs is a DOS attack vector. Including all the normal workloads that people run and already complain that THPs get exhausted. A hostile userspace can only quicken this process. > My position that FOLL_LONGTERM for unprivileged users is a strong no-go > stands as it is. As this basically excludes long standing pre-existing things like RDMA, XDP, io_uring, and more I don't think this can be the general answer for mm, sorry. Sure, lets stop now since I don't think we can agree. Jason ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-24 15:34 ` Jason Gunthorpe @ 2021-11-24 16:43 ` David Hildenbrand 2021-11-24 18:35 ` Jason Gunthorpe 2021-11-24 18:37 ` David Hildenbrand 0 siblings, 2 replies; 58+ messages in thread From: David Hildenbrand @ 2021-11-24 16:43 UTC (permalink / raw) To: Jason Gunthorpe Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On 24.11.21 16:34, Jason Gunthorpe wrote: > On Wed, Nov 24, 2021 at 03:14:00PM +0100, David Hildenbrand wrote: > >> I'm not aware of any where you can fragment 50% of all pageblocks in the >> system as an unprivileged user essentially consuming almost no memory >> and essentially staying inside well-defined memlock limits. But sure if >> there are "many" people will be able to come up with at least one >> comparable thing. I'll be happy to learn. > > If the concern is that THP's can be DOS'd then any avenue that renders > the system out of THPs is a DOS attack vector. Including all the > normal workloads that people run and already complain that THPs get > exhausted. > > A hostile userspace can only quicken this process. We can not only fragment THP but also easily smaller compound pages, with less impact though (well, as long as people want more than 0.1% per user ...). We want to make more excessive use of THP; the whole folio work is about using THP. Some people are even working on increasing the MAX_ORDER and introduce gigantic THP. And here we are having mechanisms available to unprivileged users to just sabotage the very thing at its core extremely easily. Personally, I think this is very bad, but that's just my humble opinion. > >> My position that FOLL_LONGTERM for unprivileged users is a strong no-go >> stands as it is. > > As this basically excludes long standing pre-existing things like > RDMA, XDP, io_uring, and more I don't think this can be the general > answer for mm, sorry. Let's think about options to restrict FOLL_LONGTERM usage: One option would be to add toggle(s) (e.g., kernel cmdline options) to make relevant mechanisms (or even FOLL_LONGTERM itself) privileged. The admin can opt in if unprivileged users should have that capability. A distro might overwrite the default and set it to "on". I'm not completely happy about that. Another option would be not accounting FOLL_LONGTERM as RLIMIT_MEMLOCK, but instead as something that explicitly matches the differing semantics. We could have a limit for privileged and one for unprivileged users. The default in the kernel could be 0 but an admin/system can overwrite it to opt in and a distro might apply different rules. Yes, we're back to the original question about limits, but now with the thought that FOLL_LONGTERM really is different than mlock and potentially more dangerous. At the same time, eventually work on proper alternatives with mmu notifiers (and possibly without the any such limits) where possible and required. (I assume it's hardly possible for RDMA because of the way the hardware works) Just some ideas, open for alternatives. I know that for the cases where we want it to "just work" for unprivileged users but cannot even have alternative implementations, this is bad. > > Sure, lets stop now since I don't think we can agree. Don't get me wrong, I really should be working on other stuff, so I have limited brain capacity and time :) OTOH I'm willing to help at least discuss alternatives. 
Let's think about realistic alternatives to keep FOLL_LONGTERM for any user working (that would tackle the extreme fragmentation issue at least, ignoring e.g., other fragmentation we can trigger with FOLL_LONGTERM or ZONE_MOVABLE/MIGRATE_CMA): The nasty thing really is splitting a compound page and then pinning some pages, even if it's pinning the complete compound range. Ideally, we'd defer any action to the time we actually FOLL_LONGTERM pin a page. a) I think we cannot migrate pages when splitting the PMD (e.g., unmap, MADV_DONTNEED, swap?, page compaction?). User space can just pin the compound page to block migration. b) We might migrate pages when splitting the compound page. In split_huge_page_to_list() we know that we have nobody pinning the page. I did not check if it's possible. There might be cases where it's not immediately clear if it's possible (e.g., inside shrink_page_list()) It would mean that we would migrate pages essentially any time we split a compound page because there could be someone FOLL_LONGTERM pinning the page later. Usually we'd expect page compaction to fix this up on actual demand. I'd call this sub-optimal. c) We migrate any time someone FOLL_LONGTERM pins a page and the page is not pinned yet -- because it might have been a split compound page. I think we can agree that that's not an option :) d) We remember if a page was part of a compound page and was not freed yet. If we FOLL_LONGTERM such a page, we migrate it. Unfortunately, we're short on pageflags for anon pages I think. Hm, alternatives? -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-24 16:43 ` David Hildenbrand @ 2021-11-24 18:35 ` Jason Gunthorpe 2021-11-24 19:09 ` David Hildenbrand 2021-11-24 18:37 ` David Hildenbrand 1 sibling, 1 reply; 58+ messages in thread From: Jason Gunthorpe @ 2021-11-24 18:35 UTC (permalink / raw) To: David Hildenbrand Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On Wed, Nov 24, 2021 at 05:43:58PM +0100, David Hildenbrand wrote: > On 24.11.21 16:34, Jason Gunthorpe wrote: > > On Wed, Nov 24, 2021 at 03:14:00PM +0100, David Hildenbrand wrote: > > > >> I'm not aware of any where you can fragment 50% of all pageblocks in the > >> system as an unprivileged user essentially consuming almost no memory > >> and essentially staying inside well-defined memlock limits. But sure if > >> there are "many" people will be able to come up with at least one > >> comparable thing. I'll be happy to learn. > > > > If the concern is that THP's can be DOS'd then any avenue that renders > > the system out of THPs is a DOS attack vector. Including all the > > normal workloads that people run and already complain that THPs get > > exhausted. > > > > A hostile userspace can only quicken this process. > > We can not only fragment THP but also easily smaller compound pages, > with less impact though (well, as long as people want more than 0.1% per > user ...). My point is as long as userspace can drive this fragmentation, by any means, we can never have DOS proof higher order pages, so lets not worry so much about one of many ways to create fragmentation. > >> My position that FOLL_LONGTERM for unprivileged users is a strong no-go > >> stands as it is. > > > > As this basically excludes long standing pre-existing things like > > RDMA, XDP, io_uring, and more I don't think this can be the general > > answer for mm, sorry. > > Let's think about options to restrict FOLL_LONGTERM usage: Which gives me the view that we should be talking about how to make high order pages completely DOS proof, not about FOLL_LONGTERM. To me that is exactly what ZONE_MOVABLE strives to achieve, and I think anyone who cares about QOS around THP must include ZONE_MOVABLE in their solution. In all of this I am thinking back to the discussion about the 1GB THP proposal which was resoundly shot down on the grounds that 2MB THP *doesn't work* today due to the existing fragmentation problems. > Another option would be not accounting FOLL_LONGTERM as RLIMIT_MEMLOCK, > but instead as something that explicitly matches the differing > semantics. Also a good idea, someone who cares about this should really put pinned pages into the cgroup machinery (with correct accounting!) > At the same time, eventually work on proper alternatives with mmu > notifiers (and possibly without the any such limits) where possible > and required. mmu_notifiers is also bad, it just offends a different group of MM concerns :) Something like io_ring is registering a bulk amount of memory and then doing some potentially long operations against it. So to use a mmu_notifier scheme you'd have to block the mmu_notifier invalidate_range_start until all the operations touching the memory finish (and suspend new operations at the same time!). Blocking the notifier like this locks up the migration/etc threads completely, and is destructive to the OOM reclaim. At least with a pinned page those threads don't even try to touch it instead of getting stuck up. 
> Don't get me wrong, I really should be working on other stuff, so I have > limited brain capacity and time :) OTOH I'm willing to help at least > discuss alternatives. Haha, me too.. Jason ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-24 18:35 ` Jason Gunthorpe @ 2021-11-24 19:09 ` David Hildenbrand 2021-11-24 23:11 ` Jason Gunthorpe 0 siblings, 1 reply; 58+ messages in thread From: David Hildenbrand @ 2021-11-24 19:09 UTC (permalink / raw) To: Jason Gunthorpe Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On 24.11.21 19:35, Jason Gunthorpe wrote: > On Wed, Nov 24, 2021 at 05:43:58PM +0100, David Hildenbrand wrote: >> On 24.11.21 16:34, Jason Gunthorpe wrote: >>> On Wed, Nov 24, 2021 at 03:14:00PM +0100, David Hildenbrand wrote: >>> >>>> I'm not aware of any where you can fragment 50% of all pageblocks in the >>>> system as an unprivileged user essentially consuming almost no memory >>>> and essentially staying inside well-defined memlock limits. But sure if >>>> there are "many" people will be able to come up with at least one >>>> comparable thing. I'll be happy to learn. >>> >>> If the concern is that THP's can be DOS'd then any avenue that renders >>> the system out of THPs is a DOS attack vector. Including all the >>> normal workloads that people run and already complain that THPs get >>> exhausted. >>> >>> A hostile userspace can only quicken this process. >> >> We can not only fragment THP but also easily smaller compound pages, >> with less impact though (well, as long as people want more than 0.1% per >> user ...). > > My point is as long as userspace can drive this fragmentation, by any > means, we can never have DOS proof higher order pages, so lets not > worry so much about one of many ways to create fragmentation. > That would be giving up on compound pages (hugetlbfs, THP, ...) on any current Linux system that does not use ZONE_MOVABLE -- which is not something I am not willing to buy into, just like our customers ;) See my other mail, the upstream version of my reproducer essentially shows what FOLL_LONGTERM is currently doing wrong with pageblocks. And at least to me that's an interesting insight :) I agree that the more extreme scenarios I can construct are a secondary concern. But my upstream reproducer just highlights what can easily happen in reality. >>>> My position that FOLL_LONGTERM for unprivileged users is a strong no-go >>>> stands as it is. >>> >>> As this basically excludes long standing pre-existing things like >>> RDMA, XDP, io_uring, and more I don't think this can be the general >>> answer for mm, sorry. >> >> Let's think about options to restrict FOLL_LONGTERM usage: > > Which gives me the view that we should be talking about how to make > high order pages completely DOS proof, not about FOLL_LONGTERM. Sure, one step at a time ;) > > To me that is exactly what ZONE_MOVABLE strives to achieve, and I > think anyone who cares about QOS around THP must include ZONE_MOVABLE > in their solution. For 100% yes. > > In all of this I am thinking back to the discussion about the 1GB THP > proposal which was resoundly shot down on the grounds that 2MB THP > *doesn't work* today due to the existing fragmentation problems. The point that "2MB THP" doesn't work is just wrong. pageblocks do their job very well, but we can end up in corner case situations where more and more pageblocks are getting fragmented. And people are constantly improving these corner cases (e.g. proactive compaction). 
Usually you have to allocate *a lot* of memory and put the system under extreme memory pressure, such that unmovable allocations spill into movable pageblocks and the other way around. The thing about my reproducer is that it does that without any memory pressure, and that is the BIG difference to everything else we have in that regard. You can have an idle 1TiB system running my reproducer and it will fragment half of all pageblocks in the system while mlocking ~ 1GiB. And that highlights the real issue IMHO. The 1 GB THP project is still going on BTW. > >> Another option would be not accounting FOLL_LONGTERM as RLIMIT_MEMLOCK, >> but instead as something that explicitly matches the differing >> semantics. > > Also a good idea, someone who cares about this should really put > pinned pages into the cgroup machinery (with correct accounting!) > >> At the same time, eventually work on proper alternatives with mmu >> notifiers (and possibly without the any such limits) where possible >> and required. > > mmu_notifiers is also bad, it just offends a different group of MM > concerns :) Yeah, I know, locking nightmare. > > Something like io_ring is registering a bulk amount of memory and then > doing some potentially long operations against it. The individual operations it performs are comparable to O_DIRECT I think -- but I'm no expert. > > So to use a mmu_notifier scheme you'd have to block the mmu_notifier > invalidate_range_start until all the operations touching the memory > finish (and suspend new operations at the same time!). > > Blocking the notifier like this locks up the migration/etc threads > completely, and is destructive to the OOM reclaim. > > At least with a pinned page those threads don't even try to touch it > instead of getting stuck up. Yes, if only we'd be pinning for a restricted amount of time ... -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 58+ messages in thread
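The arithmetic behind that 1 TiB example, assuming the common x86-64 defaults of 2 MiB pageblocks and 4 KiB base pages, with one long-term pin landing in every second pageblock:

#include <stdio.h>

int main(void)
{
	unsigned long long ram = 1ULL << 40;			/* 1 TiB of RAM */
	unsigned long long pageblock = 2ULL << 20;		/* 2 MiB pageblocks */
	unsigned long long page = 4096;				/* 4 KiB base pages */
	unsigned long long fragmented = ram / pageblock / 2;	/* half of all pageblocks */

	printf("%llu pageblocks fragmented while pinning only %llu MiB\n",
	       fragmented, fragmented * page >> 20);
	return 0;
}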
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-24 19:09 ` David Hildenbrand @ 2021-11-24 23:11 ` Jason Gunthorpe 2021-11-30 15:52 ` David Hildenbrand 0 siblings, 1 reply; 58+ messages in thread From: Jason Gunthorpe @ 2021-11-24 23:11 UTC (permalink / raw) To: David Hildenbrand Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On Wed, Nov 24, 2021 at 08:09:42PM +0100, David Hildenbrand wrote: > That would be giving up on compound pages (hugetlbfs, THP, ...) on any > current Linux system that does not use ZONE_MOVABLE -- which is not > something I am not willing to buy into, just like our customers ;) So we have ZONE_MOVABLE but users won't use it? Then why is the solution to push the same kinds of restrictions as ZONE_MOVABLE on to ZONE_NORMAL? > See my other mail, the upstream version of my reproducer essentially > shows what FOLL_LONGTERM is currently doing wrong with pageblocks. And > at least to me that's an interesting insight :) Hmm. For your reproducer it would be nice if we could cgroup control the # of page blocks a cgroup has pinned. Focusing on # pages pinned is clearly the wrong metric, I suggested the whole compound earlier, but your point about the entire page block being ruined makes sense too. It means pinned pages will have to be migrated to already ruined page blocks the cgroup owns, which is a more controlled version of the FOLL_LONGTERM migration you have been thinking about. This would effectively limit the fragmentation a hostile process group can create. If we further treated unmovable cgroup charged kernel allocations as 'pinned' and routed them to the pinned page blocks it starts to look really interesting. Kill the cgroup, get all your THPs back? Fragmentation cannot extend past the cgroup? ie there are lots of batch workloads that could be interesting there - wrap the batch in a cgroup, run it, then kill everything and since the cgroup gives some lifetime clustering to the allocator you get a lot less fragmentation when the batch is finished, so the next batch gets more THPs, etc. There is also sort of an interesting optimization opportunity - many FOLL_LONGTERM users would be happy to spend more time pinning to get nice contiguous memory ranges. Might help convince people that the extra pin time for migrations is worthwhile. > > Something like io_ring is registering a bulk amount of memory and then > doing some potentially long operations against it. > > The individual operations it performs are comparable to O_DIRECT I think Yes, and O_DIRECT can take 10s of seconds in troubled cases with IO timeouts and things. Plus io_uring is worse as the buffer is potentially shared by many in-flight ops and you'd have to block new ops on the buffer and flush all running ops before any mapping change can happen, all while holding up a mmu notifier. Not only is it bad for mm subsystem operations, but would significantly harm io_uring performance if a migration hits. So, I really don't like abusing mmu notifiers for stuff like this. I didn't like it in virtio either :) Jason ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-24 23:11 ` Jason Gunthorpe @ 2021-11-30 15:52 ` David Hildenbrand 0 siblings, 0 replies; 58+ messages in thread From: David Hildenbrand @ 2021-11-30 15:52 UTC (permalink / raw) To: Jason Gunthorpe Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm (sorry, was busy working on other stuff) >> That would be giving up on compound pages (hugetlbfs, THP, ...) on any >> current Linux system that does not use ZONE_MOVABLE -- which is not >> something I am not willing to buy into, just like our customers ;) > > So we have ZONE_MOVABLE but users won't use it? It's mostly used in the memory hot(un)plug context and we'll see growing usage there in the near future (mostly due to dax/kmem, virtio-mem). One has to be very careful how to size ZONE_MOVABLE, though, and it's incompatible with various use cases (even huge pages on some architectures are not movable and cannot be placed on ZONE_MOVABLE ...). That's why we barely see it getting used automatically outside of memory hot(un)plug context or when explicitly setup by the admin for a well fine-tuned system. > > Then why is the solution to push the same kinds of restrictions as > ZONE_MOVABLE on to ZONE_NORMAL? On any zone except ZONE_DEVICE to be precise. Defragmentation is one of the main reasons we have pageblocks after all -- besides CMA and page isolation. If we don't care about de-fragmentation we could just squash MIGRATE_MOVABLE, MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE into a single type. But after all that's the only thing that provides us with THP in most setups out there. Note that some people (IIRC Mel) even proposed to remove ZONE_MOVABLE and instead have "sticky" MIGRATE_MOVABLE pageblocks, meaning MIGRATE_MOVABLE pageblocks that cannot be converted to a different type or stolen from -- which would mimic the same thing as the pageblocks we essentially have in ZONE_MOVABLE. > >> See my other mail, the upstream version of my reproducer essentially >> shows what FOLL_LONGTERM is currently doing wrong with pageblocks. And >> at least to me that's an interesting insight :) > > Hmm. To your reproducer it would be nice if we could cgroup control > the # of page blocks a cgroup has pinned. Focusing on # pages pinned > is clearly the wrong metric, I suggested the whole compound earlier, > but your point about the entire page block being ruined makes sense > too. # pages pinned is part of the story, but yes, "pinned something inside a pageblocks" is a better metric. I would think that this might be complicated to track, though ... especially once we have multiple cgroups pinning inside a single pageblock. Hm ... > > It means pinned pages will have be migrated to already ruined page > blocks the cgroup owns, which is a more controlled version of the > FOLL_LONGTERM migration you have been thinking about. MIGRATE_UNMOVABLE pageblocks are already ruined. But we'd need some way to manage/charge pageblocks per cgroup I guess? that sounds very interesting. > > This would effectively limit the fragmentation a hostile process group > can create. If we further treated unmovable cgroup charged kernel > allocations as 'pinned' and routed them to the pinned page blocks it > start to look really interesting. Kill the cgroup, get all your THPs > back? Fragmentation cannot extend past the cgroup? So essentially any accounted unmovable kernel allocation (e.g., page tables, secretmem, ... 
) would try to be placed on a MIGRATE_UNMOVABLE pageblock "charged" to the respective cgroup? > > ie there are lots of batch workloads that could be interesting there - > wrap the batch in a cgroup, run it, then kill everything and since the > cgroup gives some lifetime clustering to the allocator you get a lot > less fragmentation when the batch is finished, so the next batch gets > more THPs, etc. > > There is also sort of an interesting optimization opportunity - many > FOLL_LONGTERM users would be happy to spend more time pinning to get > nice contiguous memory ranges. Might help convince people that the > extra pin time for migrations is worthwhile. Indeed. And fortunately, huge page users (heavily used in vfio context and for VMs) wouldn't be affected because they only pin huge pages and there is nothing to migrate then (well, excluding MIGRATE_CMA and ZONE_MOVABLE what we have already, of course). > >>> Something like io_ring is registering a bulk amount of memory and then >>> doing some potentially long operations against it. >> >> The individual operations it performs are comparable to O_DIRECT I think > > Yes, and O_DIRECT can take 10s's of seconds in troubled cases with IO > timeouts and things. > I might be wrong about O_DIRECT semantics, though. Staring at fs/io_uring.c I don't really have a clue how they are getting used. I assume they are getting used for DMA directly. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-24 16:43 ` David Hildenbrand 2021-11-24 18:35 ` Jason Gunthorpe @ 2021-11-24 18:37 ` David Hildenbrand 1 sibling, 0 replies; 58+ messages in thread From: David Hildenbrand @ 2021-11-24 18:37 UTC (permalink / raw) To: Jason Gunthorpe Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm, Mike Rapoport, Pavel Tatashin, [email protected], Michal Hocko On 24.11.21 17:43, David Hildenbrand wrote: > On 24.11.21 16:34, Jason Gunthorpe wrote: >> On Wed, Nov 24, 2021 at 03:14:00PM +0100, David Hildenbrand wrote: >> >>> I'm not aware of any where you can fragment 50% of all pageblocks in the >>> system as an unprivileged user essentially consuming almost no memory >>> and essentially staying inside well-defined memlock limits. But sure if >>> there are "many" people will be able to come up with at least one >>> comparable thing. I'll be happy to learn. >> >> If the concern is that THP's can be DOS'd then any avenue that renders >> the system out of THPs is a DOS attack vector. Including all the >> normal workloads that people run and already complain that THPs get >> exhausted. >> >> A hostile userspace can only quicken this process. > > We can not only fragment THP but also easily smaller compound pages, > with less impact though (well, as long as people want more than 0.1% per > user ...). > > We want to make more excessive use of THP; the whole folio work is about > using THP. Some people are even working on increasing the MAX_ORDER and > introduce gigantic THP. > > And here we are having mechanisms available to unprivileged users to > just sabotage the very thing at its core extremely easily. Personally, I > think this is very bad, but that's just my humble opinion. > >> >>> My position that FOLL_LONGTERM for unprivileged users is a strong no-go >>> stands as it is. >> >> As this basically excludes long standing pre-existing things like >> RDMA, XDP, io_uring, and more I don't think this can be the general >> answer for mm, sorry. > > Let's think about options to restrict FOLL_LONGTERM usage: > > One option would be to add toggle(s) (e.g., kernel cmdline options) to > make relevant mechanisms (or even FOLL_LONGTERM itself) privileged. The > admin can opt in if unprivileged users should have that capability. A > distro might overwrite the default and set it to "on". I'm not > completely happy about that. > > Another option would be not accounting FOLL_LONGTERM as RLIMIT_MEMLOCK, > but instead as something that explicitly matches the differing > semantics. We could have a limit for privileged and one for unprivileged > users. The default in the kernel could be 0 but an admin/system can > overwrite it to opt in and a distro might apply different rules. Yes, > we're back to the original question about limits, but now with the > thought that FOLL_LONGTERM really is different than mlock and > potentially more dangerous. > > At the same time, eventually work on proper alternatives with mmu > notifiers (and possibly without the any such limits) where possible and > required. (I assume it's hardly possible for RDMA because of the way the > hardware works) > > Just some ideas, open for alternatives. I know that for the cases where > we want it to "just work" for unprivileged users but cannot even have > alternative implementations, this is bad. > >> >> Sure, lets stop now since I don't think we can agree. 
> > Don't get me wrong, I really should be working on other stuff, so I have > limited brain capacity and time :) OTOH I'm willing to help at least > discuss alternatives. > > > Let's think about realistic alternatives to keep FOLL_LONGTERM for any > user working (that would tackle the extreme fragmentation issue at > least, ignoring e.g., other fragmentation we can trigger with > FOLL_LONGTERM or ZONE_MOVABLE/MIGRATE_CMA): > > The nasty thing really is splitting a compound page and then pinning > some pages, even if it's pinning the complete compound range. Ideally, > we'd defer any action to the time we actually FOLL_LONGTERM pin a page. > > > a) I think we cannot migrate pages when splitting the PMD (e.g., unmap, > MADV_DONTNEED, swap?, page compaction?). User space can just pin the > compound page to block migration. > > b) We might migrate pages when splitting the compound page. In > split_huge_page_to_list() we know that we have nobody pinning the page. > I did not check if it's possible. There might be cases where it's not > immediately clear if it's possible (e.g., inside shrink_page_list()) > > It would mean that we would migrate pages essentially any time we split > a compound page because there could be someone FOLL_LONGTERM pinning the > page later. Usually we'd expect page compaction to fix this up on actual > demand. I'd call this sub-optimal. > > c) We migrate any time someone FOLL_LONGTERM pins a page and the page is > not pinned yet -- because it might have been a split compound page. I > think we can agree that that's not an option :) > > d) We remember if a page was part of a compound page and was not freed > yet. If we FOLL_LONGTERM such a page, we migrate it. Unfortunately, > we're short on pageflags for anon pages I think. > > Hm, alternatives? > And while thinking about the other "non malicious" fragmentation issues -- especially triggering what my reproducer triggered by pure luck simply because our pages we're pinning are scattered all over pageblocks -- and remembering what Vlastimil said regarding grouping, I finally understood why FOLL_LONGTERM is slightly wrong as is. We allocate user pages with GFP_HIGHUSER_MOVABLE: "``GFP_HIGHUSER_MOVABLE`` does not require that allocated memory will be directly accessible by the kernel and implies that the data is movable." This holds true for mlock (except when migration is disabled for RT systems, which is a special case already). This does not hold true once FOLL_LONGTERM turns the pages essentially unmovable. We tried to fix some of that fallout by migrating such pages when they are residing on MIGRATE_CMA and ZONE_MOVABLE before FOLL_LONGTERM. It's only part of the story, because we're fragmenting each and ever !MIGRATE_UNMOVABLE pageblock with unmovable data. If we're unlucky 50% (with the 0.1% RAM rule) of all MIGRATE_MOVABLE pageblocks in the system. I had the exact same discussion ("allocating unmovable data for user space") with Mike regarding "secretmem", resulting in us using GFP_HIGHUSER for allocation instead, and with Intel folks just now regarding unmovable fd-based guest mappings. a) "unmovable memory for user space" really is different from mlock (RLIMIT_MEMLOCK). I think we should have much better control over the amount of unmovable memory we directly let user space allocate for mapping into the process page tables. A separate RLIMIT sounds reasonable, and I think I discussed that with Mike already briefly during secretmem discussions, and we decided to defer introducing something like that. 
b) To avoid fragmenting !MIGRATE_UNMOVABLE pageblocks, we would have to migrate pages into MIGRATE_UNMOVABLE pageblocks before FOLL_LONGTERM. If we're pinning a complete pageblock, we're essentially converting that pageblock to MIGRATE_UNMOVABLE.

c) When unpinning, theoretically we would want to migrate the now again movable page out of a MIGRATE_UNMOVABLE pageblock if there is sufficient memory.

We have right now the following, which highlights the issue:

/* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
static inline bool is_pinnable_page(struct page *page)
{
	return !(is_zone_movable_page(page) || is_migrate_cma_page(page)) ||
		is_zero_pfn(page_to_pfn(page));
}

These things would at least make FOLL_LONGTERM do something reasonable with respect to pageblocks in most cases and would let an admin have control over how much "unmovable allocations" user space can allocate. It sure comes with a price when FOLL_LONGTERM pinning a page. Further, it would let an admin have better control over how much unmovable data user space can allocate directly for mapping into user space (excluding other unmovable allocations user space can trigger, of course -- but to me there is a difference between "I am malicious and want to hurt the kernel by allocating a lot of unmovable memory" and "I am a sane user and want to make use of features that are a performance improvement"). It wouldn't cover the extended malicious case I presented (first pin the whole pageblock, then pin a single page, then unpin the whole pageblock), they would simply turn 50% of all pageblocks into MIGRATE_UNMOVABLE and only have a single page pinned in there. And it wouldn't solve the issue of "how much unmovable memory mapped into processes do we as a kernel think is reasonable for a single user on the system". With something along above lines (and the malicious cases fixed) I think I could sleep better at night with FOLL_LONGTERM being allowed for unprivileged users. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-24 8:57 ` David Hildenbrand 2021-11-24 13:23 ` Jason Gunthorpe @ 2021-11-24 14:37 ` Vlastimil Babka 2021-11-24 14:41 ` David Hildenbrand 1 sibling, 1 reply; 58+ messages in thread From: Vlastimil Babka @ 2021-11-24 14:37 UTC (permalink / raw) To: David Hildenbrand, Jason Gunthorpe Cc: Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On 11/24/21 09:57, David Hildenbrand wrote: > On 24.11.21 00:59, Jason Gunthorpe wrote: >>> Similarly for io-uring we could be migrating pages to be pinned so that >>> the end up consolidated close together, and prevent pathologic >>> situations like in David's reproducer. >> >> It is an interesting idea to have GUP do some kind of THP preserving >> migration. > > > Unfortunately it will only be a band aid AFAIU. I can rewrite my > reproducer fairly easily to pin the whole 2M range first, pin a second > time only a single page, and then unpin the 2M range, resulting in the > very same way to block THP. (I can block some THP less because I always > need the possibility to memlock 2M first, though). Hm I see, then we could also consider making it possible to migrate the pinned pages - of course io-uring would have to be cooperative here, similarly to anything that supports PageMovable. Mlocked pages can also be migrated (I think there's some sysctl to prevent that, if the guarantee of not even a minor pagefault is considered more important). ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-24 14:37 ` Vlastimil Babka @ 2021-11-24 14:41 ` David Hildenbrand 0 siblings, 0 replies; 58+ messages in thread From: David Hildenbrand @ 2021-11-24 14:41 UTC (permalink / raw) To: Vlastimil Babka, Jason Gunthorpe Cc: Jens Axboe, Andrew Dona-Couch, Andrew Morton, Drew DeVault, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On 24.11.21 15:37, Vlastimil Babka wrote: > On 11/24/21 09:57, David Hildenbrand wrote: >> On 24.11.21 00:59, Jason Gunthorpe wrote: >>>> Similarly for io-uring we could be migrating pages to be pinned so that >>>> the end up consolidated close together, and prevent pathologic >>>> situations like in David's reproducer. >>> >>> It is an interesting idea to have GUP do some kind of THP preserving >>> migration. >> >> >> Unfortunately it will only be a band aid AFAIU. I can rewrite my >> reproducer fairly easily to pin the whole 2M range first, pin a second >> time only a single page, and then unpin the 2M range, resulting in the >> very same way to block THP. (I can block some THP less because I always >> need the possibility to memlock 2M first, though). > > Hm I see, then we could also consider making it possible to migrate the > pinned pages - of course io-uring would have to be cooperative here, > similarly to anything that supports PageMovable. I might be wrong but that would then essentially be an actual mlock+mmu notifier mechanism, and no longer FOLL_LONGTERM. And the mlock could actually be done by user space and would be optional. I'd be very happy to see something like that instead ... but so far people don't even agree that it's an issue worth fixing. So ... -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-16 4:35 ` Andrew Morton 2021-11-16 6:32 ` Drew DeVault @ 2021-11-16 18:36 ` Matthew Wilcox 2021-11-16 18:44 ` Drew DeVault 2021-11-16 18:55 ` Jens Axboe 2021-11-17 22:26 ` Johannes Weiner 2 siblings, 2 replies; 58+ messages in thread From: Matthew Wilcox @ 2021-11-16 18:36 UTC (permalink / raw) To: Andrew Morton Cc: Ammar Faizi, Drew DeVault, linux-kernel, linux-api, io_uring Mailing List, Jens Axboe, Pavel Begunkov, linux-mm On Mon, Nov 15, 2021 at 08:35:30PM -0800, Andrew Morton wrote: > I'd also be interested in seeing feedback from the MM developers. [...] > Subject: Increase default MLOCK_LIMIT to 8 MiB On the one hand, processes can already allocate at least this much memory that is non-swappable, just by doing things like opening a lot of files (allocating struct file & fdtable), using a lot of address space (allocating page tables), so I don't have a problem with it per se. On the other hand, 64kB is available on anything larger than an IBM XT. Linux will still boot on machines with 4MB of RAM (eg routers). For someone with a machine with only, say, 32MB of memory, this allows a process to make a quarter of the memory unswappable, and maybe that's not a good idea. So perhaps this should scale over a certain range? Is 8MB a generally useful amount of memory for an iouring user anyway? If you're just playing with it, sure, but if you have, oh i don't know, a database, don't you want to pin the entire cache and allow IO to the whole thing? ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-16 18:36 ` Matthew Wilcox @ 2021-11-16 18:44 ` Drew DeVault 2021-11-16 18:55 ` Jens Axboe 1 sibling, 0 replies; 58+ messages in thread From: Drew DeVault @ 2021-11-16 18:44 UTC (permalink / raw) To: Matthew Wilcox, Andrew Morton Cc: Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Jens Axboe, Pavel Begunkov, linux-mm On Tue Nov 16, 2021 at 7:36 PM CET, Matthew Wilcox wrote: > On the one hand, processes can already allocate at least this much > memory that is non-swappable, just by doing things like opening a lot of > files (allocating struct file & fdtable), using a lot of address space > (allocating page tables), so I don't have a problem with it per se. > > On the other hand, 64kB is available on anything larger than an IBM XT. > Linux will still boot on machines with 4MB of RAM (eg routers). For > someone with a machine with only, say, 32MB of memory, this allows a > process to make a quarter of the memory unswappable, and maybe that's > not a good idea. So perhaps this should scale over a certain range? I feel like most of the uber-small machines which are still relevant are not running arbitrary user code, so, something about an airtight hatch goes here. On the other hand, consider your other hand: you can probably find a way to allocate this much stuff anyway. > Is 8MB a generally useful amount of memory for an iouring user anyway? > If you're just playing with it, sure, but if you have, oh i don't know, > a database, don't you want to pin the entire cache and allow IO to the > whole thing? If you're a database, you're probably running as a daemon with some integration with the service manager, most of which have provisions for tuning the ulimits as necessary. The purpose of this change is to provide an amount which is more useful for end-user programs, which usually cannot adjust their ulimits by any similarly convenient means. 8 MiB is not a lot, but it is enough to allocate a modest handful of read/write buffers for a video game, mail client, or something else along those lines of thought, perhaps specifically narrowing in on the areas which demand the most performance. We could certainly go higher and find an even more useful (but still realistic) value, but I felt it best to err on the side of a more conservative improvement. Honestly, this number could go as high as we want it to and applications would happily take it. ^ permalink raw reply [flat|nested] 58+ messages in thread
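For context on why the kernel-supplied default matters so much here: as far as I can tell the default hard limit is the same MLOCK_LIMIT value, and an unprivileged process can only lift its soft RLIMIT_MEMLOCK up to that hard limit, so there is no self-service escape hatch. A minimal sketch of the most an application can do on its own, assuming standard getrlimit()/setrlimit() semantics:

/* Sketch: report the memlock rlimit and lift the soft limit to the hard
 * limit -- the most an unprivileged process can do by itself. Anything
 * beyond the hard limit needs a privileged parent (service manager,
 * pam_limits, ...) to have raised it first. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
		perror("getrlimit");
		return 1;
	}
	printf("RLIMIT_MEMLOCK soft=%llu hard=%llu\n",
	       (unsigned long long)rl.rlim_cur,
	       (unsigned long long)rl.rlim_max);

	rl.rlim_cur = rl.rlim_max;	/* soft = hard is the ceiling */
	if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
		perror("setrlimit");
		return 1;
	}
	return 0;
}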
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-16 18:36 ` Matthew Wilcox 2021-11-16 18:44 ` Drew DeVault @ 2021-11-16 18:55 ` Jens Axboe 2021-11-16 19:21 ` Vito Caputo 1 sibling, 1 reply; 58+ messages in thread From: Jens Axboe @ 2021-11-16 18:55 UTC (permalink / raw) To: Matthew Wilcox, Andrew Morton Cc: Ammar Faizi, Drew DeVault, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On 11/16/21 11:36 AM, Matthew Wilcox wrote: > On Mon, Nov 15, 2021 at 08:35:30PM -0800, Andrew Morton wrote: >> I'd also be interested in seeing feedback from the MM developers. > [...] >> Subject: Increase default MLOCK_LIMIT to 8 MiB > > On the one hand, processes can already allocate at least this much > memory that is non-swappable, just by doing things like opening a lot of > files (allocating struct file & fdtable), using a lot of address space > (allocating page tables), so I don't have a problem with it per se. > > On the other hand, 64kB is available on anything larger than an IBM XT. > Linux will still boot on machines with 4MB of RAM (eg routers). For > someone with a machine with only, say, 32MB of memory, this allows a > process to make a quarter of the memory unswappable, and maybe that's > not a good idea. So perhaps this should scale over a certain range? > > Is 8MB a generally useful amount of memory for an iouring user anyway? > If you're just playing with it, sure, but if you have, oh i don't know, > a database, don't you want to pin the entire cache and allow IO to the > whole thing? 8MB is plenty for most casual use cases, which is exactly the ones that we want to "just work" without requiring weird system level modifications to increase the memlock limit. For db etc server setups, you're going to be mucking with the setup anyway, and then I don't see it as a big problem that you'll need to increase it further. Because yes, that won't fit within 8MB if you start doing registered buffers. -- Jens Axboe ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-16 18:55 ` Jens Axboe @ 2021-11-16 19:21 ` Vito Caputo 2021-11-16 19:25 ` Drew DeVault 2021-11-16 19:41 ` Jens Axboe 0 siblings, 2 replies; 58+ messages in thread From: Vito Caputo @ 2021-11-16 19:21 UTC (permalink / raw) To: Jens Axboe Cc: Matthew Wilcox, Andrew Morton, Ammar Faizi, Drew DeVault, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On Tue, Nov 16, 2021 at 11:55:41AM -0700, Jens Axboe wrote: > On 11/16/21 11:36 AM, Matthew Wilcox wrote: > > On Mon, Nov 15, 2021 at 08:35:30PM -0800, Andrew Morton wrote: > >> I'd also be interested in seeing feedback from the MM developers. > > [...] > >> Subject: Increase default MLOCK_LIMIT to 8 MiB > > > > On the one hand, processes can already allocate at least this much > > memory that is non-swappable, just by doing things like opening a lot of > > files (allocating struct file & fdtable), using a lot of address space > > (allocating page tables), so I don't have a problem with it per se. > > > > On the other hand, 64kB is available on anything larger than an IBM XT. > > Linux will still boot on machines with 4MB of RAM (eg routers). For > > someone with a machine with only, say, 32MB of memory, this allows a > > process to make a quarter of the memory unswappable, and maybe that's > > not a good idea. So perhaps this should scale over a certain range? > > > > Is 8MB a generally useful amount of memory for an iouring user anyway? > > If you're just playing with it, sure, but if you have, oh i don't know, > > a database, don't you want to pin the entire cache and allow IO to the > > whole thing? > > 8MB is plenty for most casual use cases, which is exactly the ones that > we want to "just work" without requiring weird system level > modifications to increase the memlock limit. > Considering a single fullscreen 32bpp 4K-resolution framebuffer is ~32MiB, I'm not convinced this is really correct in nearly 2022. If we're going to bump the default at the kernel, I'm with Matthew on making it autoscale within a sane range, depending on available memory. As an upper bound I'd probably look at the highest anticipated consumer resolutions, and handle a couple fullscreen 32bpp instances being pinned. Regards, Vito Caputo ^ permalink raw reply [flat|nested] 58+ messages in thread
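For concreteness, the framebuffer figure above works out to 3840 x 2160 pixels x 4 bytes = 33,177,600 bytes, roughly 31.6 MiB per fullscreen 32bpp 4K frame -- about four times the proposed 8 MiB default.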
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-16 19:21 ` Vito Caputo @ 2021-11-16 19:25 ` Drew DeVault 2021-11-16 19:46 ` Vito Caputo 2021-11-16 19:41 ` Jens Axboe 1 sibling, 1 reply; 58+ messages in thread From: Drew DeVault @ 2021-11-16 19:25 UTC (permalink / raw) To: Vito Caputo, Jens Axboe Cc: Matthew Wilcox, Andrew Morton, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On Tue Nov 16, 2021 at 8:21 PM CET, Vito Caputo wrote: > Considering a single fullscreen 32bpp 4K-resolution framebuffer is > ~32MiB, I'm not convinced this is really correct in nearly 2022. Can you name a practical use-case where you'll be doing I/O with uncompressed 4K framebuffers? The kind of I/O which is supported by io_uring, to be specific, not, say, handing it off to libdrm. ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-16 19:25 ` Drew DeVault @ 2021-11-16 19:46 ` Vito Caputo 0 siblings, 0 replies; 58+ messages in thread From: Vito Caputo @ 2021-11-16 19:46 UTC (permalink / raw) To: Drew DeVault Cc: Jens Axboe, Matthew Wilcox, Andrew Morton, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On Tue, Nov 16, 2021 at 08:25:33PM +0100, Drew DeVault wrote: > On Tue Nov 16, 2021 at 8:21 PM CET, Vito Caputo wrote: > > Considering a single fullscreen 32bpp 4K-resolution framebuffer is > > ~32MiB, I'm not convinced this is really correct in nearly 2022. > > Can you name a practical use-case where you'll be doing I/O with > uncompressed 4K framebuffers? The kind of I/O which is supported by > io_uring, to be specific, not, say, handing it off to libdrm. Obviously video/image editing software tends to operate on raw frames, and sometimes even persists them via filesystems. I haven't given it a lot of thought, but a framebuffer is a commonly used unit of memory allocation in the CPU-side code I've written over the years. Being able to pin those for something like io_uring (or some other DMA-possible interface) seems like an obvious memory-hungry thing to consider here if we're talking default upper bounds. Regards, Vito Caputo ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-16 19:21 ` Vito Caputo 2021-11-16 19:25 ` Drew DeVault @ 2021-11-16 19:41 ` Jens Axboe 1 sibling, 0 replies; 58+ messages in thread From: Jens Axboe @ 2021-11-16 19:41 UTC (permalink / raw) To: Vito Caputo Cc: Matthew Wilcox, Andrew Morton, Ammar Faizi, Drew DeVault, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On 11/16/21 12:21 PM, Vito Caputo wrote: > On Tue, Nov 16, 2021 at 11:55:41AM -0700, Jens Axboe wrote: >> On 11/16/21 11:36 AM, Matthew Wilcox wrote: >>> On Mon, Nov 15, 2021 at 08:35:30PM -0800, Andrew Morton wrote: >>>> I'd also be interested in seeing feedback from the MM developers. >>> [...] >>>> Subject: Increase default MLOCK_LIMIT to 8 MiB >>> >>> On the one hand, processes can already allocate at least this much >>> memory that is non-swappable, just by doing things like opening a lot of >>> files (allocating struct file & fdtable), using a lot of address space >>> (allocating page tables), so I don't have a problem with it per se. >>> >>> On the other hand, 64kB is available on anything larger than an IBM XT. >>> Linux will still boot on machines with 4MB of RAM (eg routers). For >>> someone with a machine with only, say, 32MB of memory, this allows a >>> process to make a quarter of the memory unswappable, and maybe that's >>> not a good idea. So perhaps this should scale over a certain range? >>> >>> Is 8MB a generally useful amount of memory for an iouring user anyway? >>> If you're just playing with it, sure, but if you have, oh i don't know, >>> a database, don't you want to pin the entire cache and allow IO to the >>> whole thing? >> >> 8MB is plenty for most casual use cases, which is exactly the ones that >> we want to "just work" without requiring weird system level >> modifications to increase the memlock limit. >> > > Considering a single fullscreen 32bpp 4K-resolution framebuffer is > ~32MiB, I'm not convinced this is really correct in nearly 2022. You don't need to register any buffers, and I don't expect any basic use cases to do so. Which means that the 8MB just needs to cover the ring itself, and you can fit a _lot_ of rings into 8MB. The memlock limit only applies to buffers if you register them, not for any "normal" use cases where you just pass buffers for read/write or O_DIRECT read/write. > If we're going to bump the default at the kernel, I'm with Matthew on > making it autoscale within a sane range, depending on available > memory. I just don't want to turn this into a bikeshedding conversation. I'm fine with making it autoscale obviously, but who's going to do the work? > As an upper bound I'd probably look at the highest anticipated > consumer resolutions, and handle a couple fullscreen 32bpp instances > being pinned. Not sure I see the relevance here. -- Jens Axboe ^ permalink raw reply [flat|nested] 58+ messages in thread
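To make that distinction concrete, a small sketch assuming liburing (the file path is a placeholder): the ring setup itself only costs a little of the memlock budget, a plain read uses an ordinary unpinned heap buffer, and it is only the explicit io_uring_register_buffers() call that pins pages against the limit being discussed here:

/* Sketch: a "normal" io_uring read does not lock its data buffer; only
 * the registered (fixed) buffer at the end is pinned and charged against
 * RLIMIT_MEMLOCK. "/tmp/example" is a made-up path; error handling is
 * trimmed for brevity. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char *plain = malloc(4096);
	int fd = open("/tmp/example", O_RDONLY);

	if (!plain || fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	/* Normal read: the data buffer is ordinary, swappable memory. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, plain, 4096, 0);
	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	io_uring_cqe_seen(&ring, cqe);

	/* Fixed buffer: this registration pins 64 KiB and is what the
	 * memlock limit actually constrains; under the old 64 KiB default
	 * it would already bump into the limit. */
	struct iovec fixed = { .iov_base = malloc(64 * 1024),
			       .iov_len = 64 * 1024 };
	int ret = io_uring_register_buffers(&ring, &fixed, 1);
	if (ret < 0)
		fprintf(stderr, "register_buffers: %d\n", ret);

	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}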
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-16 4:35 ` Andrew Morton 2021-11-16 6:32 ` Drew DeVault 2021-11-16 18:36 ` Matthew Wilcox @ 2021-11-17 22:26 ` Johannes Weiner 2021-11-17 23:17 ` Jens Axboe 2 siblings, 1 reply; 58+ messages in thread From: Johannes Weiner @ 2021-11-17 22:26 UTC (permalink / raw) To: Andrew Morton Cc: Ammar Faizi, Drew DeVault, linux-kernel, linux-api, io_uring Mailing List, Jens Axboe, Pavel Begunkov, linux-mm On Mon, Nov 15, 2021 at 08:35:30PM -0800, Andrew Morton wrote: > On Sat, 6 Nov 2021 14:12:45 +0700 Ammar Faizi <[email protected]> wrote: > > > On 11/6/21 2:05 PM, Drew DeVault wrote: > > > Should I send a v2 or is this email sufficient: > > > > > > Signed-off-by: Drew DeVault <[email protected]> > > > > Oops, I missed akpm from the CC list. Added Andrew. > > > > Cc: Andrew Morton <[email protected]> > > Ref: https://lore.kernel.org/io-uring/CFII8LNSW5XH.3OTIVFYX8P65Y@taiga/ > > Let's cc linux-mm as well. > > > Unfortunately I didn't know about this until Nov 4, which was formally > too late for 5.16. I guess I could try to sneak it past Linus if > someone were to send me some sufficiently convincing words explaining > the urgency. > > I'd also be interested in seeing feedback from the MM developers. > > And a question: rather than messing around with a constant which will > need to be increased again in a couple of years, can we solve this one > and for all? For example, permit root to set the system-wide > per-process max mlock size and depend upon initscripts to do this > appropriately. My take is that as long as the kernel sets some limit per default on this at all, it should be one that works for common workloads. Today this isn't the case. We've recently switched our initscripts at FB to set the default to 0.1% of total RAM. The impetus for this was a subtle but widespread issue where we failed to mmap the PERF_COUNT_SW_TASK_CLOCK event counter (perf event mmap also uses RLIMIT_MEMLOCK!) and silently fell back to the much less efficient clock_gettime() syscall. Because the failure mode was subtle and annoying, we didn't just want to raise the limit, but raise it so that no reasonable application would run into it, and only buggy or malicious ones would. And IMO, that's really what rlimits should be doing: catching clearly bogus requests, not trying to do fine-grained resource control. For more reasonable overuse that ends up causing memory pressure, the OOM killer will do the right thing since the pages still belong to tasks. So 0.1% of the machine seemed like a good default formula for that. And it would be a bit more future proof too. On my 32G desktop machine, that would be 32M. For comparison, the default process rlimit on that machine is ~120k, which comes out to ~2G worth of kernel stack, which also isn't reclaimable without OOM... > From: Drew DeVault <[email protected]> > Subject: Increase default MLOCK_LIMIT to 8 MiB > > This limit has not been updated since 2008, when it was increased to 64 > KiB at the request of GnuPG. Until recently, the main use-cases for this > feature were (1) preventing sensitive memory from being swapped, as in > GnuPG's use-case; and (2) real-time use-cases. In the first case, little > memory is called for, and in the second case, the user is generally in a > position to increase it if they need more. > > The introduction of IOURING_REGISTER_BUFFERS adds a third use-case: > preparing fixed buffers for high-performance I/O.
This use-case will take > as much of this memory as it can get, but is still limited to 64 KiB by > default, which is very little. This increases the limit to 8 MB, which > was chosen fairly arbitrarily as a more generous, but still conservative, > default value. > > It is also possible to raise this limit in userspace. This is easily > done, for example, in the use-case of a network daemon: systemd, for > instance, provides for this via LimitMEMLOCK in the service file; OpenRC > via the rc_ulimit variables. However, there is no established userspace > facility for configuring this outside of daemons: end-user applications do > not presently have access to a convenient means of raising their limits. > > The buck, as it were, stops with the kernel. It's much easier to address > it here than it is to bring it to hundreds of distributions, and it can > only realistically be relied upon to be high-enough by end-user software > if it is more-or-less ubiquitous. Most distros don't change this > particular rlimit from the kernel-supplied default value, so a change here > will easily provide that ubiquity. > > Link: https://lkml.kernel.org/r/[email protected] > Signed-off-by: Drew DeVault <[email protected]> > Acked-by: Jens Axboe <[email protected]> > Acked-by: Cyril Hrubis <[email protected]> > Cc: Pavel Begunkov <[email protected]> > Signed-off-by: Andrew Morton <[email protected]> Acked-by: Johannes Weiner <[email protected]> As per above, I think basing it off of RAM size would be better, but this increase is overdue given all the new users beyond mlock(), and 8M is much better than the current value. ^ permalink raw reply [flat|nested] 58+ messages in thread
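As a rough illustration of the 0.1%-of-RAM idea above, a sketch of what a privileged init-style helper could compute and install before spawning a workload (a sketch only, not the initscript referred to above; raising the hard limit needs root or CAP_SYS_RESOURCE):

/* Sketch: set RLIMIT_MEMLOCK to ~0.1% of total RAM, but never below the
 * historical 64 KiB. Must run privileged to raise the hard limit; a real
 * helper would then exec the workload so the limit is inherited. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/sysinfo.h>

int main(void)
{
	struct sysinfo si;
	struct rlimit rl;
	unsigned long long total, limit;

	if (sysinfo(&si) != 0) {
		perror("sysinfo");
		return 1;
	}

	total = (unsigned long long)si.totalram * si.mem_unit;
	limit = total / 1000;			/* ~0.1% of RAM */
	if (limit < 64 * 1024)
		limit = 64 * 1024;

	rl.rlim_cur = rl.rlim_max = limit;
	if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
		perror("setrlimit");
		return 1;
	}
	printf("RLIMIT_MEMLOCK = %llu bytes\n", limit);
	return 0;
}

On a 32 GiB machine this comes out to roughly 32 MB, matching the figure mentioned above.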
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-17 22:26 ` Johannes Weiner @ 2021-11-17 23:17 ` Jens Axboe 2021-11-18 21:58 ` Andrew Morton 0 siblings, 1 reply; 58+ messages in thread From: Jens Axboe @ 2021-11-17 23:17 UTC (permalink / raw) To: Johannes Weiner, Andrew Morton Cc: Ammar Faizi, Drew DeVault, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On 11/17/21 3:26 PM, Johannes Weiner wrote: >> Link: https://lkml.kernel.org/r/[email protected] >> Signed-off-by: Drew DeVault <[email protected]> >> Acked-by: Jens Axboe <[email protected]> >> Acked-by: Cyril Hrubis <[email protected]> >> Cc: Pavel Begunkov <[email protected]> >> Signed-off-by: Andrew Morton <[email protected]> > > Acked-by: Johannes Weiner <[email protected]> > > As per above, I think basing it off of RAM size would be better, but > this increase is overdue given all the new users beyond mlock(), and > 8M is much better than the current value. That's basically my reasoning too. Let's just get something going that will at least unblock some valid use cases, and not get bogged down with aiming for perfection. The latter can happen in parallel, but it should not hold it up imho. -- Jens Axboe ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-17 23:17 ` Jens Axboe @ 2021-11-18 21:58 ` Andrew Morton 2021-11-19 7:41 ` Drew DeVault 0 siblings, 1 reply; 58+ messages in thread From: Andrew Morton @ 2021-11-18 21:58 UTC (permalink / raw) To: Jens Axboe Cc: Johannes Weiner, Ammar Faizi, Drew DeVault, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On Wed, 17 Nov 2021 16:17:26 -0700 Jens Axboe <[email protected]> wrote: > On 11/17/21 3:26 PM, Johannes Weiner wrote: > >> Link: https://lkml.kernel.org/r/[email protected] > >> Signed-off-by: Drew DeVault <[email protected]> > >> Acked-by: Jens Axboe <[email protected]> > >> Acked-by: Cyril Hrubis <[email protected]> > >> Cc: Pavel Begunkov <[email protected]> > >> Signed-off-by: Andrew Morton <[email protected]> > > > > Acked-by: Johannes Weiner <[email protected]> > > > > As per above, I think basing it off of RAM size would be better, but > > this increase is overdue given all the new users beyond mlock(), and > > 8M is much better than the current value. > > That's basically my reasoning too. Let's just get something going that > will at least unblock some valid use cases, and not get bogged down with > aiming for perfection. The latter can happen in parallel, but it should > not hold it up imho. Nobody's aiming for perfection. We're discussing aiming for "better". What we should have done on day one was to set the default MLOCK_LIMIT to zero bytes. Then everyone would have infrastructure to tune it from userspace and we wouldn't ever have this discussion. ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB 2021-11-18 21:58 ` Andrew Morton @ 2021-11-19 7:41 ` Drew DeVault 0 siblings, 0 replies; 58+ messages in thread From: Drew DeVault @ 2021-11-19 7:41 UTC (permalink / raw) To: Andrew Morton, Jens Axboe Cc: Johannes Weiner, Ammar Faizi, linux-kernel, linux-api, io_uring Mailing List, Pavel Begunkov, linux-mm On Thu Nov 18, 2021 at 10:58 PM CET, Andrew Morton wrote: > Nobody's aiming for perfection. We're discussing aiming for "better". > > What we should have done on day one was to set the default MLOCK_LIMIT > to zero bytes. Then everyone would have infrastructure to tune it from > userspace and we wouldn't ever have this discussion. Setting aside perfection or not, what you're aiming for is about 1000× more work. I'm not prepared to do that work. I'm not going to paint this same bikeshed 100 times for each Linux distro we have to convince to adopt a more sophisticated solution. ^ permalink raw reply [flat|nested] 58+ messages in thread
end of thread, other threads:[~2021-11-30 15:53 UTC | newest] Thread overview: 58+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2021-10-28 8:08 [PATCH] Increase default MLOCK_LIMIT to 8 MiB Drew DeVault 2021-10-28 18:22 ` Jens Axboe 2021-11-04 14:27 ` Cyril Hrubis 2021-11-04 14:44 ` Jens Axboe 2021-11-06 2:33 ` Ammar Faizi 2021-11-06 7:05 ` Drew DeVault 2021-11-06 7:12 ` Ammar Faizi 2021-11-16 4:35 ` Andrew Morton 2021-11-16 6:32 ` Drew DeVault 2021-11-16 19:47 ` Andrew Morton 2021-11-16 19:48 ` Drew DeVault 2021-11-16 21:37 ` Andrew Morton 2021-11-17 8:23 ` Drew DeVault 2021-11-22 17:11 ` David Hildenbrand 2021-11-22 17:55 ` Andrew Dona-Couch 2021-11-22 18:26 ` David Hildenbrand 2021-11-22 19:53 ` Jens Axboe 2021-11-22 20:03 ` Matthew Wilcox 2021-11-22 20:04 ` Jens Axboe 2021-11-22 20:08 ` David Hildenbrand 2021-11-22 20:44 ` Jens Axboe 2021-11-22 21:56 ` David Hildenbrand 2021-11-23 12:02 ` David Hildenbrand 2021-11-23 13:25 ` Jason Gunthorpe 2021-11-23 13:39 ` David Hildenbrand 2021-11-23 14:07 ` Jason Gunthorpe 2021-11-23 14:44 ` David Hildenbrand 2021-11-23 17:00 ` Jason Gunthorpe 2021-11-23 17:04 ` David Hildenbrand 2021-11-23 22:04 ` Vlastimil Babka 2021-11-23 23:59 ` Jason Gunthorpe 2021-11-24 8:57 ` David Hildenbrand 2021-11-24 13:23 ` Jason Gunthorpe 2021-11-24 13:25 ` David Hildenbrand 2021-11-24 13:28 ` Jason Gunthorpe 2021-11-24 13:29 ` David Hildenbrand 2021-11-24 13:48 ` Jason Gunthorpe 2021-11-24 14:14 ` David Hildenbrand 2021-11-24 15:34 ` Jason Gunthorpe 2021-11-24 16:43 ` David Hildenbrand 2021-11-24 18:35 ` Jason Gunthorpe 2021-11-24 19:09 ` David Hildenbrand 2021-11-24 23:11 ` Jason Gunthorpe 2021-11-30 15:52 ` David Hildenbrand 2021-11-24 18:37 ` David Hildenbrand 2021-11-24 14:37 ` Vlastimil Babka 2021-11-24 14:41 ` David Hildenbrand 2021-11-16 18:36 ` Matthew Wilcox 2021-11-16 18:44 ` Drew DeVault 2021-11-16 18:55 ` Jens Axboe 2021-11-16 19:21 ` Vito Caputo 2021-11-16 19:25 ` Drew DeVault 2021-11-16 19:46 ` Vito Caputo 2021-11-16 19:41 ` Jens Axboe 2021-11-17 22:26 ` Johannes Weiner 2021-11-17 23:17 ` Jens Axboe 2021-11-18 21:58 ` Andrew Morton 2021-11-19 7:41 ` Drew DeVault