public inbox for [email protected]
* [RFC PATCH] mm/madvise: remove redundant mmap_lock operations from process_madvise()
@ 2025-01-11  0:46 SeongJae Park
  2025-01-14 18:13 ` Shakeel Butt
  2025-01-15  4:35 ` Honggyu Kim
  0 siblings, 2 replies; 10+ messages in thread
From: SeongJae Park @ 2025-01-11  0:46 UTC (permalink / raw)
  Cc: SeongJae Park, Andrew Morton, Jens Axboe, Pavel Begunkov, damon,
	io-uring, linux-kernel, linux-mm

process_madvise() calls do_madvise() for each address range.  Then, each
do_madvise() invocation holds and releases the same mmap_lock.  Optimize
the redundant lock operations by doing the locking in process_madvise(),
and inform do_madvise() that the lock is already held so that the
locking can be skipped.
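
Conceptually, for the batched case this changes

    /* before: every do_madvise() call takes and drops mmap_lock */
    while (iov_iter_count(iter))
        ret = do_madvise(mm, addr, len, behavior);

into

    /* after: one lock/unlock cycle around the whole batch */
    mmap_read_lock(mm);  /* or the write lock, per madvise_need_mmap_write() */
    while (iov_iter_count(iter))
        ret = do_madvise(mm, addr, len, behavior, true);  /* lock held */
    mmap_read_unlock(mm);

where 'addr' and 'len' stand for the current iovec entry and error
handling is omitted; see the full diff below.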

Evaluation
==========

I measured the time to apply MADV_DONTNEED advice to 256 MiB of memory
using multiple madvise() calls, 4 KiB per call.  I also did the same
with process_madvise(), but with the iovec size varying from 1 to 1024.
The source code for the measurement is available at GitHub[1].
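
For reference, a minimal userspace sketch of the batched case could look
like the below.  This is illustrative only - it is not the benchmark
linked above - and it assumes a pidfd for the target process from
pidfd_open() and a 4 KiB page size:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Apply MADV_DONTNEED to 'len' bytes at 'buf', 'batch' pages per
     * process_madvise() call. */
    static int advise_in_batches(int pidfd, char *buf, size_t len,
                                 size_t batch)
    {
            struct iovec iov[1024];  /* 'batch' assumed <= 1024 (UIO_MAXIOV) */
            size_t off = 0;

            while (off < len) {
                    size_t i;

                    for (i = 0; i < batch && off < len; i++, off += 4096) {
                            iov[i].iov_base = buf + off;
                            iov[i].iov_len = 4096;
                    }
                    if (syscall(__NR_process_madvise, pidfd, iov, i,
                                MADV_DONTNEED, 0) < 0)
                            return -1;
            }
            return 0;
    }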

The measurement results are as below.  The 'sz_batches' column shows the
iovec size of the process_madvise() calls.  '0' is for the madvise()
calls case.  The 'before' and 'after' columns are the measured times to
apply MADV_DONTNEED to the 256 MiB memory buffer in nanoseconds, on
kernels built without and with this patch, respectively.  So a lower
value means better efficiency.  The 'after/before' column is the ratio
of 'after' to 'before'.

    sz_batches  before     after      after/before
    0           124062365  96670188   0.779206393494111
    1           136341258  113915688  0.835518827323714
    2           105314942  78898211   0.749164453796119
    4           82012858   59778998   0.728897875989153
    8           82562651   51003069   0.617749895167489
    16          71474930   47575960   0.665631431888076
    32          71391211   42902076   0.600943385033768
    64          68225932   41337835   0.605896230190011
    128         71053578   42467240   0.597679120395598
    256         85094126   41630463   0.489228398679364
    512         68531628   44049763   0.6427654542221
    1024        79338892   43370866   0.546653285755491

The measurement shows this patch reduces the process_madvise() latency
in proportion to the batch size, from about 25% with batch size 2 to
about 45% with batch size 1,024.  The trend is roughly what we would
expect.

Interestingly, though, this patch also optimizes the madvise() and
single-batch-size process_madvise() cases.  I ran this test multiple
times, and the results are consistent.  I'm still investigating whether
there is something I'm missing, but I believe the investigation need not
block this RFC, so I'm just posting it now.  I will add updates on the
madvise() and single-batch-size process_madvise() investigation later.

[1] https://github.com/sjp38/eval_proc_madvise

Signed-off-by: SeongJae Park <[email protected]>
---
 include/linux/mm.h |  3 ++-
 io_uring/advise.c  |  2 +-
 mm/damon/vaddr.c   |  2 +-
 mm/madvise.c       | 54 +++++++++++++++++++++++++++++++++++-----------
 4 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 612b513ebfbd..e3ca5967ebd4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3459,7 +3459,8 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		    unsigned long end, struct list_head *uf, bool unlock);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
-extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
+extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
+		int behavior, bool lock_held);
 
 #ifdef CONFIG_MMU
 extern int __mm_populate(unsigned long addr, unsigned long len,
diff --git a/io_uring/advise.c b/io_uring/advise.c
index cb7b881665e5..010b55d5a26e 100644
--- a/io_uring/advise.c
+++ b/io_uring/advise.c
@@ -56,7 +56,7 @@ int io_madvise(struct io_kiocb *req, unsigned int issue_flags)
 
 	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
 
-	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice);
+	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice, false);
 	io_req_set_res(req, ret, 0);
 	return IOU_OK;
 #else
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index a6174f725bd7..30b5a251d73e 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -646,7 +646,7 @@ static unsigned long damos_madvise(struct damon_target *target,
 	if (!mm)
 		return 0;
 
-	applied = do_madvise(mm, start, len, behavior) ? 0 : len;
+	applied = do_madvise(mm, start, len, behavior, false) ? 0 : len;
 	mmput(mm);
 
 	return applied;
diff --git a/mm/madvise.c b/mm/madvise.c
index 49f3a75046f6..c107376db9d5 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1637,7 +1637,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  -EAGAIN - a kernel resource was temporarily unavailable.
  *  -EPERM  - memory is sealed.
  */
-int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
+int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
+		int behavior, bool lock_held)
 {
 	unsigned long end;
 	int error;
@@ -1668,12 +1669,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 		return madvise_inject_error(behavior, start, start + len_in);
 #endif
 
-	write = madvise_need_mmap_write(behavior);
-	if (write) {
-		if (mmap_write_lock_killable(mm))
-			return -EINTR;
-	} else {
-		mmap_read_lock(mm);
+	if (!lock_held) {
+		write = madvise_need_mmap_write(behavior);
+		if (write) {
+			if (mmap_write_lock_killable(mm))
+				return -EINTR;
+		} else {
+			mmap_read_lock(mm);
+		}
 	}
 
 	start = untagged_addr_remote(mm, start);
@@ -1692,17 +1695,19 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	}
 	blk_finish_plug(&plug);
 
-	if (write)
-		mmap_write_unlock(mm);
-	else
-		mmap_read_unlock(mm);
+	if (!lock_held) {
+		if (write)
+			mmap_write_unlock(mm);
+		else
+			mmap_read_unlock(mm);
+	}
 
 	return error;
 }
 
 SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
-	return do_madvise(current->mm, start, len_in, behavior);
+	return do_madvise(current->mm, start, len_in, behavior, false);
 }
 
 /* Perform an madvise operation over a vector of addresses and lengths. */
@@ -1711,12 +1716,28 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 {
 	ssize_t ret = 0;
 	size_t total_len;
+	bool hold_lock = true;
+	int write;
 
 	total_len = iov_iter_count(iter);
 
+#ifdef CONFIG_MEMORY_FAILURE
+	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
+		hold_lock = false;
+#endif
+	if (hold_lock) {
+		write = madvise_need_mmap_write(behavior);
+		if (write) {
+			if (mmap_write_lock_killable(mm))
+				return -EINTR;
+		} else {
+			mmap_read_lock(mm);
+		}
+	}
+
 	while (iov_iter_count(iter)) {
 		ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
-				 iter_iov_len(iter), behavior);
+				 iter_iov_len(iter), behavior, hold_lock);
 		/*
 		 * An madvise operation is attempting to restart the syscall,
 		 * but we cannot proceed as it would not be correct to repeat
@@ -1739,6 +1760,13 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
 		iov_iter_advance(iter, iter_iov_len(iter));
 	}
 
+	if (hold_lock) {
+		if (write)
+			mmap_write_unlock(mm);
+		else
+			mmap_read_unlock(mm);
+	}
+
 	ret = (total_len - iov_iter_count(iter)) ? : ret;
 
 	return ret;
-- 
2.39.5

* Re: [RFC PATCH] mm/madvise: remove redundant mmap_lock operations from process_madvise()
  2025-01-11  0:46 [RFC PATCH] mm/madvise: remove redundant mmap_lock operations from process_madvise() SeongJae Park
@ 2025-01-14 18:13 ` Shakeel Butt
  2025-01-14 18:47   ` Lorenzo Stoakes
  2025-01-15  3:44   ` Liam R. Howlett
  2025-01-15  4:35 ` Honggyu Kim
  1 sibling, 2 replies; 10+ messages in thread
From: Shakeel Butt @ 2025-01-14 18:13 UTC (permalink / raw)
  To: SeongJae Park, David Hildenbrand, Lorenzo Stoakes, Liam.Howlett,
	Vlastimil Babka
  Cc: Andrew Morton, Jens Axboe, Pavel Begunkov, damon, io-uring,
	linux-kernel, linux-mm

Ccing relevant folks.

On Fri, Jan 10, 2025 at 04:46:18PM -0800, SeongJae Park wrote:
> process_madvise() calls do_madvise() for each address range.  Then, each
> do_madvise() invocation holds and releases the same mmap_lock.  Optimize
> the redundant lock operations by doing the locking in process_madvise(),
> and inform do_madvise() that the lock is already held so that the
> locking can be skipped.
> 
> Evaluation
> ==========
> 
> I measured the time to apply MADV_DONTNEED advice to 256 MiB of memory
> using multiple madvise() calls, 4 KiB per call.  I also did the same
> with process_madvise(), but with the iovec size varying from 1 to 1024.
> The source code for the measurement is available at GitHub[1].
> 
> The measurement results are as below.  The 'sz_batches' column shows the
> iovec size of the process_madvise() calls.  '0' is for the madvise()
> calls case.  The 'before' and 'after' columns are the measured times to
> apply MADV_DONTNEED to the 256 MiB memory buffer in nanoseconds, on
> kernels built without and with this patch, respectively.  So a lower
> value means better efficiency.  The 'after/before' column is the ratio
> of 'after' to 'before'.
> 
>     sz_batches  before     after      after/before
>     0           124062365  96670188   0.779206393494111
>     1           136341258  113915688  0.835518827323714
>     2           105314942  78898211   0.749164453796119
>     4           82012858   59778998   0.728897875989153
>     8           82562651   51003069   0.617749895167489
>     16          71474930   47575960   0.665631431888076
>     32          71391211   42902076   0.600943385033768
>     64          68225932   41337835   0.605896230190011
>     128         71053578   42467240   0.597679120395598
>     256         85094126   41630463   0.489228398679364
>     512         68531628   44049763   0.6427654542221
>     1024        79338892   43370866   0.546653285755491
> 
> The measurement shows this patch reduces the process_madvise() latency
> in proportion to the batch size, from about 25% with batch size 2 to
> about 45% with batch size 1,024.  The trend is roughly what we would
> expect.
> 
> Interestingly, though, this patch also optimizes the madvise() and
> single-batch-size process_madvise() cases.  I ran this test multiple
> times, and the results are consistent.  I'm still investigating whether
> there is something I'm missing, but I believe the investigation need not
> block this RFC, so I'm just posting it now.  I will add updates on the
> madvise() and single-batch-size process_madvise() investigation later.
> 
> [1] https://github.com/sjp38/eval_proc_madvise
> 
> Signed-off-by: SeongJae Park <[email protected]>
> ---
>  include/linux/mm.h |  3 ++-
>  io_uring/advise.c  |  2 +-
>  mm/damon/vaddr.c   |  2 +-
>  mm/madvise.c       | 54 +++++++++++++++++++++++++++++++++++-----------
>  4 files changed, 45 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 612b513ebfbd..e3ca5967ebd4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3459,7 +3459,8 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  		    unsigned long end, struct list_head *uf, bool unlock);
>  extern int do_munmap(struct mm_struct *, unsigned long, size_t,
>  		     struct list_head *uf);
> -extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
> +extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> +		int behavior, bool lock_held);
>  
>  #ifdef CONFIG_MMU
>  extern int __mm_populate(unsigned long addr, unsigned long len,
> diff --git a/io_uring/advise.c b/io_uring/advise.c
> index cb7b881665e5..010b55d5a26e 100644
> --- a/io_uring/advise.c
> +++ b/io_uring/advise.c
> @@ -56,7 +56,7 @@ int io_madvise(struct io_kiocb *req, unsigned int issue_flags)
>  
>  	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
>  
> -	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice);
> +	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice, false);
>  	io_req_set_res(req, ret, 0);
>  	return IOU_OK;
>  #else
> diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
> index a6174f725bd7..30b5a251d73e 100644
> --- a/mm/damon/vaddr.c
> +++ b/mm/damon/vaddr.c
> @@ -646,7 +646,7 @@ static unsigned long damos_madvise(struct damon_target *target,
>  	if (!mm)
>  		return 0;
>  
> -	applied = do_madvise(mm, start, len, behavior) ? 0 : len;
> +	applied = do_madvise(mm, start, len, behavior, false) ? 0 : len;
>  	mmput(mm);
>  
>  	return applied;
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 49f3a75046f6..c107376db9d5 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1637,7 +1637,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *  -EAGAIN - a kernel resource was temporarily unavailable.
>   *  -EPERM  - memory is sealed.
>   */
> -int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> +int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> +		int behavior, bool lock_held)
>  {
>  	unsigned long end;
>  	int error;
> @@ -1668,12 +1669,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>  		return madvise_inject_error(behavior, start, start + len_in);
>  #endif
>  
> -	write = madvise_need_mmap_write(behavior);
> -	if (write) {
> -		if (mmap_write_lock_killable(mm))
> -			return -EINTR;
> -	} else {
> -		mmap_read_lock(mm);
> +	if (!lock_held) {
> +		write = madvise_need_mmap_write(behavior);
> +		if (write) {
> +			if (mmap_write_lock_killable(mm))
> +				return -EINTR;
> +		} else {
> +			mmap_read_lock(mm);
> +		}
>  	}
>  
>  	start = untagged_addr_remote(mm, start);
> @@ -1692,17 +1695,19 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>  	}
>  	blk_finish_plug(&plug);
>  
> -	if (write)
> -		mmap_write_unlock(mm);
> -	else
> -		mmap_read_unlock(mm);
> +	if (!lock_held) {
> +		if (write)
> +			mmap_write_unlock(mm);
> +		else
> +			mmap_read_unlock(mm);
> +	}
>  
>  	return error;
>  }
>  
>  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  {
> -	return do_madvise(current->mm, start, len_in, behavior);
> +	return do_madvise(current->mm, start, len_in, behavior, false);
>  }
>  
>  /* Perform an madvise operation over a vector of addresses and lengths. */
> @@ -1711,12 +1716,28 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  {
>  	ssize_t ret = 0;
>  	size_t total_len;
> +	bool hold_lock = true;
> +	int write;
>  
>  	total_len = iov_iter_count(iter);
>  
> +#ifdef CONFIG_MEMORY_FAILURE
> +	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> +		hold_lock = false;
> +#endif
> +	if (hold_lock) {
> +		write = madvise_need_mmap_write(behavior);
> +		if (write) {
> +			if (mmap_write_lock_killable(mm))
> +				return -EINTR;
> +		} else {
> +			mmap_read_lock(mm);
> +		}
> +	}
> +
>  	while (iov_iter_count(iter)) {
>  		ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
> -				 iter_iov_len(iter), behavior);
> +				 iter_iov_len(iter), behavior, hold_lock);
>  		/*
>  		 * An madvise operation is attempting to restart the syscall,
>  		 * but we cannot proceed as it would not be correct to repeat
> @@ -1739,6 +1760,13 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>  		iov_iter_advance(iter, iter_iov_len(iter));
>  	}
>  
> +	if (hold_lock) {
> +		if (write)
> +			mmap_write_unlock(mm);
> +		else
> +			mmap_read_unlock(mm);
> +	}
> +
>  	ret = (total_len - iov_iter_count(iter)) ? : ret;
>  
>  	return ret;
> -- 
> 2.39.5

* Re: [RFC PATCH] mm/madvise: remove redundant mmap_lock operations from process_madvise()
  2025-01-14 18:13 ` Shakeel Butt
@ 2025-01-14 18:47   ` Lorenzo Stoakes
  2025-01-14 19:54     ` SeongJae Park
  2025-01-14 21:55     ` Shakeel Butt
  2025-01-15  3:44   ` Liam R. Howlett
  1 sibling, 2 replies; 10+ messages in thread
From: Lorenzo Stoakes @ 2025-01-14 18:47 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: SeongJae Park, David Hildenbrand, Liam.Howlett, Vlastimil Babka,
	Andrew Morton, Jens Axboe, Pavel Begunkov, damon, io-uring,
	linux-kernel, linux-mm

On Tue, Jan 14, 2025 at 10:13:40AM -0800, Shakeel Butt wrote:
> Ccing relevant folks.

Thanks Shakeel!

A side-note: I really wish there were a better way to get cc'd, since I
fundamentally changed process_madvise() recently and have been the main
person changing this code lately - but on the other hand,
scripts/get_maintainer.pl gets really noisy if you try to use this kind
of stat - so I in no way blame SJ for missing me.

Thankfully Shakeel kindly stepped in to make me aware :)

SJ - I will come back to you later, as it's late here and my brain is fried
- but I was already thinking of doing something _like_ this, as I noticed
for the purposes of self-process_madvise() operations (which I unrestricted
for guard page purposes) - we are hammering locks in a way that we know we
don't necessarily need to do.

So this is serendipitous for me! :) But I need to dig into your actual
implementation to give feedback here.

Will come back to this in due course :)

>
> On Fri, Jan 10, 2025 at 04:46:18PM -0800, SeongJae Park wrote:
> > process_madvise() calls do_madvise() for each address range.  Then, each
> > do_madvise() invocation holds and releases the same mmap_lock.  Optimize
> > the redundant lock operations by doing the locking in process_madvise(),
> > and inform do_madvise() that the lock is already held so that the
> > locking can be skipped.
> >
> > Evaluation
> > ==========
> >
> > I measured the time to apply MADV_DONTNEED advice to 256 MiB of memory
> > using multiple madvise() calls, 4 KiB per call.  I also did the same
> > with process_madvise(), but with the iovec size varying from 1 to 1024.
> > The source code for the measurement is available at GitHub[1].
> >
> > The measurement results are as below.  The 'sz_batches' column shows the
> > iovec size of the process_madvise() calls.  '0' is for the madvise()
> > calls case.  The 'before' and 'after' columns are the measured times to
> > apply MADV_DONTNEED to the 256 MiB memory buffer in nanoseconds, on
> > kernels built without and with this patch, respectively.  So a lower
> > value means better efficiency.  The 'after/before' column is the ratio
> > of 'after' to 'before'.
> >
> >     sz_batches  before     after      after/before
> >     0           124062365  96670188   0.779206393494111
> >     1           136341258  113915688  0.835518827323714
> >     2           105314942  78898211   0.749164453796119
> >     4           82012858   59778998   0.728897875989153
> >     8           82562651   51003069   0.617749895167489
> >     16          71474930   47575960   0.665631431888076
> >     32          71391211   42902076   0.600943385033768
> >     64          68225932   41337835   0.605896230190011
> >     128         71053578   42467240   0.597679120395598
> >     256         85094126   41630463   0.489228398679364
> >     512         68531628   44049763   0.6427654542221
> >     1024        79338892   43370866   0.546653285755491
> >
> > The measurement shows this patch reduces the process_madvise() latency
> > in proportion to the batch size, from about 25% with batch size 2 to
> > about 45% with batch size 1,024.  The trend is roughly what we would
> > expect.
> >
> > Interestingly, though, this patch also optimizes the madvise() and
> > single-batch-size process_madvise() cases.  I ran this test multiple
> > times, and the results are consistent.  I'm still investigating whether
> > there is something I'm missing, but I believe the investigation need not
> > block this RFC, so I'm just posting it now.  I will add updates on the
> > madvise() and single-batch-size process_madvise() investigation later.
> >
> > [1] https://github.com/sjp38/eval_proc_madvise
> >
> > Signed-off-by: SeongJae Park <[email protected]>
> > ---
> >  include/linux/mm.h |  3 ++-
> >  io_uring/advise.c  |  2 +-
> >  mm/damon/vaddr.c   |  2 +-
> >  mm/madvise.c       | 54 +++++++++++++++++++++++++++++++++++-----------
> >  4 files changed, 45 insertions(+), 16 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 612b513ebfbd..e3ca5967ebd4 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3459,7 +3459,8 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  		    unsigned long end, struct list_head *uf, bool unlock);
> >  extern int do_munmap(struct mm_struct *, unsigned long, size_t,
> >  		     struct list_head *uf);
> > -extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
> > +extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> > +		int behavior, bool lock_held);
> >
> >  #ifdef CONFIG_MMU
> >  extern int __mm_populate(unsigned long addr, unsigned long len,
> > diff --git a/io_uring/advise.c b/io_uring/advise.c
> > index cb7b881665e5..010b55d5a26e 100644
> > --- a/io_uring/advise.c
> > +++ b/io_uring/advise.c
> > @@ -56,7 +56,7 @@ int io_madvise(struct io_kiocb *req, unsigned int issue_flags)
> >
> >  	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
> >
> > -	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice);
> > +	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice, false);
> >  	io_req_set_res(req, ret, 0);
> >  	return IOU_OK;
> >  #else
> > diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
> > index a6174f725bd7..30b5a251d73e 100644
> > --- a/mm/damon/vaddr.c
> > +++ b/mm/damon/vaddr.c
> > @@ -646,7 +646,7 @@ static unsigned long damos_madvise(struct damon_target *target,
> >  	if (!mm)
> >  		return 0;
> >
> > -	applied = do_madvise(mm, start, len, behavior) ? 0 : len;
> > +	applied = do_madvise(mm, start, len, behavior, false) ? 0 : len;
> >  	mmput(mm);
> >
> >  	return applied;
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 49f3a75046f6..c107376db9d5 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1637,7 +1637,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >   *  -EAGAIN - a kernel resource was temporarily unavailable.
> >   *  -EPERM  - memory is sealed.
> >   */
> > -int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> > +int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> > +		int behavior, bool lock_held)
> >  {
> >  	unsigned long end;
> >  	int error;
> > @@ -1668,12 +1669,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> >  		return madvise_inject_error(behavior, start, start + len_in);
> >  #endif
> >
> > -	write = madvise_need_mmap_write(behavior);
> > -	if (write) {
> > -		if (mmap_write_lock_killable(mm))
> > -			return -EINTR;
> > -	} else {
> > -		mmap_read_lock(mm);
> > +	if (!lock_held) {
> > +		write = madvise_need_mmap_write(behavior);
> > +		if (write) {
> > +			if (mmap_write_lock_killable(mm))
> > +				return -EINTR;
> > +		} else {
> > +			mmap_read_lock(mm);
> > +		}
> >  	}
> >
> >  	start = untagged_addr_remote(mm, start);
> > @@ -1692,17 +1695,19 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> >  	}
> >  	blk_finish_plug(&plug);
> >
> > -	if (write)
> > -		mmap_write_unlock(mm);
> > -	else
> > -		mmap_read_unlock(mm);
> > +	if (!lock_held) {
> > +		if (write)
> > +			mmap_write_unlock(mm);
> > +		else
> > +			mmap_read_unlock(mm);
> > +	}
> >
> >  	return error;
> >  }
> >
> >  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> >  {
> > -	return do_madvise(current->mm, start, len_in, behavior);
> > +	return do_madvise(current->mm, start, len_in, behavior, false);
> >  }
> >
> >  /* Perform an madvise operation over a vector of addresses and lengths. */
> > @@ -1711,12 +1716,28 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> >  {
> >  	ssize_t ret = 0;
> >  	size_t total_len;
> > +	bool hold_lock = true;
> > +	int write;
> >
> >  	total_len = iov_iter_count(iter);
> >
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> > +		hold_lock = false;
> > +#endif
> > +	if (hold_lock) {
> > +		write = madvise_need_mmap_write(behavior);
> > +		if (write) {
> > +			if (mmap_write_lock_killable(mm))
> > +				return -EINTR;
> > +		} else {
> > +			mmap_read_lock(mm);
> > +		}
> > +	}
> > +
> >  	while (iov_iter_count(iter)) {
> >  		ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
> > -				 iter_iov_len(iter), behavior);
> > +				 iter_iov_len(iter), behavior, hold_lock);
> >  		/*
> >  		 * An madvise operation is attempting to restart the syscall,
> >  		 * but we cannot proceed as it would not be correct to repeat
> > @@ -1739,6 +1760,13 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> >  		iov_iter_advance(iter, iter_iov_len(iter));
> >  	}
> >
> > +	if (hold_lock) {
> > +		if (write)
> > +			mmap_write_unlock(mm);
> > +		else
> > +			mmap_read_unlock(mm);
> > +	}
> > +
> >  	ret = (total_len - iov_iter_count(iter)) ? : ret;
> >
> >  	return ret;
> > --
> > 2.39.5

* Re: [RFC PATCH] mm/madvise: remove redundant mmap_lock operations from process_madvise()
  2025-01-14 18:47   ` Lorenzo Stoakes
@ 2025-01-14 19:54     ` SeongJae Park
  2025-01-14 21:55     ` Shakeel Butt
  1 sibling, 0 replies; 10+ messages in thread
From: SeongJae Park @ 2025-01-14 19:54 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SeongJae Park, Shakeel Butt, David Hildenbrand, Liam.Howlett,
	Vlastimil Babka, Andrew Morton, Jens Axboe, Pavel Begunkov, damon,
	io-uring, linux-kernel, linux-mm

On Tue, 14 Jan 2025 18:47:15 +0000 Lorenzo Stoakes <[email protected]> wrote:

> On Tue, Jan 14, 2025 at 10:13:40AM -0800, Shakeel Butt wrote:
> > Ccing relevant folks.
> 
> Thanks Shakeel!

Thank you Shakeel, too!

> 
> A side-note: I really wish there were a better way to get cc'd, since I
> fundamentally changed process_madvise() recently and have been the main
> person changing this code lately - but on the other hand,
> scripts/get_maintainer.pl gets really noisy if you try to use this kind
> of stat - so I in no way blame SJ for missing me.

Yes, I always feel that finding not too many, not too few, but just the
appropriate recipients for patches is not easy.  Just FYI, by default I
use get_maintainer.pl with the --nogit option[1] and add more recipients
based on additional logic[2] derived from my past experiences and
discussions.  And then I run get_maintainer.pl without the --nogit
option if I get less response than I expected.

I will keep the Shakeel-added recipients for the next spins of this patch, anyway.

> 
> Thankfully Shakeel kindly stepped in to make me aware :)
> 
> SJ - I will come back to you later, as it's late here and my brain is fried
> - but I was already thinking of doing something _like_ this, as I noticed
> for the purposes of self-process_madvise() operations (which I unrestricted
> for guard page purposes) - we are hammering locks in a way that we know we
> don't necessarily need to do.
> 
> So this is serendipitous for me! :) But I need to dig into your actual
> implementation to give feedback here.
> 
> Will come back to this in due course :)

No worries, no rush.  Please take your time :)

[1] https://github.com/sjp38/hackermail/blob/master/src/hkml_patch_format.py#L45
[2] https://github.com/sjp38/hackermail/blob/master/src/hkml_patch_format.py#L31


Thanks,
SJ

[...]

* Re: [RFC PATCH] mm/madvise: remove redundant mmap_lock operations from process_madvise()
  2025-01-14 18:47   ` Lorenzo Stoakes
  2025-01-14 19:54     ` SeongJae Park
@ 2025-01-14 21:55     ` Shakeel Butt
  1 sibling, 0 replies; 10+ messages in thread
From: Shakeel Butt @ 2025-01-14 21:55 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: SeongJae Park, David Hildenbrand, Liam.Howlett, Vlastimil Babka,
	Andrew Morton, Jens Axboe, Pavel Begunkov, damon, io-uring,
	linux-kernel, linux-mm

On Tue, Jan 14, 2025 at 06:47:15PM +0000, Lorenzo Stoakes wrote:
> On Tue, Jan 14, 2025 at 10:13:40AM -0800, Shakeel Butt wrote:
> > Ccing relevant folks.
> 
> Thanks Shakeel!
> 
> A side-note: I really wish there were a better way to get cc'd, since I
> fundamentally changed process_madvise() recently and have been the main
> person changing this code lately - but on the other hand,
> scripts/get_maintainer.pl gets really noisy if you try to use this kind
> of stat - so I in no way blame SJ for missing me.
> 
> Thankfully Shakeel kindly stepped in to make me aware :)
> 
> SJ - I will come back to you later, as it's late here and my brain is fried
> - but I was already thinking of doing something _like_ this, as I noticed
> for the purposes of self-process_madvise() operations (which I unrestricted
> for guard page purposes) - we are hammering locks in a way that we know we
> don't necessarily need to do.
> 
> So this is serendipitous for me! :) But I need to dig into your actual
> implementation to give feedback here.
> 
> Will come back to this in due course :)
> 

SJ is planning to do a couple more optimizations, like single tree
traversal (where possible) and batching the TLB flushes for advice that
does TLB flushing.  It is better to coordinate the work than to repeat
the work orthogonally.
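
(A purely illustrative sketch of the TLB flush batching shape - one
mmu_gather spanning the whole batch instead of one flush cycle per
range - could be:

    struct mmu_gather tlb;

    tlb_gather_mmu(&tlb, mm);
    while (iov_iter_count(iter)) {
            /* zap/advise one range, accumulating flushes into 'tlb' */
    }
    tlb_finish_mmu(&tlb);

but that is just the rough idea; the actual design is SJ's to post.)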

Thanks Lorenzo for your time.

Shakeel

* Re: [RFC PATCH] mm/madvise: remove redundant mmap_lock operations from process_madvise()
  2025-01-14 18:13 ` Shakeel Butt
  2025-01-14 18:47   ` Lorenzo Stoakes
@ 2025-01-15  3:44   ` Liam R. Howlett
  2025-01-15  4:17     ` SeongJae Park
  1 sibling, 1 reply; 10+ messages in thread
From: Liam R. Howlett @ 2025-01-15  3:44 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: SeongJae Park, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Andrew Morton, Jens Axboe, Pavel Begunkov, damon,
	io-uring, linux-kernel, linux-mm

* Shakeel Butt <[email protected]> [250114 13:14]:
> Ccing relevant folks.

Thanks Shakeel.

> 
> On Fri, Jan 10, 2025 at 04:46:18PM -0800, SeongJae Park wrote:
> > > process_madvise() calls do_madvise() for each address range.  Then, each
> > > do_madvise() invocation holds and releases the same mmap_lock.  Optimize
> > > the redundant lock operations by doing the locking in process_madvise(),
> > > and inform do_madvise() that the lock is already held so that the
> > > locking can be skipped.
> > 
> > Evaluation
> > ==========
> > 
> > > I measured the time to apply MADV_DONTNEED advice to 256 MiB of memory
> > > using multiple madvise() calls, 4 KiB per call.  I also did the same
> > > with process_madvise(), but with the iovec size varying from 1 to 1024.
> > The source code for the measurement is available at GitHub[1].
> > 
> > > The measurement results are as below.  The 'sz_batches' column shows the
> > > iovec size of the process_madvise() calls.  '0' is for the madvise()
> > > calls case.  The 'before' and 'after' columns are the measured times to
> > > apply MADV_DONTNEED to the 256 MiB memory buffer in nanoseconds, on
> > > kernels built without and with this patch, respectively.  So a lower
> > > value means better efficiency.  The 'after/before' column is the ratio
> > > of 'after' to 'before'.
> > 
> >     sz_batches  before     after      after/before
> >     0           124062365  96670188   0.779206393494111
> >     1           136341258  113915688  0.835518827323714
> >     2           105314942  78898211   0.749164453796119
> >     4           82012858   59778998   0.728897875989153
> >     8           82562651   51003069   0.617749895167489
> >     16          71474930   47575960   0.665631431888076
> >     32          71391211   42902076   0.600943385033768
> >     64          68225932   41337835   0.605896230190011
> >     128         71053578   42467240   0.597679120395598
> >     256         85094126   41630463   0.489228398679364
> >     512         68531628   44049763   0.6427654542221
> >     1024        79338892   43370866   0.546653285755491
> > 
> > > The measurement shows this patch reduces the process_madvise() latency
> > > in proportion to the batch size, from about 25% with batch size 2 to
> > > about 45% with batch size 1,024.  The trend is roughly what we would
> > > expect.
> > 
> > > Interestingly, though, this patch also optimizes the madvise() and
> > > single-batch-size process_madvise() cases.  I ran this test multiple
> > > times, and the results are consistent.  I'm still investigating whether
> > > there is something I'm missing, but I believe the investigation need not
> > > block this RFC, so I'm just posting it now.  I will add updates on the
> > > madvise() and single-batch-size process_madvise() investigation later.
> > 
> > [1] https://github.com/sjp38/eval_proc_madvise
> > 
> > Signed-off-by: SeongJae Park <[email protected]>
> > ---
> >  include/linux/mm.h |  3 ++-
> >  io_uring/advise.c  |  2 +-
> >  mm/damon/vaddr.c   |  2 +-
> >  mm/madvise.c       | 54 +++++++++++++++++++++++++++++++++++-----------
> >  4 files changed, 45 insertions(+), 16 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 612b513ebfbd..e3ca5967ebd4 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3459,7 +3459,8 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  		    unsigned long end, struct list_head *uf, bool unlock);
> >  extern int do_munmap(struct mm_struct *, unsigned long, size_t,
> >  		     struct list_head *uf);
> > -extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
> > +extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> > +		int behavior, bool lock_held);

We are dropping externs where they are not needed as things change.

Also, please don't use a flag for this.  It will probably only ever have
a single user passing true.

It might be better to break do_madvise() up into more parts:
1. is_madvise_valid(), which does the checking.
2. madvise_do_behavior()

The locking type is already extracted to madvise_need_mmap_write().
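
A rough sketch of that shape (the names and split boundaries are
illustrative only):

    /* validity/behavior checks split out of do_madvise() */
    static bool is_madvise_valid(unsigned long start, size_t len_in,
                                 int behavior);
    /* the actual work; the caller holds mmap_lock */
    static int madvise_do_behavior(struct mm_struct *mm, unsigned long start,
                                   size_t len_in, int behavior);

    int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
                   int behavior)
    {
            int write = madvise_need_mmap_write(behavior);
            int error;

            if (!is_madvise_valid(start, len_in, behavior))
                    return -EINVAL;

            if (write) {
                    if (mmap_write_lock_killable(mm))
                            return -EINTR;
            } else {
                    mmap_read_lock(mm);
            }
            error = madvise_do_behavior(mm, start, len_in, behavior);
            if (write)
                    mmap_write_unlock(mm);
            else
                    mmap_read_unlock(mm);
            return error;
    }

vector_madvise() could then take the lock once itself and call
madvise_do_behavior() directly in its loop.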

What do you think?


> >  
> >  #ifdef CONFIG_MMU
> >  extern int __mm_populate(unsigned long addr, unsigned long len,
> > diff --git a/io_uring/advise.c b/io_uring/advise.c
> > index cb7b881665e5..010b55d5a26e 100644
> > --- a/io_uring/advise.c
> > +++ b/io_uring/advise.c
> > @@ -56,7 +56,7 @@ int io_madvise(struct io_kiocb *req, unsigned int issue_flags)
> >  
> >  	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
> >  
> > -	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice);
> > +	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice, false);
> >  	io_req_set_res(req, ret, 0);
> >  	return IOU_OK;
> >  #else
> > diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
> > index a6174f725bd7..30b5a251d73e 100644
> > --- a/mm/damon/vaddr.c
> > +++ b/mm/damon/vaddr.c
> > @@ -646,7 +646,7 @@ static unsigned long damos_madvise(struct damon_target *target,
> >  	if (!mm)
> >  		return 0;
> >  
> > -	applied = do_madvise(mm, start, len, behavior) ? 0 : len;
> > +	applied = do_madvise(mm, start, len, behavior, false) ? 0 : len;
> >  	mmput(mm);
> >  
> >  	return applied;
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 49f3a75046f6..c107376db9d5 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1637,7 +1637,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >   *  -EAGAIN - a kernel resource was temporarily unavailable.
> >   *  -EPERM  - memory is sealed.
> >   */
> > -int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> > +int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> > +		int behavior, bool lock_held)
> >  {
> >  	unsigned long end;
> >  	int error;
> > @@ -1668,12 +1669,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> >  		return madvise_inject_error(behavior, start, start + len_in);
> >  #endif
> >  
> > -	write = madvise_need_mmap_write(behavior);
> > -	if (write) {
> > -		if (mmap_write_lock_killable(mm))
> > -			return -EINTR;
> > -	} else {
> > -		mmap_read_lock(mm);
> > +	if (!lock_held) {
> > +		write = madvise_need_mmap_write(behavior);
> > +		if (write) {
> > +			if (mmap_write_lock_killable(mm))
> > +				return -EINTR;
> > +		} else {
> > +			mmap_read_lock(mm);
> > +		}
> >  	}
> >  
> >  	start = untagged_addr_remote(mm, start);
> > @@ -1692,17 +1695,19 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> >  	}
> >  	blk_finish_plug(&plug);
> >  
> > -	if (write)
> > -		mmap_write_unlock(mm);
> > -	else
> > -		mmap_read_unlock(mm);
> > +	if (!lock_held) {
> > +		if (write)
> > +			mmap_write_unlock(mm);
> > +		else
> > +			mmap_read_unlock(mm);
> > +	}
> >  
> >  	return error;
> >  }
> >  
> >  SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> >  {
> > -	return do_madvise(current->mm, start, len_in, behavior);
> > +	return do_madvise(current->mm, start, len_in, behavior, false);
> >  }
> >  
> >  /* Perform an madvise operation over a vector of addresses and lengths. */
> > @@ -1711,12 +1716,28 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> >  {
> >  	ssize_t ret = 0;
> >  	size_t total_len;
> > +	bool hold_lock = true;
> > +	int write;
> >  
> >  	total_len = iov_iter_count(iter);
> >  
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> > +		hold_lock = false;
> > +#endif
> > +	if (hold_lock) {
> > +		write = madvise_need_mmap_write(behavior);
> > +		if (write) {
> > +			if (mmap_write_lock_killable(mm))
> > +				return -EINTR;
> > +		} else {
> > +			mmap_read_lock(mm);
> > +		}
> > +	}
> > +
> >  	while (iov_iter_count(iter)) {
> >  		ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
> > -				 iter_iov_len(iter), behavior);
> > +				 iter_iov_len(iter), behavior, hold_lock);
> >  		/*
> >  		 * An madvise operation is attempting to restart the syscall,
> >  		 * but we cannot proceed as it would not be correct to repeat
> > @@ -1739,6 +1760,13 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
> >  		iov_iter_advance(iter, iter_iov_len(iter));
> >  	}
> >  
> > +	if (hold_lock) {
> > +		if (write)
> > +			mmap_write_unlock(mm);
> > +		else
> > +			mmap_read_unlock(mm);
> > +	}
> > +
> >  	ret = (total_len - iov_iter_count(iter)) ? : ret;
> >  
> >  	return ret;
> > -- 
> > 2.39.5
> 

* Re: [RFC PATCH] mm/madvise: remove redundant mmap_lock operations from process_madvise()
  2025-01-15  3:44   ` Liam R. Howlett
@ 2025-01-15  4:17     ` SeongJae Park
  0 siblings, 0 replies; 10+ messages in thread
From: SeongJae Park @ 2025-01-15  4:17 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: SeongJae Park, Shakeel Butt, David Hildenbrand, Lorenzo Stoakes,
	Vlastimil Babka, Andrew Morton, Jens Axboe, Pavel Begunkov, damon,
	io-uring, linux-kernel, linux-mm

On Tue, 14 Jan 2025 22:44:53 -0500 "Liam R. Howlett" <[email protected]> wrote:

> * Shakeel Butt <[email protected]> [250114 13:14]:
> > Ccing relevant folks.
> 
> Thanks Shakeel.
> 
> > 
> > On Fri, Jan 10, 2025 at 04:46:18PM -0800, SeongJae Park wrote:
> > > process_madvise() calls do_madvise() for each address range.  Then, each
> > > do_madvise() invocation holds and releases same mmap_lock.  Optimize the
> > > redundant lock operations by doing the locking in process_madvise(), and
> > > inform do_madvise() that the lock is already held and therefore can be
> > > skipped.
> > > 
> > > Evaluation
> > > ==========
> > > 
> > > I measured the time to apply MADV_DONTNEED advice to 256 MiB memory
> > > using multiple madvise() calls, 4 KiB per each call.  I also do the same
> > > with process_madvise(), but with varying iovec size from 1 to 1024.
> > > The source code for the measurement is available at GitHub[1].
> > > 
> > > The measurement results are as below.  'sz_batches' column shows the
> > > iovec size of process_madvise() calls.  '0' is for madvise() calls case.
> > > 'before' and 'after' columns are the measured time to apply
> > > MADV_DONTNEED to the 256 MiB memory buffer in nanoseconds, on kernels
> > > that built without and with this patch, respectively.  So lower value
> > > means better efficiency.  'after/before' column is the ratio of 'after'
> > > to 'before'.
> > > 
> > >     sz_batches  before     after      after/before
> > >     0           124062365  96670188   0.779206393494111
> > >     1           136341258  113915688  0.835518827323714
> > >     2           105314942  78898211   0.749164453796119
> > >     4           82012858   59778998   0.728897875989153
> > >     8           82562651   51003069   0.617749895167489
> > >     16          71474930   47575960   0.665631431888076
> > >     32          71391211   42902076   0.600943385033768
> > >     64          68225932   41337835   0.605896230190011
> > >     128         71053578   42467240   0.597679120395598
> > >     256         85094126   41630463   0.489228398679364
> > >     512         68531628   44049763   0.6427654542221
> > >     1024        79338892   43370866   0.546653285755491
> > > 
> > > The measurement shows this patch reduces the process_madvise() latency,
> > > proportional to the batching size, from about 25% with the batch size 2
> > > to about 55% with the batch size 1,024.  The trend is somewhat we can
> > > expect.
> > > 
> > > Interestingly, this patch has also optimize madvise() and single batch
> > > size process_madvise(), though.  I ran this test multiple times, but the
> > > results are consistent.  I'm still investigating if there are something
> > > I'm missing.  But I believe the investigation may not necessarily be a
> > > blocker of this RFC, so just posting this.  I will add updates of the
> > > madvise() and single batch size process_madvise() investigation later.
> > > 
> > > [1] https://github.com/sjp38/eval_proc_madvise
> > > 
> > > Signed-off-by: SeongJae Park <[email protected]>
> > > ---
> > >  include/linux/mm.h |  3 ++-
> > >  io_uring/advise.c  |  2 +-
> > >  mm/damon/vaddr.c   |  2 +-
> > >  mm/madvise.c       | 54 +++++++++++++++++++++++++++++++++++-----------
> > >  4 files changed, 45 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 612b513ebfbd..e3ca5967ebd4 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -3459,7 +3459,8 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > >  		    unsigned long end, struct list_head *uf, bool unlock);
> > >  extern int do_munmap(struct mm_struct *, unsigned long, size_t,
> > >  		     struct list_head *uf);
> > > -extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
> > > +extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> > > +		int behavior, bool lock_held);
> 
> We are dropping externs where they are not needed as things change.

Good point.  I will drop it if I end up changing something here in the
next versions.

> 
> Also, please don't use a flag for this.  It will probably only ever have
> a single user passing true.

Ok, that sounds fair.

> 
> It might be better to break do_madvise() up into more parts:
> 1. is_madvise_valid(), which does the checking.
> 2. madvise_do_behavior()
> 
> The locking type is already extracted to madvise_need_mmap_write().
> 
> What do you think?

Sounds good to me :)

I will make v2 of this patch following your suggestion, after waiting a
bit for any more possible comments.


Thanks,
SJ

[...]

* Re: [RFC PATCH] mm/madvise: remove redundant mmap_lock operations from process_madvise()
  2025-01-11  0:46 [RFC PATCH] mm/madvise: remove redundant mmap_lock operations from process_madvise() SeongJae Park
  2025-01-14 18:13 ` Shakeel Butt
@ 2025-01-15  4:35 ` Honggyu Kim
  2025-01-15  6:19   ` SeongJae Park
  1 sibling, 1 reply; 10+ messages in thread
From: Honggyu Kim @ 2025-01-15  4:35 UTC (permalink / raw)
  To: SeongJae Park
  Cc: kernel_team, Andrew Morton, Jens Axboe, Pavel Begunkov, damon,
	io-uring, linux-kernel, linux-mm, lorenzo.stoakes

Hi SeongJae,

I have a simple comment on this.

On 1/11/2025 9:46 AM, SeongJae Park wrote:
> process_madvise() calls do_madvise() for each address range.  Then, each
> do_madvise() invocation holds and releases the same mmap_lock.  Optimize
> the redundant lock operations by doing the locking in process_madvise(),
> and inform do_madvise() that the lock is already held so that the
> locking can be skipped.
>
> Evaluation
> ==========
>
> I measured the time to apply MADV_DONTNEED advice to 256 MiB of memory
> using multiple madvise() calls, 4 KiB per call.  I also did the same
> with process_madvise(), but with the iovec size varying from 1 to 1024.
> The source code for the measurement is available at GitHub[1].
>
> The measurement results are as below.  The 'sz_batches' column shows the
> iovec size of the process_madvise() calls.  '0' is for the madvise()
> calls case.  The 'before' and 'after' columns are the measured times to
> apply MADV_DONTNEED to the 256 MiB memory buffer in nanoseconds, on
> kernels built without and with this patch, respectively.  So a lower
> value means better efficiency.  The 'after/before' column is the ratio
> of 'after' to 'before'.
>
>      sz_batches  before     after      after/before
>      0           124062365  96670188   0.779206393494111
>      1           136341258  113915688  0.835518827323714
>      2           105314942  78898211   0.749164453796119
>      4           82012858   59778998   0.728897875989153
>      8           82562651   51003069   0.617749895167489
>      16          71474930   47575960   0.665631431888076
>      32          71391211   42902076   0.600943385033768
>      64          68225932   41337835   0.605896230190011
>      128         71053578   42467240   0.597679120395598
>      256         85094126   41630463   0.489228398679364
>      512         68531628   44049763   0.6427654542221
>      1024        79338892   43370866   0.546653285755491
>
> The measurement shows this patch reduces the process_madvise() latency
> in proportion to the batch size, from about 25% with batch size 2 to
> about 45% with batch size 1,024.  The trend is roughly what we would
> expect.
>
> Interestingly, though, this patch also optimizes the madvise() and
> single-batch-size process_madvise() cases.  I ran this test multiple
> times, and the results are consistent.  I'm still investigating whether
> there is something I'm missing, but I believe the investigation need not
> block this RFC, so I'm just posting it now.  I will add updates on the
> madvise() and single-batch-size process_madvise() investigation later.
>
> [1] https://github.com/sjp38/eval_proc_madvise
>
> Signed-off-by: SeongJae Park <[email protected]>
> ---
>   include/linux/mm.h |  3 ++-
>   io_uring/advise.c  |  2 +-
>   mm/damon/vaddr.c   |  2 +-
>   mm/madvise.c       | 54 +++++++++++++++++++++++++++++++++++-----------
>   4 files changed, 45 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 612b513ebfbd..e3ca5967ebd4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3459,7 +3459,8 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>   		    unsigned long end, struct list_head *uf, bool unlock);
>   extern int do_munmap(struct mm_struct *, unsigned long, size_t,
>   		     struct list_head *uf);
> -extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
> +extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> +		int behavior, bool lock_held);
>   
>   #ifdef CONFIG_MMU
>   extern int __mm_populate(unsigned long addr, unsigned long len,
> diff --git a/io_uring/advise.c b/io_uring/advise.c
> index cb7b881665e5..010b55d5a26e 100644
> --- a/io_uring/advise.c
> +++ b/io_uring/advise.c
> @@ -56,7 +56,7 @@ int io_madvise(struct io_kiocb *req, unsigned int issue_flags)
>   
>   	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
>   
> -	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice);
> +	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice, false);

I feel like this doesn't look good in terms of readability. Can we 
introduce an enum for this?
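
For example, something like the below (names illustrative only):

    enum madvise_lock_mode {
            MADVISE_TAKE_LOCK,
            MADVISE_LOCK_HELD,
    };

    ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice,
                     MADVISE_TAKE_LOCK);

so that the call sites document themselves.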

>   	io_req_set_res(req, ret, 0);
>   	return IOU_OK;
>   #else
> diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
> index a6174f725bd7..30b5a251d73e 100644
> --- a/mm/damon/vaddr.c
> +++ b/mm/damon/vaddr.c
> @@ -646,7 +646,7 @@ static unsigned long damos_madvise(struct damon_target *target,
>   	if (!mm)
>   		return 0;
>   
> -	applied = do_madvise(mm, start, len, behavior) ? 0 : len;
> +	applied = do_madvise(mm, start, len, behavior, false) ? 0 : len;
>   	mmput(mm);
>   
>   	return applied;
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 49f3a75046f6..c107376db9d5 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1637,7 +1637,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>    *  -EAGAIN - a kernel resource was temporarily unavailable.
>    *  -EPERM  - memory is sealed.
>    */
> -int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> +int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> +		int behavior, bool lock_held)
>   {
>   	unsigned long end;
>   	int error;
> @@ -1668,12 +1669,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>   		return madvise_inject_error(behavior, start, start + len_in);
>   #endif
>   
> -	write = madvise_need_mmap_write(behavior);
> -	if (write) {
> -		if (mmap_write_lock_killable(mm))
> -			return -EINTR;
> -	} else {
> -		mmap_read_lock(mm);
> +	if (!lock_held) {
> +		write = madvise_need_mmap_write(behavior);
> +		if (write) {
> +			if (mmap_write_lock_killable(mm))
> +				return -EINTR;
> +		} else {
> +			mmap_read_lock(mm);
> +		}
>   	}
>   
>   	start = untagged_addr_remote(mm, start);
> @@ -1692,17 +1695,19 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
>   	}
>   	blk_finish_plug(&plug);
>   
> -	if (write)
> -		mmap_write_unlock(mm);
> -	else
> -		mmap_read_unlock(mm);
> +	if (!lock_held) {
> +		if (write)
> +			mmap_write_unlock(mm);
> +		else
> +			mmap_read_unlock(mm);
> +	}
>   
>   	return error;
>   }
>   
>   SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>   {
> -	return do_madvise(current->mm, start, len_in, behavior);
> +	return do_madvise(current->mm, start, len_in, behavior, false);
>   }
>   
>   /* Perform an madvise operation over a vector of addresses and lengths. */
> @@ -1711,12 +1716,28 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>   {
>   	ssize_t ret = 0;
>   	size_t total_len;
> +	bool hold_lock = true;
> +	int write;
>   
>   	total_len = iov_iter_count(iter);
>   
> +#ifdef CONFIG_MEMORY_FAILURE
> +	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> +		hold_lock = false;
> +#endif
> +	if (hold_lock) {
> +		write = madvise_need_mmap_write(behavior);
> +		if (write) {
> +			if (mmap_write_lock_killable(mm))
> +				return -EINTR;
> +		} else {
> +			mmap_read_lock(mm);
> +		}
> +	}
> +
>   	while (iov_iter_count(iter)) {
>   		ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
> -				 iter_iov_len(iter), behavior);
> +				 iter_iov_len(iter), behavior, hold_lock);
>   		/*
>   		 * An madvise operation is attempting to restart the syscall,
>   		 * but we cannot proceed as it would not be correct to repeat
> @@ -1739,6 +1760,13 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
>   		iov_iter_advance(iter, iter_iov_len(iter));
>   	}
>   
> +	if (hold_lock) {
> +		if (write)
> +			mmap_write_unlock(mm);
> +		else
> +			mmap_read_unlock(mm);
> +	}
> +
>   	ret = (total_len - iov_iter_count(iter)) ? : ret;
>   
>   	return ret;

* Re: [RFC PATCH] mm/madvise: remove redundant mmap_lock operations from process_madvise()
  2025-01-15  4:35 ` Honggyu Kim
@ 2025-01-15  6:19   ` SeongJae Park
  2025-01-15  7:15     ` Honggyu Kim
  0 siblings, 1 reply; 10+ messages in thread
From: SeongJae Park @ 2025-01-15  6:19 UTC (permalink / raw)
  To: Honggyu Kim
  Cc: SeongJae Park, kernel_team, Andrew Morton, Jens Axboe,
	Pavel Begunkov, damon, io-uring, linux-kernel, linux-mm,
	lorenzo.stoakes

Hi Honggyu,

On Wed, 15 Jan 2025 13:35:48 +0900 Honggyu Kim <[email protected]> wrote:

> Hi SeongJae,
> 
> I have a simple comment on this.
> 
> On 1/11/2025 9:46 AM, SeongJae Park wrote:
> > process_madvise() calls do_madvise() for each address range.  Then, each
> > do_madvise() invocation holds and releases the same mmap_lock.  Optimize
> > the redundant lock operations by doing the locking in process_madvise(),
> > and inform do_madvise() that the lock is already held so that the
> > locking can be skipped.
[...]
> > ---
> >   include/linux/mm.h |  3 ++-
> >   io_uring/advise.c  |  2 +-
> >   mm/damon/vaddr.c   |  2 +-
> >   mm/madvise.c       | 54 +++++++++++++++++++++++++++++++++++-----------
> >   4 files changed, 45 insertions(+), 16 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 612b513ebfbd..e3ca5967ebd4 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3459,7 +3459,8 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >   		    unsigned long end, struct list_head *uf, bool unlock);
> >   extern int do_munmap(struct mm_struct *, unsigned long, size_t,
> >   		     struct list_head *uf);
> > -extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
> > +extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> > +		int behavior, bool lock_held);
> >   
> >   #ifdef CONFIG_MMU
> >   extern int __mm_populate(unsigned long addr, unsigned long len,
> > diff --git a/io_uring/advise.c b/io_uring/advise.c
> > index cb7b881665e5..010b55d5a26e 100644
> > --- a/io_uring/advise.c
> > +++ b/io_uring/advise.c
> > @@ -56,7 +56,7 @@ int io_madvise(struct io_kiocb *req, unsigned int issue_flags)
> >   
> >   	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
> >   
> > -	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice);
> > +	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice, false);
> 
> I feel like this doesn't look good in terms of readability. Can we 
> introduce an enum for this?

I agree that's not good to read.  Liam also pointed out a similar issue but
suggested splitting functions with clear names[1].  I think that also fairly
improves readability, and I slightly prefer that way, since it wouldn't
introduce a new type for only a single use case.  Would that also work for your
concern, or do you have a different opinion?

[1] https://lore.kernel.org/[email protected]


Thanks,
SJ

[...]

* Re: [RFC PATCH] mm/madvise: remove redundant mmap_lock operations from process_madvise()
  2025-01-15  6:19   ` SeongJae Park
@ 2025-01-15  7:15     ` Honggyu Kim
  0 siblings, 0 replies; 10+ messages in thread
From: Honggyu Kim @ 2025-01-15  7:15 UTC (permalink / raw)
  To: SeongJae Park
  Cc: kernel_team, Andrew Morton, Jens Axboe, Pavel Begunkov, damon,
	io-uring, linux-kernel, linux-mm, lorenzo.stoakes, Liam.Howlett

Hi SeongJae,

I'm resending this because my new mail client mistakenly set the mail
format to HTML. Sorry for the noise.

On 1/15/2025 3:19 PM, SeongJae Park wrote:
> Hi Honggyu,
> 
> On Wed, 15 Jan 2025 13:35:48 +0900 Honggyu Kim <[email protected]> wrote:
> 
>> Hi SeongJae,
>>
>> I have a simple comment on this.
>>
>> On 1/11/2025 9:46 AM, SeongJae Park wrote:
>>> process_madvise() calls do_madvise() for each address range.  Then, each
>>> do_madvise() invocation holds and releases the same mmap_lock.  Optimize
>>> the redundant lock operations by doing the locking in process_madvise(),
>>> and inform do_madvise() that the lock is already held so that the
>>> locking can be skipped.
> [...]
>>> ---
>>>    include/linux/mm.h |  3 ++-
>>>    io_uring/advise.c  |  2 +-
>>>    mm/damon/vaddr.c   |  2 +-
>>>    mm/madvise.c       | 54 +++++++++++++++++++++++++++++++++++-----------
>>>    4 files changed, 45 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 612b513ebfbd..e3ca5967ebd4 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -3459,7 +3459,8 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>>>    		    unsigned long end, struct list_head *uf, bool unlock);
>>>    extern int do_munmap(struct mm_struct *, unsigned long, size_t,
>>>    		     struct list_head *uf);
>>> -extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
>>> +extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
>>> +		int behavior, bool lock_held);
>>>    
>>>    #ifdef CONFIG_MMU
>>>    extern int __mm_populate(unsigned long addr, unsigned long len,
>>> diff --git a/io_uring/advise.c b/io_uring/advise.c
>>> index cb7b881665e5..010b55d5a26e 100644
>>> --- a/io_uring/advise.c
>>> +++ b/io_uring/advise.c
>>> @@ -56,7 +56,7 @@ int io_madvise(struct io_kiocb *req, unsigned int issue_flags)
>>>    
>>>    	WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
>>>    
>>> -	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice);
>>> +	ret = do_madvise(current->mm, ma->addr, ma->len, ma->advice, false);
>>
>> I feel like this doesn't look good in terms of readability. Can we
>> introduce an enum for this?
> 
> I agree that's not good to read.  Liam also pointed out a similar issue but

Right. I didn't carefully read his comment. Thanks for the info.

> suggested splitting functions with clear names[1].  I think that also fairly
> improves readability, and I slightly prefer that way, since it wouldn't
> introduce a new type for only a single use case.  Would that also work for your
> concern, or do you have a different opinion?
> 
> [1] https://lore.kernel.org/[email protected]

I don't have any other concerns.

Thanks,
Honggyu

> Thanks,
> SJ
> 
> [...]

