* [RFC PATCH v3 01/18] block: Add check for async buffered writes to generic_write_checks
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
@ 2022-05-18 23:36 ` Stefan Roesch
2022-05-19 8:17 ` Christoph Hellwig
2022-05-18 23:36 ` [RFC PATCH v3 02/18] iomap: Add iomap_page_create_gfp to allocate iomap_pages Stefan Roesch
` (16 subsequent siblings)
17 siblings, 1 reply; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:36 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
This introduces the flag FMODE_BUF_WASYNC. A file system can set this
flag to indicate that it supports async buffered writes. It also
modifies the check in generic_write_checks to take async buffered
writes into consideration.
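As an illustration, a file system whose buffered write path can hand
back -EAGAIN would advertise support from its open routine, along the
lines of this sketch (myfs_file_open is a hypothetical example, not
part of this patch):

	static int myfs_file_open(struct inode *inode, struct file *file)
	{
		/* opt in to NOWAIT I/O and async buffered writes */
		file->f_mode |= FMODE_NOWAIT | FMODE_BUF_WASYNC;
		return generic_file_open(inode, file);
	}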
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/read_write.c | 4 +++-
include/linux/fs.h | 3 +++
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index e643aec2b0ef..544d4df33f4f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1633,7 +1633,9 @@ int generic_write_checks_count(struct kiocb *iocb, loff_t *count)
if (iocb->ki_flags & IOCB_APPEND)
iocb->ki_pos = i_size_read(inode);
- if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+ if ((iocb->ki_flags & IOCB_NOWAIT) &&
+ !((iocb->ki_flags & IOCB_DIRECT) ||
+ (file->f_mode & FMODE_BUF_WASYNC)))
return -EINVAL;
return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bbde95387a23..3b479d02e210 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -177,6 +177,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File supports async buffered reads */
#define FMODE_BUF_RASYNC ((__force fmode_t)0x40000000)
+/* File supports async nowait buffered writes */
+#define FMODE_BUF_WASYNC ((__force fmode_t)0x80000000)
+
/*
* Attribute flags. These should be or-ed together to figure out what
* has been changed!
--
2.30.2
* Re: [RFC PATCH v3 01/18] block: Add check for async buffered writes to generic_write_checks
2022-05-18 23:36 ` [RFC PATCH v3 01/18] block: Add check for async buffered writes to generic_write_checks Stefan Roesch
@ 2022-05-19 8:17 ` Christoph Hellwig
2022-05-20 18:23 ` Stefan Roesch
0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2022-05-19 8:17 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack
On Wed, May 18, 2022 at 04:36:52PM -0700, Stefan Roesch wrote:
> @@ -1633,7 +1633,9 @@ int generic_write_checks_count(struct kiocb *iocb, loff_t *count)
> if (iocb->ki_flags & IOCB_APPEND)
> iocb->ki_pos = i_size_read(inode);
>
> - if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
> + if ((iocb->ki_flags & IOCB_NOWAIT) &&
> + !((iocb->ki_flags & IOCB_DIRECT) ||
> + (file->f_mode & FMODE_BUF_WASYNC)))
This is some really odd indentation. I'd expect something like:
	if ((iocb->ki_flags & IOCB_NOWAIT) &&
	    !((iocb->ki_flags & IOCB_DIRECT) ||
	      (file->f_mode & FMODE_BUF_WASYNC)))
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index bbde95387a23..3b479d02e210 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -177,6 +177,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
> /* File supports async buffered reads */
> #define FMODE_BUF_RASYNC ((__force fmode_t)0x40000000)
>
> +/* File supports async nowait buffered writes */
> +#define FMODE_BUF_WASYNC ((__force fmode_t)0x80000000)
This is the last available flag in fmode_t.
At some point we should probably move the static capabilities to
a member of file_operations.
* Re: [RFC PATCH v3 01/18] block: Add check for async buffered writes to generic_write_checks
2022-05-19 8:17 ` Christoph Hellwig
@ 2022-05-20 18:23 ` Stefan Roesch
0 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-20 18:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack
On 5/19/22 1:17 AM, Christoph Hellwig wrote:
> On Wed, May 18, 2022 at 04:36:52PM -0700, Stefan Roesch wrote:
>> @@ -1633,7 +1633,9 @@ int generic_write_checks_count(struct kiocb *iocb, loff_t *count)
>> if (iocb->ki_flags & IOCB_APPEND)
>> iocb->ki_pos = i_size_read(inode);
>>
>> - if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
>> + if ((iocb->ki_flags & IOCB_NOWAIT) &&
>> + !((iocb->ki_flags & IOCB_DIRECT) ||
>> + (file->f_mode & FMODE_BUF_WASYNC)))
>
> This is some really odd indentation. I'd expect something like:
>
> 	if ((iocb->ki_flags & IOCB_NOWAIT) &&
> 	    !((iocb->ki_flags & IOCB_DIRECT) ||
> 	      (file->f_mode & FMODE_BUF_WASYNC)))
>
I reformatted the above code for the next version.
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index bbde95387a23..3b479d02e210 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -177,6 +177,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>> /* File supports async buffered reads */
>> #define FMODE_BUF_RASYNC ((__force fmode_t)0x40000000)
>>
>> +/* File supports async nowait buffered writes */
>> +#define FMODE_BUF_WASYNC ((__force fmode_t)0x80000000)
>
> This is the last available flag in fmode_t.
>
> At some point we should probably move the static capabilities to
> a member of file_operations.
* [RFC PATCH v3 02/18] iomap: Add iomap_page_create_gfp to allocate iomap_pages
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
2022-05-18 23:36 ` [RFC PATCH v3 01/18] block: Add check for async buffered writes to generic_write_checks Stefan Roesch
@ 2022-05-18 23:36 ` Stefan Roesch
2022-05-19 8:18 ` Christoph Hellwig
2022-05-18 23:36 ` [RFC PATCH v3 03/18] iomap: Use iomap_page_create_gfp() in __iomap_write_begin Stefan Roesch
` (15 subsequent siblings)
17 siblings, 1 reply; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:36 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
Add the function iomap_page_create_gfp(), which allows the caller to
specify the gfp allocation flags and to pass in the number of blocks
per folio.
No intended functional changes in this patch.
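For example, a caller that must not block can request a nowait
allocation and bail out on failure; a sketch of the intended usage
(patch 04 of this series wires this up in __iomap_write_begin):

	iop = iomap_page_create_gfp(inode, folio, nr_blocks, GFP_NOWAIT);
	if (!iop)
		return -EAGAIN;	/* punt to a context that can block */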
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/iomap/buffered-io.c | 34 ++++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8ce8720093b9..85aa32f50db0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -43,17 +43,27 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
static struct bio_set iomap_ioend_bioset;
+/**
+ * iomap_page_create_gfp - Create and initialize iomap_page for folio.
+ * @inode: Pointer to inode
+ * @folio: Pointer to folio
+ * @nr_blocks: Number of blocks in the folio
+ * @gfp: gfp allocation flags
+ *
+ * This function returns a newly allocated iomap_page for the folio with
+ * the settings specified in the gfp parameter.
+ *
+ */
static struct iomap_page *
-iomap_page_create(struct inode *inode, struct folio *folio)
+iomap_page_create_gfp(struct inode *inode, struct folio *folio,
+ unsigned int nr_blocks, gfp_t gfp)
{
- struct iomap_page *iop = to_iomap_page(folio);
- unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+ struct iomap_page *iop;
- if (iop || nr_blocks <= 1)
+ iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)), gfp);
+ if (!iop)
return iop;
- iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
- GFP_NOFS | __GFP_NOFAIL);
spin_lock_init(&iop->uptodate_lock);
if (folio_test_uptodate(folio))
bitmap_fill(iop->uptodate, nr_blocks);
@@ -61,6 +71,18 @@ iomap_page_create(struct inode *inode, struct folio *folio)
return iop;
}
+static struct iomap_page *
+iomap_page_create(struct inode *inode, struct folio *folio)
+{
+ struct iomap_page *iop = to_iomap_page(folio);
+ unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+
+ if (iop || nr_blocks <= 1)
+ return iop;
+
+ return iomap_page_create_gfp(inode, folio, nr_blocks, GFP_NOFS | __GFP_NOFAIL);
+}
+
static void iomap_page_release(struct folio *folio)
{
struct iomap_page *iop = folio_detach_private(folio);
--
2.30.2
* Re: [RFC PATCH v3 02/18] iomap: Add iomap_page_create_gfp to allocate iomap_pages
2022-05-18 23:36 ` [RFC PATCH v3 02/18] iomap: Add iomap_page_create_gfp to allocate iomap_pages Stefan Roesch
@ 2022-05-19 8:18 ` Christoph Hellwig
2022-05-20 18:25 ` Stefan Roesch
0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2022-05-19 8:18 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack
> + * This function returns a newly allocated iomap_page for the folio with
> + * the settings specified in the gfp parameter.
> + *
> + */
> static struct iomap_page *
> -iomap_page_create(struct inode *inode, struct folio *folio)
> +iomap_page_create_gfp(struct inode *inode, struct folio *folio,
> + unsigned int nr_blocks, gfp_t gfp)
> {
> - struct iomap_page *iop = to_iomap_page(folio);
> - unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> + struct iomap_page *iop;
>
> - if (iop || nr_blocks <= 1)
> + iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)), gfp);
> + if (!iop)
> return iop;
>
> - iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
> - GFP_NOFS | __GFP_NOFAIL);
> spin_lock_init(&iop->uptodate_lock);
> if (folio_test_uptodate(folio))
> bitmap_fill(iop->uptodate, nr_blocks);
> @@ -61,6 +71,18 @@ iomap_page_create(struct inode *inode, struct folio *folio)
> return iop;
> }
>
> +static struct iomap_page *
> +iomap_page_create(struct inode *inode, struct folio *folio)
> +{
> + struct iomap_page *iop = to_iomap_page(folio);
> + unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> +
> + if (iop || nr_blocks <= 1)
> + return iop;
> +
> + return iomap_page_create_gfp(inode, folio, nr_blocks, GFP_NOFS | __GFP_NOFAIL);
Overly long line here.
More importantly, why do you need a helper that does not do the number
of blocks check? Why can't we just pass a gfp_t to iomap_page_create?
* Re: [RFC PATCH v3 02/18] iomap: Add iomap_page_create_gfp to allocate iomap_pages
2022-05-19 8:18 ` Christoph Hellwig
@ 2022-05-20 18:25 ` Stefan Roesch
0 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-20 18:25 UTC (permalink / raw)
To: Christoph Hellwig
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack
On 5/19/22 1:18 AM, Christoph Hellwig wrote:
>> + * This function returns a newly allocated iomap_page for the folio with
>> + * the settings specified in the gfp parameter.
>> + *
>> + */
>> static struct iomap_page *
>> -iomap_page_create(struct inode *inode, struct folio *folio)
>> +iomap_page_create_gfp(struct inode *inode, struct folio *folio,
>> + unsigned int nr_blocks, gfp_t gfp)
>> {
>> - struct iomap_page *iop = to_iomap_page(folio);
>> - unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
>> + struct iomap_page *iop;
>>
>> - if (iop || nr_blocks <= 1)
>> + iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)), gfp);
>> + if (!iop)
>> return iop;
>>
>> - iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
>> - GFP_NOFS | __GFP_NOFAIL);
>> spin_lock_init(&iop->uptodate_lock);
>> if (folio_test_uptodate(folio))
>> bitmap_fill(iop->uptodate, nr_blocks);
>> @@ -61,6 +71,18 @@ iomap_page_create(struct inode *inode, struct folio *folio)
>> return iop;
>> }
>>
>> +static struct iomap_page *
>> +iomap_page_create(struct inode *inode, struct folio *folio)
>> +{
>> + struct iomap_page *iop = to_iomap_page(folio);
>> + unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
>> +
>> + if (iop || nr_blocks <= 1)
>> + return iop;
>> +
>> + return iomap_page_create_gfp(inode, folio, nr_blocks, GFP_NOFS | __GFP_NOFAIL);
>
> Overly long line here.
>
> More importantly, why do you need a helper that does not do the number
> of blocks check? Why can't we just pass a gfp_t to iomap_page_create?
The next version removes iomap_page_create_gfp() and adds the gfp flag to
iomap_page_create.
* [RFC PATCH v3 03/18] iomap: Use iomap_page_create_gfp() in __iomap_write_begin
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
2022-05-18 23:36 ` [RFC PATCH v3 01/18] block: Add check for async buffered writes to generic_write_checks Stefan Roesch
2022-05-18 23:36 ` [RFC PATCH v3 02/18] iomap: Add iomap_page_create_gfp to allocate iomap_pages Stefan Roesch
@ 2022-05-18 23:36 ` Stefan Roesch
2022-05-19 8:19 ` Christoph Hellwig
2022-05-18 23:36 ` [RFC PATCH v3 04/18] iomap: Add async buffered write support Stefan Roesch
` (14 subsequent siblings)
17 siblings, 1 reply; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:36 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
This change uses the new iomap_page_create_gfp() function in the
function __iomap_write_begin().
No intended functional changes in this patch.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/iomap/buffered-io.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 85aa32f50db0..6b06fd358958 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -572,17 +572,21 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
size_t len, struct folio *folio)
{
const struct iomap *srcmap = iomap_iter_srcmap(iter);
- struct iomap_page *iop = iomap_page_create(iter->inode, folio);
+ struct iomap_page *iop = to_iomap_page(folio);
loff_t block_size = i_blocksize(iter->inode);
loff_t block_start = round_down(pos, block_size);
loff_t block_end = round_up(pos + len, block_size);
+ unsigned int nr_blocks = i_blocks_per_folio(iter->inode, folio);
size_t from = offset_in_folio(folio, pos), to = from + len;
size_t poff, plen;
+ gfp_t gfp = GFP_NOFS | __GFP_NOFAIL;
if (folio_test_uptodate(folio))
return 0;
folio_clear_error(folio);
+ iop = iomap_page_create_gfp(iter->inode, folio, nr_blocks, gfp);
+
do {
iomap_adjust_read_range(iter->inode, folio, &block_start,
block_end - block_start, &poff, &plen);
--
2.30.2
* Re: [RFC PATCH v3 03/18] iomap: Use iomap_page_create_gfp() in __iomap_write_begin
2022-05-18 23:36 ` [RFC PATCH v3 03/18] iomap: Use iomap_page_create_gfp() in __iomap_write_begin Stefan Roesch
@ 2022-05-19 8:19 ` Christoph Hellwig
2022-05-20 18:26 ` Stefan Roesch
0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2022-05-19 8:19 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack
On Wed, May 18, 2022 at 04:36:54PM -0700, Stefan Roesch wrote:
> This change uses the new iomap_page_create_gfp() function in the
> function __iomap_write_begin().
.. and this now loses the check if we actually need the creation,
see my previous comment.
* Re: [RFC PATCH v3 03/18] iomap: Use iomap_page_create_gfp() in __iomap_write_begin
2022-05-19 8:19 ` Christoph Hellwig
@ 2022-05-20 18:26 ` Stefan Roesch
0 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-20 18:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack
On 5/19/22 1:19 AM, Christoph Hellwig wrote:
> On Wed, May 18, 2022 at 04:36:54PM -0700, Stefan Roesch wrote:
>> This change uses the new iomap_page_create_gfp() function in the
>> function __iomap_write_begin().
>
> .. and this now loses the check if we actually need the creation,
> see my previous comment.
The new version avoids this problem by removing iomap_page_create_gfp()
and adding a gfp flag to iomap_page_create().
* [RFC PATCH v3 04/18] iomap: Add async buffered write support
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (2 preceding siblings ...)
2022-05-18 23:36 ` [RFC PATCH v3 03/18] iomap: Use iomap_page_create_gfp() in __iomap_write_begin Stefan Roesch
@ 2022-05-18 23:36 ` Stefan Roesch
2022-05-19 8:25 ` Christoph Hellwig
2022-05-18 23:36 ` [RFC PATCH v3 05/18] xfs: Add iomap " Stefan Roesch
` (13 subsequent siblings)
17 siblings, 1 reply; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:36 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
This adds async buffered write support to iomap. The support is focused
on the changes necessary to support XFS with iomap.
Support for other filesystems might require additional changes.
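The caller-visible contract, sketched under the assumption that the
caller sets IOCB_NOWAIT (as io_uring does on its fast path; ops stands
in for the filesystem's iomap_ops):

	iocb->ki_flags |= IOCB_NOWAIT;
	ret = iomap_file_buffered_write(iocb, from, ops);
	if (ret == -EAGAIN) {
		/* would have blocked: retry from a context that may sleep */
	}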
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/iomap/buffered-io.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6b06fd358958..b029e2b10e07 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -580,12 +580,18 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
size_t from = offset_in_folio(folio, pos), to = from + len;
size_t poff, plen;
gfp_t gfp = GFP_NOFS | __GFP_NOFAIL;
+ bool no_wait = (iter->flags & IOMAP_NOWAIT);
+
+ if (no_wait)
+ gfp = GFP_NOWAIT;
if (folio_test_uptodate(folio))
return 0;
folio_clear_error(folio);
iop = iomap_page_create_gfp(iter->inode, folio, nr_blocks, gfp);
+ if (no_wait && !iop)
+ return -EAGAIN;
do {
iomap_adjust_read_range(iter->inode, folio, &block_start,
@@ -602,6 +608,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE))
return -EIO;
folio_zero_segments(folio, poff, from, to, poff + plen);
+ } else if (no_wait) {
+ return -EAGAIN;
} else {
int status = iomap_read_folio_sync(block_start, folio,
poff, plen, srcmap);
@@ -632,6 +640,9 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
int status = 0;
+ if (iter->flags & IOMAP_NOWAIT)
+ fgp |= FGP_NOWAIT;
+
BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
if (srcmap != &iter->iomap)
BUG_ON(pos + len > srcmap->offset + srcmap->length);
@@ -789,6 +800,10 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
* Otherwise there's a nasty deadlock on copying from the
* same page as we're writing to, without it being marked
* up-to-date.
+ *
+ * For async buffered writes the assumption is that the user
+ * page has already been faulted in. This can be optimized by
+ * faulting the user page in the prepare phase of io-uring.
*/
if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
status = -EFAULT;
@@ -844,6 +859,9 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
};
int ret;
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ iter.flags |= IOMAP_NOWAIT;
+
while ((ret = iomap_iter(&iter, ops)) > 0)
iter.processed = iomap_write_iter(&iter, i);
if (iter.pos == iocb->ki_pos)
--
2.30.2
* Re: [RFC PATCH v3 04/18] iomap: Add async buffered write support
2022-05-18 23:36 ` [RFC PATCH v3 04/18] iomap: Add async buffered write support Stefan Roesch
@ 2022-05-19 8:25 ` Christoph Hellwig
2022-05-20 18:29 ` Stefan Roesch
0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2022-05-19 8:25 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack
On Wed, May 18, 2022 at 04:36:55PM -0700, Stefan Roesch wrote:
> This adds async buffered write support to iomap. The support is focused
> on the changes necessary to support XFS with iomap.
>
> Support for other filesystems might require additional changes.
What would those other changes be? Inline data support should not
matter here, so I guess it is buffer_head support? Please spell out
the actual limitations instead of the use case. Preferably including
asserts in the code to catch the case of a file system trying to use
the not supported cases.
>
> Signed-off-by: Stefan Roesch <[email protected]>
> ---
> fs/iomap/buffered-io.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 6b06fd358958..b029e2b10e07 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -580,12 +580,18 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> size_t from = offset_in_folio(folio, pos), to = from + len;
> size_t poff, plen;
> gfp_t gfp = GFP_NOFS | __GFP_NOFAIL;
> + bool no_wait = (iter->flags & IOMAP_NOWAIT);
> +
> + if (no_wait)
Does this flag really buy us anything? My preference would be to see
the IOMAP_NOWAIT directly as that is easier for me to read than trying
to figure out what no_wait actually means.
> + gfp = GFP_NOWAIT;
>
> if (folio_test_uptodate(folio))
> return 0;
> folio_clear_error(folio);
>
> iop = iomap_page_create_gfp(iter->inode, folio, nr_blocks, gfp);
And maybe the better iomap_page_create interface would be one that
passes the flags so that we can centralize the gfp_t selection.
> @@ -602,6 +608,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE))
> return -EIO;
> folio_zero_segments(folio, poff, from, to, poff + plen);
> + } else if (no_wait) {
> + return -EAGAIN;
> } else {
> int status = iomap_read_folio_sync(block_start, folio,
> poff, plen, srcmap);
That's a somewhat unnatural code flow. I'd much prefer:
	} else {
		int status;

		if (iter->flags & IOMAP_NOWAIT)
			return -EAGAIN;
		status = iomap_read_folio_sync(block_start, folio,
				poff, plen, srcmap);
Or maybe even pass the iter to iomap_read_folio_sync and just do the
IOMAP_NOWAIT check there.
* Re: [RFC PATCH v3 04/18] iomap: Add async buffered write support
2022-05-19 8:25 ` Christoph Hellwig
@ 2022-05-20 18:29 ` Stefan Roesch
0 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-20 18:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack
On 5/19/22 1:25 AM, Christoph Hellwig wrote:
> On Wed, May 18, 2022 at 04:36:55PM -0700, Stefan Roesch wrote:
>> This adds async buffered write support to iomap. The support is focused
>> on the changes necessary to support XFS with iomap.
>>
>> Support for other filesystems might require additional changes.
>
> What would those other changes be? Inline data support should not
> matter here, so I guess it is buffer_head support? Please spell out
> the actual limitations instead of the use case. Preferably including
> asserts in the code to catch the case of a file system trying to use
> the not supported cases.
>
I was only trying to make the point that I haven't enabled this on
filesystems other than XFS. Removing the statement as it causes confusion.
>>
>> Signed-off-by: Stefan Roesch <[email protected]>
>> ---
>> fs/iomap/buffered-io.c | 18 ++++++++++++++++++
>> 1 file changed, 18 insertions(+)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 6b06fd358958..b029e2b10e07 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -580,12 +580,18 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>> size_t from = offset_in_folio(folio, pos), to = from + len;
>> size_t poff, plen;
>> gfp_t gfp = GFP_NOFS | __GFP_NOFAIL;
>> + bool no_wait = (iter->flags & IOMAP_NOWAIT);
>> +
>> + if (no_wait)
>
> Does this flag really buy us anything? My preference would be to see
> the IOMAP_NOWAIT directly as that is easier for me to read than trying
> to figure out what no_wait actually means.
>
Removed the no_wait variable and instead used the flag check directly in the code.
>> + gfp = GFP_NOWAIT;
>>
>> if (folio_test_uptodate(folio))
>> return 0;
>> folio_clear_error(folio);
>>
>> iop = iomap_page_create_gfp(iter->inode, folio, nr_blocks, gfp);
>
> And maybe the better iomap_page_create interface would be one that
> passes the flags so that we can centralize the gfp_t selection.
>
>> @@ -602,6 +608,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>> if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE))
>> return -EIO;
>> folio_zero_segments(folio, poff, from, to, poff + plen);
>> + } else if (no_wait) {
>> + return -EAGAIN;
>> } else {
>> int status = iomap_read_folio_sync(block_start, folio,
>> poff, plen, srcmap);
>
> That's a somewhat unnatural code flow. I'd much prefer:
>
I made the below change.
> 	} else {
> 		int status;
>
> 		if (iter->flags & IOMAP_NOWAIT)
> 			return -EAGAIN;
> 		status = iomap_read_folio_sync(block_start, folio,
> 				poff, plen, srcmap);
>
> Or maybe even pass the iter to iomap_read_folio_sync and just do the
> IOMAP_NOWAIT check there.
* [RFC PATCH v3 05/18] xfs: Add iomap async buffered write support
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (3 preceding siblings ...)
2022-05-18 23:36 ` [RFC PATCH v3 04/18] iomap: Add async buffered write support Stefan Roesch
@ 2022-05-18 23:36 ` Stefan Roesch
2022-05-18 23:36 ` [RFC PATCH v3 06/18] fs: Split off file_needs_remove_privs() and __file_remove_privs() Stefan Roesch
` (12 subsequent siblings)
17 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:36 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
This adds the async buffered write support to the iomap layer of XFS. If
a lock cannot be acquired or additional reads need to be performed, the
request will return -EAGAIN in case this is an async buffered write
request.
This patch changes the helper function xfs_ilock_for_iomap so that the
lock mode is passed in by the caller.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/xfs/xfs_iomap.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index e552ce541ec2..1aea962262ad 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -659,7 +659,7 @@ xfs_ilock_for_iomap(
unsigned flags,
unsigned *lockmode)
{
- unsigned mode = XFS_ILOCK_SHARED;
+ unsigned int mode = *lockmode;
bool is_write = flags & (IOMAP_WRITE | IOMAP_ZERO);
/*
@@ -737,7 +737,7 @@ xfs_direct_write_iomap_begin(
int nimaps = 1, error = 0;
bool shared = false;
u16 iomap_flags = 0;
- unsigned lockmode;
+ unsigned int lockmode = XFS_ILOCK_SHARED;
ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));
@@ -881,18 +881,22 @@ xfs_buffered_write_iomap_begin(
bool eof = false, cow_eof = false, shared = false;
int allocfork = XFS_DATA_FORK;
int error = 0;
+ unsigned int lockmode = XFS_ILOCK_EXCL;
if (xfs_is_shutdown(mp))
return -EIO;
/* we can't use delayed allocations when using extent size hints */
- if (xfs_get_extsz_hint(ip))
+ if (xfs_get_extsz_hint(ip)) {
return xfs_direct_write_iomap_begin(inode, offset, count,
flags, iomap, srcmap);
+ }
ASSERT(!XFS_IS_REALTIME_INODE(ip));
- xfs_ilock(ip, XFS_ILOCK_EXCL);
+ error = xfs_ilock_for_iomap(ip, flags, &lockmode);
+ if (error)
+ return error;
if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(&ip->i_df)) ||
XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
@@ -1167,7 +1171,7 @@ xfs_read_iomap_begin(
xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, length);
int nimaps = 1, error = 0;
bool shared = false;
- unsigned lockmode;
+ unsigned int lockmode = XFS_ILOCK_SHARED;
ASSERT(!(flags & (IOMAP_WRITE | IOMAP_ZERO)));
--
2.30.2
* [RFC PATCH v3 06/18] fs: Split off file_needs_remove_privs() and __file_remove_privs()
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (4 preceding siblings ...)
2022-05-18 23:36 ` [RFC PATCH v3 05/18] xfs: Add iomap " Stefan Roesch
@ 2022-05-18 23:36 ` Stefan Roesch
2022-05-18 23:36 ` [RFC PATCH v3 07/18] fs: Split off file_needs_update_time and __file_update_time Stefan Roesch
` (11 subsequent siblings)
17 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:36 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
This splits off the functions file_needs_remove_privs() and
__file_remove_privs() from the function file_remove_privs().
No intended functional changes in this patch.
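The motivation is that a nowait path can run the cheap check first and
punt only when the potentially blocking removal is actually needed; a
sketch of the pattern this split enables (patch 08 of this series
implements a variant of it):

	kill = file_needs_remove_privs(inode, dentry);
	if (kill < 0)
		return kill;
	if (kill && (flags & IOCB_NOWAIT))
		return -EAGAIN;	/* removal may block; punt to a worker */
	if (kill)
		return __file_remove_privs(file, inode, dentry, kill);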
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/inode.c | 75 +++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 55 insertions(+), 20 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 9d9b422504d1..1bb8b7db836f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2010,17 +2010,8 @@ static int __remove_privs(struct user_namespace *mnt_userns,
return notify_change(mnt_userns, dentry, &newattrs, NULL);
}
-/*
- * Remove special file priviledges (suid, capabilities) when file is written
- * to or truncated.
- */
-int file_remove_privs(struct file *file)
+static int file_needs_remove_privs(struct inode *inode, struct dentry *dentry)
{
- struct dentry *dentry = file_dentry(file);
- struct inode *inode = file_inode(file);
- int kill;
- int error = 0;
-
/*
* Fast path for nothing security related.
* As well for non-regular files, e.g. blkdev inodes.
@@ -2030,16 +2021,42 @@ int file_remove_privs(struct file *file)
if (IS_NOSEC(inode) || !S_ISREG(inode->i_mode))
return 0;
- kill = dentry_needs_remove_privs(dentry);
- if (kill < 0)
- return kill;
- if (kill)
- error = __remove_privs(file_mnt_user_ns(file), dentry, kill);
+ return dentry_needs_remove_privs(dentry);
+}
+
+static int __file_remove_privs(struct file *file, struct inode *inode,
+ struct dentry *dentry, int kill)
+{
+ int error = 0;
+
+ error = __remove_privs(file_mnt_user_ns(file), dentry, kill);
if (!error)
inode_has_no_xattr(inode);
return error;
}
+
+/**
+ * file_remove_privs - remove special file privileges (suid, capabilities)
+ * @file: file to remove privileges from
+ *
+ * When a file is modified by a write or truncation, ensure that special
+ * file privileges are removed.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int file_remove_privs(struct file *file)
+{
+ struct dentry *dentry = file_dentry(file);
+ struct inode *inode = file_inode(file);
+ int kill;
+
+ kill = file_needs_remove_privs(inode, dentry);
+ if (kill <= 0)
+ return kill;
+
+ return __file_remove_privs(file, inode, dentry, kill);
+}
EXPORT_SYMBOL(file_remove_privs);
/**
@@ -2090,18 +2107,36 @@ int file_update_time(struct file *file)
}
EXPORT_SYMBOL(file_update_time);
-/* Caller must hold the file's inode lock */
+/**
+ * file_modified - handle mandated vfs changes when modifying a file
+ * @file: file that was modified
+ *
+ * When a file has been modified, ensure that special
+ * file privileges are removed and time settings are updated.
+ *
+ * Context: Caller must hold the file's inode lock.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
int file_modified(struct file *file)
{
- int err;
+ int ret;
+ struct dentry *dentry = file_dentry(file);
+ struct inode *inode = file_inode(file);
/*
* Clear the security bits if the process is not being run by root.
* This keeps people from modifying setuid and setgid binaries.
*/
- err = file_remove_privs(file);
- if (err)
- return err;
+ ret = file_needs_remove_privs(inode, dentry);
+ if (ret < 0)
+ return ret;
+
+ if (ret > 0) {
+ ret = __file_remove_privs(file, inode, dentry, ret);
+ if (ret)
+ return ret;
+ }
if (unlikely(file->f_mode & FMODE_NOCMTIME))
return 0;
--
2.30.2
* [RFC PATCH v3 07/18] fs: Split off file_needs_update_time and __file_update_time
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (5 preceding siblings ...)
2022-05-18 23:36 ` [RFC PATCH v3 06/18] fs: Split off file_needs_remove_privs() and __file_remove_privs() Stefan Roesch
@ 2022-05-18 23:36 ` Stefan Roesch
2022-05-18 23:36 ` [RFC PATCH v3 08/18] xfs: Enable async write file modification handling Stefan Roesch
` (10 subsequent siblings)
17 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:36 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
This splits off the functions file_needs_update_time() and
__file_update_time() from the function file_update_time().
This is required to support async buffered writes.
No intended functional changes in this patch.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/inode.c | 75 +++++++++++++++++++++++++++++++++++-------------------
1 file changed, 49 insertions(+), 26 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 1bb8b7db836f..4bb7f583cc6b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2059,35 +2059,19 @@ int file_remove_privs(struct file *file)
}
EXPORT_SYMBOL(file_remove_privs);
-/**
- * file_update_time - update mtime and ctime time
- * @file: file accessed
- *
- * Update the mtime and ctime members of an inode and mark the inode
- * for writeback. Note that this function is meant exclusively for
- * usage in the file write path of filesystems, and filesystems may
- * choose to explicitly ignore update via this function with the
- * S_NOCMTIME inode flag, e.g. for network filesystem where these
- * timestamps are handled by the server. This can return an error for
- * file systems who need to allocate space in order to update an inode.
- */
-
-int file_update_time(struct file *file)
+static int file_needs_update_time(struct inode *inode, struct file *file,
+ struct timespec64 *now)
{
- struct inode *inode = file_inode(file);
- struct timespec64 now;
int sync_it = 0;
- int ret;
/* First try to exhaust all avenues to not sync */
if (IS_NOCMTIME(inode))
return 0;
- now = current_time(inode);
- if (!timespec64_equal(&inode->i_mtime, &now))
+ if (!timespec64_equal(&inode->i_mtime, now))
sync_it = S_MTIME;
- if (!timespec64_equal(&inode->i_ctime, &now))
+ if (!timespec64_equal(&inode->i_ctime, now))
sync_it |= S_CTIME;
if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
@@ -2096,15 +2080,49 @@ int file_update_time(struct file *file)
if (!sync_it)
return 0;
- /* Finally allowed to write? Takes lock. */
- if (__mnt_want_write_file(file))
- return 0;
+ return sync_it;
+}
+
+static int __file_update_time(struct inode *inode, struct file *file,
+ struct timespec64 *now, int sync_mode)
+{
+ int ret = 0;
- ret = inode_update_time(inode, &now, sync_it);
- __mnt_drop_write_file(file);
+ /* try to update time settings */
+ if (!__mnt_want_write_file(file)) {
+ ret = inode_update_time(inode, now, sync_mode);
+ __mnt_drop_write_file(file);
+ }
return ret;
}
+
+/**
+ * file_update_time - update mtime and ctime time
+ * @file: file accessed
+ *
+ * Update the mtime and ctime members of an inode and mark the inode for
+ * writeback. Note that this function is meant exclusively for usage in
+ * the file write path of filesystems, and filesystems may choose to
+ * explicitly ignore updates via this function with the S_NOCMTIME inode
+ * flag, e.g. for network filesystems where these timestamps are handled
+ * by the server. This can return an error for file systems that need to
+ * allocate space in order to update an inode.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int file_update_time(struct file *file)
+{
+ int ret;
+ struct inode *inode = file_inode(file);
+ struct timespec64 now = current_time(inode);
+
+ ret = file_needs_update_time(inode, file, &now);
+ if (ret <= 0)
+ return ret;
+
+ return __file_update_time(inode, file, &now, ret);
+}
EXPORT_SYMBOL(file_update_time);
/**
@@ -2123,6 +2141,7 @@ int file_modified(struct file *file)
int ret;
struct dentry *dentry = file_dentry(file);
struct inode *inode = file_inode(file);
+ struct timespec64 now = current_time(inode);
/*
* Clear the security bits if the process is not being run by root.
@@ -2141,7 +2160,11 @@ int file_modified(struct file *file)
if (unlikely(file->f_mode & FMODE_NOCMTIME))
return 0;
- return file_update_time(file);
+ ret = file_needs_update_time(inode, file, &now);
+ if (ret <= 0)
+ return ret;
+
+ return __file_update_time(inode, file, &now, ret);
}
EXPORT_SYMBOL(file_modified);
--
2.30.2
* [RFC PATCH v3 08/18] xfs: Enable async write file modification handling.
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (6 preceding siblings ...)
2022-05-18 23:36 ` [RFC PATCH v3 07/18] fs: Split off file_needs_update_time and __file_update_time Stefan Roesch
@ 2022-05-18 23:36 ` Stefan Roesch
2022-05-18 23:37 ` [RFC PATCH v3 09/18] fs: Optimization for concurrent file time updates Stefan Roesch
` (9 subsequent siblings)
17 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:36 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
This modifies the xfs file_modified() handling to return -EAGAIN if the
request either needs to remove privileges or needs to update the file
modification time. This is required for async buffered writes, so the
request gets handled in the io worker of io-uring.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/inode.c | 25 ++++++++++++++++++++++++-
fs/xfs/xfs_file.c | 2 +-
include/linux/fs.h | 1 +
3 files changed, 26 insertions(+), 2 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 4bb7f583cc6b..3a5d0fa468ab 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2137,6 +2137,27 @@ EXPORT_SYMBOL(file_update_time);
* Return: 0 on success, negative errno on failure.
*/
int file_modified(struct file *file)
+{
+ return file_modified_async(file, 0);
+}
+EXPORT_SYMBOL(file_modified);
+
+/**
+ * file_modified_async - handle mandated vfs changes when modifying a file
+ * @file: file that was modified
+ * @flags: kiocb flags
+ *
+ * When a file has been modified, ensure that special
+ * file privileges are removed and time settings are updated.
+ *
+ * If IOCB_NOWAIT is set, special file privileges will not be removed and
+ * time settings will not be updated. It will return -EAGAIN.
+ *
+ * Context: Caller must hold the file's inode lock.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int file_modified_async(struct file *file, int flags)
{
int ret;
struct dentry *dentry = file_dentry(file);
@@ -2163,10 +2184,12 @@ int file_modified(struct file *file)
ret = file_needs_update_time(inode, file, &now);
if (ret <= 0)
return ret;
+ if (flags & IOCB_NOWAIT)
+ return -EAGAIN;
return __file_update_time(inode, file, &now, ret);
}
-EXPORT_SYMBOL(file_modified);
+EXPORT_SYMBOL(file_modified_async);
int inode_needs_sync(struct inode *inode)
{
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5bddb1e9e0b3..793918c83755 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -410,7 +410,7 @@ xfs_file_write_checks(
spin_unlock(&ip->i_flags_lock);
out:
- return file_modified(file);
+ return file_modified_async(file, iocb->ki_flags);
}
static int
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3b479d02e210..9760283af7dc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2382,6 +2382,7 @@ static inline void file_accessed(struct file *file)
}
extern int file_modified(struct file *file);
+extern int file_modified_async(struct file *file, int flags);
int sync_inode_metadata(struct inode *inode, int wait);
--
2.30.2
* [RFC PATCH v3 09/18] fs: Optimization for concurrent file time updates.
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (7 preceding siblings ...)
2022-05-18 23:36 ` [RFC PATCH v3 08/18] xfs: Enable async write file modification handling Stefan Roesch
@ 2022-05-18 23:37 ` Stefan Roesch
2022-05-18 23:37 ` [RFC PATCH v3 10/18] xfs: Add async buffered write support Stefan Roesch
` (8 subsequent siblings)
17 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:37 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
This introduces the S_PENDING_TIME flag. If an async buffered write
needs to update the file time, it cannot be processed in the fast path
of io-uring and is punted to an io-worker; this flag is set while that
time update is pending. Other concurrent async buffered writes to the
same file do not need to wait while this time update is pending.
This reduces the number of async buffered writes that need to get punted
to the io-workers in io-uring.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/inode.c | 11 +++++++++--
include/linux/fs.h | 3 +++
2 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 3a5d0fa468ab..5c5021787780 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2184,10 +2184,17 @@ int file_modified_async(struct file *file, int flags)
ret = file_needs_update_time(inode, file, &now);
if (ret <= 0)
return ret;
- if (flags & IOCB_NOWAIT)
+ if (flags & IOCB_NOWAIT) {
+ if (IS_PENDING_TIME(inode))
+ return 0;
+
+ inode->i_flags |= S_PENDING_TIME;
return -EAGAIN;
+ }
- return __file_update_time(inode, file, &now, ret);
+ ret = __file_update_time(inode, file, &now, ret);
+ inode->i_flags &= ~S_PENDING_TIME;
+ return ret;
}
EXPORT_SYMBOL(file_modified_async);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9760283af7dc..5f3aaf61fb4b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2141,6 +2141,8 @@ struct super_operations {
#define S_CASEFOLD (1 << 15) /* Casefolded file */
#define S_VERITY (1 << 16) /* Verity file (using fs/verity/) */
#define S_KERNEL_FILE (1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */
+#define S_PENDING_TIME (1 << 18) /* File update time is pending */
+
/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -2183,6 +2185,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
#define IS_ENCRYPTED(inode) ((inode)->i_flags & S_ENCRYPTED)
#define IS_CASEFOLDED(inode) ((inode)->i_flags & S_CASEFOLD)
#define IS_VERITY(inode) ((inode)->i_flags & S_VERITY)
+#define IS_PENDING_TIME(inode) ((inode)->i_flags & S_PENDING_TIME)
#define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \
(inode)->i_rdev == WHITEOUT_DEV)
--
2.30.2
* [RFC PATCH v3 10/18] xfs: Add async buffered write support
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (8 preceding siblings ...)
2022-05-18 23:37 ` [RFC PATCH v3 09/18] fs: Optimization for concurrent file time updates Stefan Roesch
@ 2022-05-18 23:37 ` Stefan Roesch
2022-05-18 23:37 ` [RFC PATCH v3 11/18] io_uring: Add support for async buffered writes Stefan Roesch
` (7 subsequent siblings)
17 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:37 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
This adds the async buffered write support to XFS. For async buffered
write requests, the request will return -EAGAIN if the ilock cannot be
obtained immediately.
This splits off a new helper xfs_ilock_xfs_inode from the existing
helper xfs_ilock_iocb so it can be used here. The existing helper
cannot be used as it hardcodes the inode to be used.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/xfs/xfs_file.c | 32 +++++++++++++++-----------------
1 file changed, 15 insertions(+), 17 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 793918c83755..ad3175b7d366 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -190,14 +190,13 @@ xfs_file_fsync(
return error;
}
-static int
-xfs_ilock_iocb(
- struct kiocb *iocb,
+static inline int
+xfs_ilock_xfs_inode(
+ struct xfs_inode *ip,
+ int flags,
unsigned int lock_mode)
{
- struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
-
- if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (flags & IOCB_NOWAIT) {
if (!xfs_ilock_nowait(ip, lock_mode))
return -EAGAIN;
} else {
@@ -222,7 +221,7 @@ xfs_file_dio_read(
file_accessed(iocb->ki_filp);
- ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
+ ret = xfs_ilock_xfs_inode(ip, iocb->ki_flags, XFS_IOLOCK_SHARED);
if (ret)
return ret;
ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, 0);
@@ -244,7 +243,7 @@ xfs_file_dax_read(
if (!iov_iter_count(to))
return 0; /* skip atime */
- ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
+ ret = xfs_ilock_xfs_inode(ip, iocb->ki_flags, XFS_IOLOCK_SHARED);
if (ret)
return ret;
ret = dax_iomap_rw(iocb, to, &xfs_read_iomap_ops);
@@ -264,7 +263,7 @@ xfs_file_buffered_read(
trace_xfs_file_buffered_read(iocb, to);
- ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
+ ret = xfs_ilock_xfs_inode(ip, iocb->ki_flags, XFS_IOLOCK_SHARED);
if (ret)
return ret;
ret = generic_file_read_iter(iocb, to);
@@ -343,7 +342,7 @@ xfs_file_write_checks(
if (*iolock == XFS_IOLOCK_SHARED && !IS_NOSEC(inode)) {
xfs_iunlock(ip, *iolock);
*iolock = XFS_IOLOCK_EXCL;
- error = xfs_ilock_iocb(iocb, *iolock);
+ error = xfs_ilock_xfs_inode(ip, iocb->ki_flags, *iolock);
if (error) {
*iolock = 0;
return error;
@@ -516,7 +515,7 @@ xfs_file_dio_write_aligned(
int iolock = XFS_IOLOCK_SHARED;
ssize_t ret;
- ret = xfs_ilock_iocb(iocb, iolock);
+ ret = xfs_ilock_xfs_inode(ip, iocb->ki_flags, iolock);
if (ret)
return ret;
ret = xfs_file_write_checks(iocb, from, &iolock);
@@ -583,7 +582,7 @@ xfs_file_dio_write_unaligned(
flags = IOMAP_DIO_FORCE_WAIT;
}
- ret = xfs_ilock_iocb(iocb, iolock);
+ ret = xfs_ilock_xfs_inode(ip, iocb->ki_flags, iolock);
if (ret)
return ret;
@@ -659,7 +658,7 @@ xfs_file_dax_write(
ssize_t ret, error = 0;
loff_t pos;
- ret = xfs_ilock_iocb(iocb, iolock);
+ ret = xfs_ilock_xfs_inode(ip, iocb->ki_flags, iolock);
if (ret)
return ret;
ret = xfs_file_write_checks(iocb, from, &iolock);
@@ -702,12 +701,11 @@ xfs_file_buffered_write(
bool cleared_space = false;
int iolock;
- if (iocb->ki_flags & IOCB_NOWAIT)
- return -EOPNOTSUPP;
-
write_retry:
iolock = XFS_IOLOCK_EXCL;
- xfs_ilock(ip, iolock);
+ ret = xfs_ilock_xfs_inode(ip, iocb->ki_flags, iolock);
+ if (ret)
+ return ret;
ret = xfs_file_write_checks(iocb, from, &iolock);
if (ret)
--
2.30.2
* [RFC PATCH v3 11/18] io_uring: Add support for async buffered writes
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (9 preceding siblings ...)
2022-05-18 23:37 ` [RFC PATCH v3 10/18] xfs: Add async buffered write support Stefan Roesch
@ 2022-05-18 23:37 ` Stefan Roesch
2022-05-18 23:37 ` [RFC PATCH v3 12/18] mm: Move starting of background writeback into the main balancing loop Stefan Roesch
` (6 subsequent siblings)
17 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:37 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
This enables async buffered writes in io-uring for the filesystems that
support them. Buffered writes are enabled for blocks that are already
in the page cache or can be acquired with noio.
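From userspace the API is unchanged; a plain buffered (non-O_DIRECT)
write submitted through io_uring simply stops being punted to io-wq in
the common case. A minimal liburing example of the kind of write this
series accelerates (error handling trimmed; the file name and buffer
size are illustrative):

	#include <fcntl.h>
	#include <string.h>
	#include <liburing.h>

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;
		char buf[4096];
		int fd;

		memset(buf, 'a', sizeof(buf));
		io_uring_queue_init(8, &ring, 0);
		fd = open("testfile", O_WRONLY | O_CREAT, 0644);

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
		io_uring_submit(&ring);

		io_uring_wait_cqe(&ring, &cqe);	/* cqe->res: bytes written */
		io_uring_cqe_seen(&ring, cqe);
		io_uring_queue_exit(&ring);
		return 0;
	}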
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/io_uring.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 91de361ea9ab..f3aaac286509 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3746,7 +3746,7 @@ static inline int io_iter_do_read(struct io_kiocb *req, struct iov_iter *iter)
return -EINVAL;
}
-static bool need_read_all(struct io_kiocb *req)
+static bool need_complete_io(struct io_kiocb *req)
{
return req->flags & REQ_F_ISREG ||
S_ISBLK(file_inode(req->file)->i_mode);
@@ -3875,7 +3875,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
} else if (ret == -EIOCBQUEUED) {
goto out_free;
} else if (ret == req->result || ret <= 0 || !force_nonblock ||
- (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) {
+ (req->flags & REQ_F_NOWAIT) || !need_complete_io(req)) {
/* read all, failed, already did sync or don't want to retry */
goto done;
}
@@ -3971,9 +3971,10 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
if (unlikely(!io_file_supports_nowait(req)))
goto copy_iov;
- /* file path doesn't support NOWAIT for non-direct_IO */
- if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT) &&
- (req->flags & REQ_F_ISREG))
+ /* File path supports NOWAIT for non-direct_IO only for block devices. */
+ if (!(kiocb->ki_flags & IOCB_DIRECT) &&
+ !(kiocb->ki_filp->f_mode & FMODE_BUF_WASYNC) &&
+ (req->flags & REQ_F_ISREG))
goto copy_iov;
kiocb->ki_flags |= IOCB_NOWAIT;
@@ -4027,6 +4028,24 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
/* IOPOLL retry should happen for io-wq threads */
if (ret2 == -EAGAIN && (req->ctx->flags & IORING_SETUP_IOPOLL))
goto copy_iov;
+
+ if (ret2 != req->result && ret2 >= 0 && need_complete_io(req)) {
+ struct io_async_rw *rw;
+
+ /* This is a partial write. The file pos has already been
+ * updated, setup the async struct to complete the request
+ * in the worker. Also update bytes_done to account for
+ * the bytes already written.
+ */
+ iov_iter_save_state(&s->iter, &s->iter_state);
+ ret = io_setup_async_rw(req, iovec, s, true);
+
+ rw = req->async_data;
+ if (rw)
+ rw->bytes_done += ret2;
+
+ return ret ? ret : -EAGAIN;
+ }
done:
kiocb_done(req, ret2, issue_flags);
} else {
--
2.30.2
* [RFC PATCH v3 12/18] mm: Move starting of background writeback into the main balancing loop
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (10 preceding siblings ...)
2022-05-18 23:37 ` [RFC PATCH v3 11/18] io_uring: Add support for async buffered writes Stefan Roesch
@ 2022-05-18 23:37 ` Stefan Roesch
2022-05-18 23:37 ` [RFC PATCH v3 13/18] mm: Move updates of dirty_exceeded into one place Stefan Roesch
` (5 subsequent siblings)
17 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:37 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
From: Jan Kara <[email protected]>
We start background writeback if we are over the background threshold
after exiting the main loop in balance_dirty_pages(). This may result in
basing the decision on already stale values (we may have slept for a
significant amount of time) and it is also inconvenient for the
refactoring needed for async dirty throttling. Move the check into the
main waiting loop.
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Stefan Roesch <[email protected]>
---
mm/page-writeback.c | 31 ++++++++++++++-----------------
1 file changed, 14 insertions(+), 17 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7e2da284e427..8e5e003f0093 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1618,6 +1618,19 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
}
}
+ /*
+ * In laptop mode, we wait until hitting the higher threshold
+ * before starting background writeout, and then write out all
+ * the way down to the lower threshold. So slow writers cause
+ * minimal disk activity.
+ *
+ * In normal mode, we start background writeout at the lower
+ * background_thresh, to keep the amount of dirty memory low.
+ */
+ if (!laptop_mode && nr_reclaimable > gdtc->bg_thresh &&
+ !writeback_in_progress(wb))
+ wb_start_background_writeback(wb);
+
/*
* Throttle it only when the background writeback cannot
* catch-up. This avoids (excessively) small writeouts
@@ -1648,6 +1661,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
break;
}
+ /* Start writeback even when in laptop mode */
if (unlikely(!writeback_in_progress(wb)))
wb_start_background_writeback(wb);
@@ -1814,23 +1828,6 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
if (!dirty_exceeded && wb->dirty_exceeded)
wb->dirty_exceeded = 0;
-
- if (writeback_in_progress(wb))
- return;
-
- /*
- * In laptop mode, we wait until hitting the higher threshold before
- * starting background writeout, and then write out all the way down
- * to the lower threshold. So slow writers cause minimal disk activity.
- *
- * In normal mode, we start background writeout at the lower
- * background_thresh, to keep the amount of dirty memory low.
- */
- if (laptop_mode)
- return;
-
- if (nr_reclaimable > gdtc->bg_thresh)
- wb_start_background_writeback(wb);
}
static DEFINE_PER_CPU(int, bdp_ratelimits);
--
2.30.2
* [RFC PATCH v3 13/18] mm: Move updates of dirty_exceeded into one place
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (11 preceding siblings ...)
2022-05-18 23:37 ` [RFC PATCH v3 12/18] mm: Move starting of background writeback into the main balancing loop Stefan Roesch
@ 2022-05-18 23:37 ` Stefan Roesch
2022-05-18 23:37 ` [RFC PATCH v3 14/18] mm: Prepare balance_dirty_pages() for async buffered writes Stefan Roesch
` (4 subsequent siblings)
17 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:37 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
From: Jan Kara <[email protected]>
Transition of wb->dirty_exceeded from 0 to 1 happens before we go to
sleep in balance_dirty_pages() while transition from 1 to 0 happens when
exiting from balance_dirty_pages(), possibly based on old values. This
does not make a lot of sense since wb->dirty_exceeded should simply
reflect whether wb is over the dirty limit and so we should ratelimit
entering balance_dirty_pages() less. Move the two updates together.
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Stefan Roesch <[email protected]>
---
mm/page-writeback.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 8e5e003f0093..89dcc7d8395a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1720,8 +1720,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
sdtc = mdtc;
}
- if (dirty_exceeded && !wb->dirty_exceeded)
- wb->dirty_exceeded = 1;
+ if (dirty_exceeded != wb->dirty_exceeded)
+ wb->dirty_exceeded = dirty_exceeded;
if (time_is_before_jiffies(READ_ONCE(wb->bw_time_stamp) +
BANDWIDTH_INTERVAL))
@@ -1825,9 +1825,6 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
if (fatal_signal_pending(current))
break;
}
-
- if (!dirty_exceeded && wb->dirty_exceeded)
- wb->dirty_exceeded = 0;
}
static DEFINE_PER_CPU(int, bdp_ratelimits);
--
2.30.2
* [RFC PATCH v3 14/18] mm: Prepare balance_dirty_pages() for async buffered writes
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (12 preceding siblings ...)
2022-05-18 23:37 ` [RFC PATCH v3 13/18] mm: Move updates of dirty_exceeded into one place Stefan Roesch
@ 2022-05-18 23:37 ` Stefan Roesch
2022-05-18 23:37 ` [RFC PATCH v3 15/18] mm: Add balance_dirty_pages_ratelimited_async() function Stefan Roesch
` (3 subsequent siblings)
17 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:37 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
From: Jan Kara <[email protected]>
If balance_dirty_pages() gets called for an async buffered write, we don't
want to wait. Instead we need to indicate to the caller that throttling
is needed so that it can stop writing and offload the rest of the write
to a context that can block.
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Stefan Roesch <[email protected]>
---
mm/page-writeback.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 89dcc7d8395a..fc3b79acd90b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1545,8 +1545,8 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
* If we're over `background_thresh' then the writeback threads are woken to
* perform some writeout.
*/
-static void balance_dirty_pages(struct bdi_writeback *wb,
- unsigned long pages_dirtied)
+static int balance_dirty_pages(struct bdi_writeback *wb,
+ unsigned long pages_dirtied, bool nowait)
{
struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) };
@@ -1566,6 +1566,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
struct backing_dev_info *bdi = wb->bdi;
bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
unsigned long start_time = jiffies;
+ int ret = 0;
for (;;) {
unsigned long now = jiffies;
@@ -1794,6 +1795,10 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
period,
pause,
start_time);
+ if (nowait) {
+ ret = -EAGAIN;
+ break;
+ }
__set_current_state(TASK_KILLABLE);
wb->dirty_sleep = now;
io_schedule_timeout(pause);
@@ -1825,6 +1830,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
if (fatal_signal_pending(current))
break;
}
+ return ret;
}
static DEFINE_PER_CPU(int, bdp_ratelimits);
@@ -1906,7 +1912,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
preempt_enable();
if (unlikely(current->nr_dirtied >= ratelimit))
- balance_dirty_pages(wb, current->nr_dirtied);
+ balance_dirty_pages(wb, current->nr_dirtied, false);
wb_put(wb);
}
--
2.30.2
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [RFC PATCH v3 15/18] mm: Add balance_dirty_pages_ratelimited_async() function
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (13 preceding siblings ...)
2022-05-18 23:37 ` [RFC PATCH v3 14/18] mm: Prepare balance_dirty_pages() for async buffered writes Stefan Roesch
@ 2022-05-18 23:37 ` Stefan Roesch
2022-05-19 8:29 ` Christoph Hellwig
2022-05-18 23:37 ` [RFC PATCH v3 16/18] iomap: Use balance_dirty_pages_ratelimited_flags in iomap_write_iter Stefan Roesch
` (2 subsequent siblings)
17 siblings, 1 reply; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:37 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
This adds the helper function balance_dirty_pages_ratelimited_flags(),
which carries the body of balance_dirty_pages_ratelimited() plus a new
no_wait parameter. For async buffered writes no_wait will be true.
A new function called balance_dirty_pages_ratelimited_async() is
introduced that calls balance_dirty_pages_ratelimited_flags() with
no_wait set to true.
If write throttling is needed, it returns -EAGAIN, so the write request
can be punted to the io-uring worker.
For non-async writes the current behavior is maintained.
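A minimal sketch of the intended contract (caller context assumed;
mapping is the file's address_space):

	int ret = balance_dirty_pages_ratelimited_async(mapping);
	if (ret == -EAGAIN) {
		/* throttling needed: stop the nowait write here and
		 * punt it to a context that can block */
	}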
Signed-off-by: Stefan Roesch <[email protected]>
---
include/linux/writeback.h | 1 +
mm/page-writeback.c | 60 +++++++++++++++++++++++++++++----------
2 files changed, 46 insertions(+), 15 deletions(-)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index fec248ab1fec..15eb0242d3ef 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -373,6 +373,7 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
void wb_update_bandwidth(struct bdi_writeback *wb);
void balance_dirty_pages_ratelimited(struct address_space *mapping);
+int balance_dirty_pages_ratelimited_async(struct address_space *mapping);
bool wb_over_bg_thresh(struct bdi_writeback *wb);
typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index fc3b79acd90b..d6a67fc07c55 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1851,28 +1851,18 @@ static DEFINE_PER_CPU(int, bdp_ratelimits);
*/
DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
-/**
- * balance_dirty_pages_ratelimited - balance dirty memory state
- * @mapping: address_space which was dirtied
- *
- * Processes which are dirtying memory should call in here once for each page
- * which was newly dirtied. The function will periodically check the system's
- * dirty state and will initiate writeback if needed.
- *
- * Once we're over the dirty memory limit we decrease the ratelimiting
- * by a lot, to prevent individual processes from overshooting the limit
- * by (ratelimit_pages) each.
- */
-void balance_dirty_pages_ratelimited(struct address_space *mapping)
+static int balance_dirty_pages_ratelimited_flags(struct address_space *mapping,
+ bool no_wait)
{
struct inode *inode = mapping->host;
struct backing_dev_info *bdi = inode_to_bdi(inode);
struct bdi_writeback *wb = NULL;
int ratelimit;
+ int ret = 0;
int *p;
if (!(bdi->capabilities & BDI_CAP_WRITEBACK))
- return;
+ return ret;
if (inode_cgwb_enabled(inode))
wb = wb_get_create_current(bdi, GFP_KERNEL);
@@ -1912,12 +1902,52 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
preempt_enable();
if (unlikely(current->nr_dirtied >= ratelimit))
- balance_dirty_pages(wb, current->nr_dirtied, false);
+ balance_dirty_pages(wb, current->nr_dirtied, no_wait);
wb_put(wb);
+ return ret;
+}
+
+/**
+ * balance_dirty_pages_ratelimited - balance dirty memory state
+ * @mapping: address_space which was dirtied
+ *
+ * Processes which are dirtying memory should call in here once for each page
+ * which was newly dirtied. The function will periodically check the system's
+ * dirty state and will initiate writeback if needed.
+ *
+ * Once we're over the dirty memory limit we decrease the ratelimiting
+ * by a lot, to prevent individual processes from overshooting the limit
+ * by (ratelimit_pages) each.
+ */
+void balance_dirty_pages_ratelimited(struct address_space *mapping)
+{
+ balance_dirty_pages_ratelimited_flags(mapping, false);
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
+/**
+ * balance_dirty_pages_ratelimited_async - balance dirty memory state
+ * @mapping: address_space which was dirtied
+ *
+ * Processes which are dirtying memory should call in here once for each page
+ * which was newly dirtied. The function will periodically check the system's
+ * dirty state and will initiate writeback if needed.
+ *
+ * Once we're over the dirty memory limit we decrease the ratelimiting
+ * by a lot, to prevent individual processes from overshooting the limit
+ * by (ratelimit_pages) each.
+ *
+ * This is the async version of the API. It only checks if it is required to
+ * balance dirty pages. In case it needs to balance dirty pages, it returns
+ * -EAGAIN.
+ */
+int balance_dirty_pages_ratelimited_async(struct address_space *mapping)
+{
+ return balance_dirty_pages_ratelimited_flags(mapping, true);
+}
+EXPORT_SYMBOL(balance_dirty_pages_ratelimited_async);
+
/**
* wb_over_bg_thresh - does @wb need to be written back?
* @wb: bdi_writeback of interest
--
2.30.2
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v3 15/18] mm: Add balance_dirty_pages_ratelimited_async() function
2022-05-18 23:37 ` [RFC PATCH v3 15/18] mm: Add balance_dirty_pages_ratelimited_async() function Stefan Roesch
@ 2022-05-19 8:29 ` Christoph Hellwig
2022-05-19 8:54 ` Jan Kara
2022-05-20 18:29 ` Stefan Roesch
0 siblings, 2 replies; 35+ messages in thread
From: Christoph Hellwig @ 2022-05-19 8:29 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack
> +static int balance_dirty_pages_ratelimited_flags(struct address_space *mapping,
> + bool no_wait)
> {
This doesn't actually take flags, but a single boolean argument. So
either it needs a new name, or we actually pass a descriptive flag.
> +/**
> + * balance_dirty_pages_ratelimited_async - balance dirty memory state
> + * @mapping: address_space which was dirtied
> + *
> + * Processes which are dirtying memory should call in here once for each page
> + * which was newly dirtied. The function will periodically check the system's
> + * dirty state and will initiate writeback if needed.
> + *
> + * Once we're over the dirty memory limit we decrease the ratelimiting
> + * by a lot, to prevent individual processes from overshooting the limit
> + * by (ratelimit_pages) each.
> + *
> + * This is the async version of the API. It only checks if it is required to
> + * balance dirty pages. In case it needs to balance dirty pages, it returns
> + * -EAGAIN.
> + */
> +int balance_dirty_pages_ratelimited_async(struct address_space *mapping)
> +{
> + return balance_dirty_pages_ratelimited_flags(mapping, true);
> +}
> +EXPORT_SYMBOL(balance_dirty_pages_ratelimited_async);
I'd much rather export the underlying
balance_dirty_pages_ratelimited_flags helper than adding a pointless
wrapper here. And as long as only iomap is supported there is no need
to export it at all.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v3 15/18] mm: Add balance_dirty_pages_ratelimited_async() function
2022-05-19 8:29 ` Christoph Hellwig
@ 2022-05-19 8:54 ` Jan Kara
2022-05-20 18:32 ` Stefan Roesch
2022-05-20 18:29 ` Stefan Roesch
1 sibling, 1 reply; 35+ messages in thread
From: Jan Kara @ 2022-05-19 8:54 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Stefan Roesch, io-uring, kernel-team, linux-mm, linux-xfs,
linux-fsdevel, david, jack
On Thu 19-05-22 01:29:21, Christoph Hellwig wrote:
> > +static int balance_dirty_pages_ratelimited_flags(struct address_space *mapping,
> > + bool no_wait)
> > {
>
> This doesn't actually take flags, but a single boolean argument. So
> either it needs a new name, or we actually pass a descriptive flag.
>
> > +/**
> > + * balance_dirty_pages_ratelimited_async - balance dirty memory state
> > + * @mapping: address_space which was dirtied
> > + *
> > + * Processes which are dirtying memory should call in here once for each page
> > + * which was newly dirtied. The function will periodically check the system's
> > + * dirty state and will initiate writeback if needed.
> > + *
> > + * Once we're over the dirty memory limit we decrease the ratelimiting
> > + * by a lot, to prevent individual processes from overshooting the limit
> > + * by (ratelimit_pages) each.
> > + *
> > + * This is the async version of the API. It only checks if it is required to
> > + * balance dirty pages. In case it needs to balance dirty pages, it returns
> > + * -EAGAIN.
> > + */
> > +int balance_dirty_pages_ratelimited_async(struct address_space *mapping)
> > +{
> > + return balance_dirty_pages_ratelimited_flags(mapping, true);
> > +}
> > +EXPORT_SYMBOL(balance_dirty_pages_ratelimited_async);
>
> I'd much rather export the underlying
> balance_dirty_pages_ratelimited_flags helper than adding a pointless
> wrapper here. And as long as only iomap is supported there is no need
> to export it at all.
This was actually my suggestion so I take the blame ;) I suggested this
because I don't like non-static functions with bool arguments (it is
unnecessarily hard to understand what the argument means, grep for it,
etc.). If you don't like the wrapper, creating
int balance_dirty_pages_ratelimited_flags(struct address_space *mapping,
unsigned int flags)
and have something like:
#define BDP_NOWAIT 0x0001
is fine with me as well.
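Concretely, a nowait caller would then do something like (a sketch;
BDP_NOWAIT is the hypothetical flag from above):

	ret = balance_dirty_pages_ratelimited_flags(mapping, BDP_NOWAIT);
	if (ret == -EAGAIN)
		return ret;	/* throttling needed, punt to a blocking context */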
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v3 15/18] mm: Add balance_dirty_pages_ratelimited_async() function
2022-05-19 8:54 ` Jan Kara
@ 2022-05-20 18:32 ` Stefan Roesch
0 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-20 18:32 UTC (permalink / raw)
To: Jan Kara, Christoph Hellwig
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david
On 5/19/22 1:54 AM, Jan Kara wrote:
> On Thu 19-05-22 01:29:21, Christoph Hellwig wrote:
>>> +static int balance_dirty_pages_ratelimited_flags(struct address_space *mapping,
>>> + bool no_wait)
>>> {
>>
>> This doesn't actually take flags, but a single boolean argument. So
>> either it needs a new name, or we actually pass a descriptive flag.
>>
>>> +/**
>>> + * balance_dirty_pages_ratelimited_async - balance dirty memory state
>>> + * @mapping: address_space which was dirtied
>>> + *
>>> + * Processes which are dirtying memory should call in here once for each page
>>> + * which was newly dirtied. The function will periodically check the system's
>>> + * dirty state and will initiate writeback if needed.
>>> + *
>>> + * Once we're over the dirty memory limit we decrease the ratelimiting
>>> + * by a lot, to prevent individual processes from overshooting the limit
>>> + * by (ratelimit_pages) each.
>>> + *
>>> + * This is the async version of the API. It only checks if it is required to
>>> + * balance dirty pages. In case it needs to balance dirty pages, it returns
>>> + * -EAGAIN.
>>> + */
>>> +int balance_dirty_pages_ratelimited_async(struct address_space *mapping)
>>> +{
>>> + return balance_dirty_pages_ratelimited_flags(mapping, true);
>>> +}
>>> +EXPORT_SYMBOL(balance_dirty_pages_ratelimited_async);
>>
>> I'd much rather export the underlying
>> balance_dirty_pages_ratelimited_flags helper than adding a pointless
>> wrapper here. And as long as only iomap is supported there is no need
>> to export it at all.
>
> This was actually my suggestion so I take the blame ;) I have suggested
> this because I don't like non-static functions with bool arguments (it is
> unnecessarily complicated to understand what the argument means or grep for
> it etc.). If you don't like the wrapper, creating
>
> int balance_dirty_pages_ratelimited_flags(struct address_space *mapping,
> unsigned int flags)
>
> and have something like:
>
> #define BDP_NOWAIT 0x0001
>
I defined a BDP_ASYNC flag.
> is fine with me as well.
>
> Honza
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v3 15/18] mm: Add balance_dirty_pages_ratelimited_async() function
2022-05-19 8:29 ` Christoph Hellwig
2022-05-19 8:54 ` Jan Kara
@ 2022-05-20 18:29 ` Stefan Roesch
1 sibling, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-20 18:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack
On 5/19/22 1:29 AM, Christoph Hellwig wrote:
>> +static int balance_dirty_pages_ratelimited_flags(struct address_space *mapping,
>> + bool no_wait)
>> {
>
> This doesn't actually take flags, but a single boolean argument. So
> either it needs a new name, or we actually pass a descriptive flag.
>
>> +/**
>> + * balance_dirty_pages_ratelimited_async - balance dirty memory state
>> + * @mapping: address_space which was dirtied
>> + *
>> + * Processes which are dirtying memory should call in here once for each page
>> + * which was newly dirtied. The function will periodically check the system's
>> + * dirty state and will initiate writeback if needed.
>> + *
>> + * Once we're over the dirty memory limit we decrease the ratelimiting
>> + * by a lot, to prevent individual processes from overshooting the limit
>> + * by (ratelimit_pages) each.
>> + *
>> + * This is the async version of the API. It only checks if it is required to
>> + * balance dirty pages. In case it needs to balance dirty pages, it returns
>> + * -EAGAIN.
>> + */
>> +int balance_dirty_pages_ratelimited_async(struct address_space *mapping)
>> +{
>> + return balance_dirty_pages_ratelimited_flags(mapping, true);
>> +}
>> +EXPORT_SYMBOL(balance_dirty_pages_ratelimited_async);
>
> I'd much rather export the underlying
> balance_dirty_pages_ratelimited_flags helper than adding a pointless
> wrapper here. And as long as only iomap is supported there is no need
> to export it at all.
>
That's what I originally had. I'm reverting to it.
^ permalink raw reply [flat|nested] 35+ messages in thread
* [RFC PATCH v3 16/18] iomap: Use balance_dirty_pages_ratelimited_flags in iomap_write_iter
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (14 preceding siblings ...)
2022-05-18 23:37 ` [RFC PATCH v3 15/18] mm: Add balance_dirty_pages_ratelimited_async() function Stefan Roesch
@ 2022-05-18 23:37 ` Stefan Roesch
2022-05-19 8:32 ` Christoph Hellwig
2022-05-18 23:37 ` [RFC PATCH v3 17/18] io_uring: Add tracepoint for short writes Stefan Roesch
2022-05-18 23:37 ` [RFC PATCH v3 18/18] xfs: Enable async buffered write support Stefan Roesch
17 siblings, 1 reply; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:37 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
This replaces the call to balance_dirty_pages_ratelimited() with a call
to balance_dirty_pages_ratelimited_flags() (through the
balance_dirty_pages_ratelimited_async() wrapper for nowait writes). This
allows specifying whether the write request is async or not.
In addition this moves the call from the end of the write loop to its
beginning. If the call sits at the end of the loop and the decision is
made to throttle writes, then there is no request that io-uring can wait
on. With the call at the beginning, the write request is not issued but
returns -EAGAIN instead; io-uring will punt the request and process it
in the io-worker.
As a side effect of the move, write throttling happens one page later:
the check now runs before the current page is dirtied, so the page
dirtied in one iteration is only accounted for by the check in the next.
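In skeleton form (paraphrasing iomap_write_iter(); most details elided):

	do {
again:
		/* new: throttle check up front, before the page is touched */
		if (iter->flags & IOMAP_NOWAIT) {
			status = balance_dirty_pages_ratelimited_async(mapping);
			if (unlikely(status))
				break;	/* -EAGAIN, nothing issued yet */
		} else {
			balance_dirty_pages_ratelimited(mapping);
		}

		/* ... begin write, copy user data, mark the page dirty ... */

		/* old: balance_dirty_pages_ratelimited() used to be called
		 * here, after the page had already been dirtied */
	} while (iov_iter_count(i) && length);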
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/iomap/buffered-io.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index b029e2b10e07..2b85ddfa6ea1 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -784,6 +784,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
do {
struct folio *folio;
struct page *page;
+ struct address_space *i_mapping = iter->inode->i_mapping;
unsigned long offset; /* Offset into pagecache page */
unsigned long bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */
@@ -792,6 +793,14 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
bytes = min_t(unsigned long, PAGE_SIZE - offset,
iov_iter_count(i));
again:
+ if (iter->flags & IOMAP_NOWAIT) {
+ status = balance_dirty_pages_ratelimited_async(i_mapping);
+ if (unlikely(status))
+ break;
+ } else {
+ balance_dirty_pages_ratelimited(i_mapping);
+ }
+
if (bytes > length)
bytes = length;
@@ -815,7 +824,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
break;
page = folio_file_page(folio, pos >> PAGE_SHIFT);
- if (mapping_writably_mapped(iter->inode->i_mapping))
+ if (mapping_writably_mapped(i_mapping))
flush_dcache_page(page);
copied = copy_page_from_iter_atomic(page, offset, bytes, i);
@@ -840,8 +849,6 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
pos += status;
written += status;
length -= status;
-
- balance_dirty_pages_ratelimited(iter->inode->i_mapping);
} while (iov_iter_count(i) && length);
return written ? written : status;
--
2.30.2
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v3 16/18] iomap: Use balance_dirty_pages_ratelimited_flags in iomap_write_iter
2022-05-18 23:37 ` [RFC PATCH v3 16/18] iomap: Use balance_dirty_pages_ratelimited_flags in iomap_write_iter Stefan Roesch
@ 2022-05-19 8:32 ` Christoph Hellwig
2022-05-20 18:31 ` Stefan Roesch
0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2022-05-19 8:32 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -784,6 +784,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
> do {
> struct folio *folio;
> struct page *page;
> + struct address_space *i_mapping = iter->inode->i_mapping;
We tend to call these variables just mapping without the i_ prefix.
> again:
> + if (iter->flags & IOMAP_NOWAIT) {
> + status = balance_dirty_pages_ratelimited_async(i_mapping);
Which also nicely avoids the overly long line here.
> + if (unlikely(status))
> + break;
> + } else {
> + balance_dirty_pages_ratelimited(i_mapping);
> + }
Then again directly calling the underlying helper here would be simpler
to start with.
unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
...
status = balance_dirty_pages_ratelimited_flags(mapping,
bdp_flags);
if (status)
break;
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v3 16/18] iomap: Use balance_dirty_pages_ratelimited_flags in iomap_write_iter
2022-05-19 8:32 ` Christoph Hellwig
@ 2022-05-20 18:31 ` Stefan Roesch
0 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-20 18:31 UTC (permalink / raw)
To: Christoph Hellwig
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack
On 5/19/22 1:32 AM, Christoph Hellwig wrote:
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -784,6 +784,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>> do {
>> struct folio *folio;
>> struct page *page;
>> + struct address_space *i_mapping = iter->inode->i_mapping;
>
> We tend to call these variables just mapping without the i_ prefix.
>
Will change the name to mapping.
>> again:
>> + if (iter->flags & IOMAP_NOWAIT) {
>> + status = balance_dirty_pages_ratelimited_async(i_mapping);
>
> Which also nicely avoids the overly long line here.
>
>> + if (unlikely(status))
>> + break;
>> + } else {
>> + balance_dirty_pages_ratelimited(i_mapping);
>> + }
>
> Then again directly calling the underlying helper here would be simpler
> to start with.
>
> unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
>
> ...
>
>
> status = balance_dirty_pages_ratelimited_flags(mapping,
> bdp_flags);
> if (status)
> break;
>
I introduced the BDP_ASYNC define and used the above code. I also wired
it up accordingly in balance_dirty_pages().
^ permalink raw reply [flat|nested] 35+ messages in thread
* [RFC PATCH v3 17/18] io_uring: Add tracepoint for short writes
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (15 preceding siblings ...)
2022-05-18 23:37 ` [RFC PATCH v3 16/18] iomap: Use balance_dirty_pages_ratelimited_flags in iomap_write_iter Stefan Roesch
@ 2022-05-18 23:37 ` Stefan Roesch
2022-05-18 23:37 ` [RFC PATCH v3 18/18] xfs: Enable async buffered write support Stefan Roesch
17 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:37 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
This adds the io_uring_short_write tracepoint to io_uring. A short write
occurs when not all pages required for the write are in the page cache
and the async buffered write has to return -EAGAIN.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/io_uring.c | 3 +++
include/trace/events/io_uring.h | 25 +++++++++++++++++++++++++
2 files changed, 28 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index f3aaac286509..7435a9c2007f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4032,6 +4032,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
if (ret2 != req->result && ret2 >= 0 && need_complete_io(req)) {
struct io_async_rw *rw;
+ trace_io_uring_short_write(req->ctx, kiocb->ki_pos - ret2,
+ req->result, ret2);
+
/* This is a partial write. The file pos has already been
* updated, setup the async struct to complete the request
* in the worker. Also update bytes_done to account for
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index cddf5b6fbeb4..661834361d33 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -543,6 +543,31 @@ TRACE_EVENT(io_uring_req_failed,
(unsigned long long) __entry->pad2, __entry->error)
);
+TRACE_EVENT(io_uring_short_write,
+
+ TP_PROTO(void *ctx, u64 fpos, u64 wanted, u64 got),
+
+ TP_ARGS(ctx, fpos, wanted, got),
+
+ TP_STRUCT__entry(
+ __field(void *, ctx)
+ __field(u64, fpos)
+ __field(u64, wanted)
+ __field(u64, got)
+ ),
+
+ TP_fast_assign(
+ __entry->ctx = ctx;
+ __entry->fpos = fpos;
+ __entry->wanted = wanted;
+ __entry->got = got;
+ ),
+
+ TP_printk("ring %p, fpos %lld, wanted %lld, got %lld",
+ __entry->ctx, __entry->fpos,
+ __entry->wanted, __entry->got)
+);
+
#endif /* _TRACE_IO_URING_H */
/* This part must be outside protection */
--
2.30.2
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [RFC PATCH v3 18/18] xfs: Enable async buffered write support
2022-05-18 23:36 [RFC PATCH v3 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (16 preceding siblings ...)
2022-05-18 23:37 ` [RFC PATCH v3 17/18] io_uring: Add tracepoint for short writes Stefan Roesch
@ 2022-05-18 23:37 ` Stefan Roesch
2022-05-19 8:32 ` Christoph Hellwig
17 siblings, 1 reply; 35+ messages in thread
From: Stefan Roesch @ 2022-05-18 23:37 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack
This turns on async buffered write support for XFS.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/xfs/xfs_file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ad3175b7d366..af4fdc852da5 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1169,7 +1169,7 @@ xfs_file_open(
return -EFBIG;
if (xfs_is_shutdown(XFS_M(inode->i_sb)))
return -EIO;
- file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
+ file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC;
return 0;
}
--
2.30.2
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v3 18/18] xfs: Enable async buffered write support
2022-05-18 23:37 ` [RFC PATCH v3 18/18] xfs: Enable async buffered write support Stefan Roesch
@ 2022-05-19 8:32 ` Christoph Hellwig
2022-05-20 18:32 ` Stefan Roesch
0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2022-05-19 8:32 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack
On Wed, May 18, 2022 at 04:37:09PM -0700, Stefan Roesch wrote:
> This turns on the async buffered write support for XFS.
Can you group the patches by the code they are touching, i.e., first
VFS enablement, then MM, then iomap, then xfs? That should make
things a bit easier to review sequentially.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v3 18/18] xfs: Enable async buffered write support
2022-05-19 8:32 ` Christoph Hellwig
@ 2022-05-20 18:32 ` Stefan Roesch
0 siblings, 0 replies; 35+ messages in thread
From: Stefan Roesch @ 2022-05-20 18:32 UTC (permalink / raw)
To: Christoph Hellwig
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack
On 5/19/22 1:32 AM, Christoph Hellwig wrote:
> On Wed, May 18, 2022 at 04:37:09PM -0700, Stefan Roesch wrote:
>> This turns on the async buffered write support for XFS.
>
> Can you group the patches by the code they are touching, i.e., first
> VFS enablement, then MM, then iomap, then xfs? That should make
> things a bit easier to review sequentially.
I reordered the patches.
^ permalink raw reply [flat|nested] 35+ messages in thread