[PATCH v2 00/13] Support sync buffered writes for io-uring

public inbox for [email protected]
 help / color / mirror / Atom feed

* [PATCH v2 00/13] Support sync buffered writes for io-uring
@ 2022-02-18 19:57 Stefan Roesch
  2022-02-18 19:57 ` [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int Stefan Roesch
                   ` (13 more replies)
  0 siblings, 14 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
  To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr

This patch series adds support for async buffered writes. Currently
io-uring only supports buffered writes in the slow path, by processing
them in the io workers. With this patch series it is now possible to
support buffered writes in the fast path. To be able to use the fast
path the required pages must be in the page cache or they can be loaded
with noio. Otherwise they still get punted to the slow path.

If a buffered write request requires more than one page, it is possible
that only part of the request can use the fast path, the resst will be
completed by the io workers.

Support for async buffered writes:

  Patch 1: fs: Add flags parameter to __block_write_begin_int
    Add a flag parameter to the function __block_write_begin_int
    to allow specifying a nowait parameter.
    
  Patch 2: mm: Introduce do_generic_perform_write
    Introduce a new do_generic_perform_write function. The function
    is split off from the existing generic_perform_write() function.
    It allows to specify an additional flag parameter. This parameter
    is used to specify the nowait flag.
    
  Patch 3: mm: Add support for async buffered writes
    For async buffered writes allocate pages without blocking on the
    allocation.

  Patch 4: fs: split off __alloc_page_buffers function
    Split off __alloc_page_buffers() function with new gfp_t parameter.

  Patch 5: fs: split off __create_empty_buffers function
    Split off __create_empty_buffers() function with new gfp_t parameter.

  Patch 6: fs: Add gfp_t parameter to create_page_buffers()
    Add gfp_t parameter to create_page_buffers() function. Use atomic
    allocation for async buffered writes.

  Patch 7: fs: add support for async buffered writes
    Return -EAGAIN instead of -ENOMEM for async buffered writes. This
    will cause the write request to be processed by an io worker.

  Patch 8: io_uring: add support for async buffered writes
    This enables the async buffered writes for block devices in io_uring.
    Buffered writes are enabled for blocks that are already in the page
    cache or can be acquired with noio.

  Patch 9: io_uring: Add tracepoint for short writes

Support for write throttling of async buffered writes:

  Patch 10: sched: add new fields to task_struct
    Add two new fields to the task_struct. These fields store the
    deadline after which writes are no longer throttled.

  Patch 11: mm: support write throttling for async buffered writes
    This changes the balance_dirty_pages function to take an additonal
    parameter. When nowait is specified the write throttling code no
    longer waits synchronously for the deadline to expire. Instead
    it sets the fields in task_struct. Once the deadline expires the
    fields are reset.
    
  Patch 12: io_uring: support write throttling for async buffered writes
    Adds support to io_uring for write throttling. When the writes
    are throttled, the write requests are added to the pending io list.
    Once the write throttling deadline expires, the writes are submitted.
    
Enable async buffered write support
  Patch 13: fs: add flag to support async buffered writes
    This sets the flags that enables async buffered writes for block
    devices.


Testing:
  This patch has been tested with xfstests and fio.


Peformance results:
  For fio the following results have been obtained with a queue depth of
  1 and 4k block size (runtime 600 secs):

                 sequential writes:
                 without patch                 with patch
  throughput:       329 Mib/s                    1032Mib/s
  iops:              82k                          264k
  slat (nsec)      2332                          3340 
  clat (nsec)      9017                            60
                   
  CPU util%:         37%                          78%



                 random writes:
                 without patch                 with patch
  throughput:       307 Mib/s                    909Mib/s
  iops:              76k                         227k
  slat (nsec)      2419                         3780 
  clat (nsec)      9934                           59

  CPU util%:         57%                          88%

For an io depth of 1, the new patch improves throughput by close to 3
times and also the latency is considerably reduced. To achieve the same
or better performance with the exisiting code an io depth of 4 is required.

Especially for mixed workloads this is a considerable improvement.


Changes:
V2: - removed patch 3 from patch series 1
    - replaced parameter aop_flags with with gfp_t in create_page_buffers()
    - Moved gfp flags to callers of create_page_buffers()
    - Removed changing of FGP_NOWAIT in __filemap_get_folio() and moved gfp
      flags to caller of __filemap_get_folio()
    - Renamed AOP_FLAGS_NOWAIT to AOP_FLAG_NOWAIT



Stefan Roesch (13):
  fs: Add flags parameter to __block_write_begin_int
  mm: Introduce do_generic_perform_write
  mm: Add support for async buffered writes
  fs: split off __alloc_page_buffers function
  fs: split off __create_empty_buffers function
  fs: Add gfp_t parameter to create_page_buffers()
  fs: add support for async buffered writes
  io_uring: add support for async buffered writes
  io_uring: Add tracepoint for short writes
  sched: add new fields to task_struct
  mm: support write throttling for async buffered writes
  io_uring: support write throttling for async buffered writes
  block: enable async buffered writes for block devices.

 block/fops.c                    |   5 +-
 fs/buffer.c                     |  98 +++++++++++++++---------
 fs/internal.h                   |   3 +-
 fs/io_uring.c                   | 130 +++++++++++++++++++++++++++++---
 fs/iomap/buffered-io.c          |   4 +-
 fs/read_write.c                 |   3 +-
 include/linux/fs.h              |   4 +
 include/linux/sched.h           |   3 +
 include/linux/writeback.h       |   1 +
 include/trace/events/io_uring.h |  25 ++++++
 kernel/fork.c                   |   1 +
 mm/filemap.c                    |  23 ++++--
 mm/folio-compat.c               |  12 ++-
 mm/page-writeback.c             |  54 +++++++++----
 14 files changed, 289 insertions(+), 77 deletions(-)


base-commit: 9195e5e0adbb8a9a5ee9ef0f9dedf6340d827405
-- 
2.30.2


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
  2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
  2022-02-18 19:59   ` Matthew Wilcox
  2022-02-18 19:57 ` [PATCH v2 02/13] mm: Introduce do_generic_perform_write Stefan Roesch
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
  To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr

This adds a flags parameter to the __begin_write_begin_int() function.
This allows to pass flags down the stack.

Signed-off-by: Stefan Roesch <[email protected]>
---
 fs/buffer.c            | 7 ++++---
 fs/internal.h          | 3 ++-
 fs/iomap/buffered-io.c | 4 ++--
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 8e112b6bd371..6e6a69a12eed 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1970,7 +1970,8 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
 }
 
 int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
-		get_block_t *get_block, const struct iomap *iomap)
+			get_block_t *get_block, const struct iomap *iomap,
+			unsigned int flags)
 {
 	unsigned from = pos & (PAGE_SIZE - 1);
 	unsigned to = from + len;
@@ -2058,7 +2059,7 @@ int __block_write_begin(struct page *page, loff_t pos, unsigned len,
 		get_block_t *get_block)
 {
 	return __block_write_begin_int(page_folio(page), pos, len, get_block,
-				       NULL);
+				       NULL, 0);
 }
 EXPORT_SYMBOL(__block_write_begin);
 
@@ -2118,7 +2119,7 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
 	if (!page)
 		return -ENOMEM;
 
-	status = __block_write_begin(page, pos, len, get_block);
+	status = __block_write_begin_int(page_folio(page), pos, len, get_block, NULL, flags);
 	if (unlikely(status)) {
 		unlock_page(page);
 		put_page(page);
diff --git a/fs/internal.h b/fs/internal.h
index 8590c973c2f4..7432df23f3ce 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -38,7 +38,8 @@ static inline int emergency_thaw_bdev(struct super_block *sb)
  * buffer.c
  */
 int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
-		get_block_t *get_block, const struct iomap *iomap);
+			get_block_t *get_block, const struct iomap *iomap,
+			unsigned int flags);
 
 /*
  * char_dev.c
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6c51a75d0be6..47c519952725 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -646,7 +646,7 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	if (srcmap->type == IOMAP_INLINE)
 		status = iomap_write_begin_inline(iter, folio);
 	else if (srcmap->flags & IOMAP_F_BUFFER_HEAD)
-		status = __block_write_begin_int(folio, pos, len, NULL, srcmap);
+		status = __block_write_begin_int(folio, pos, len, NULL, srcmap, 0);
 	else
 		status = __iomap_write_begin(iter, pos, len, folio);
 
@@ -979,7 +979,7 @@ static loff_t iomap_folio_mkwrite_iter(struct iomap_iter *iter,
 
 	if (iter->iomap.flags & IOMAP_F_BUFFER_HEAD) {
 		ret = __block_write_begin_int(folio, iter->pos, length, NULL,
-					      &iter->iomap);
+					      &iter->iomap, 0);
 		if (ret)
 			return ret;
 		block_commit_write(&folio->page, 0, length);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
  2022-02-18 19:57 ` [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int Stefan Roesch
@ 2022-02-18 19:59   ` Matthew Wilcox
  2022-02-18 20:08     ` Stefan Roesch
  0 siblings, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2022-02-18 19:59 UTC (permalink / raw)
  To: Stefan Roesch; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team

On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
> This adds a flags parameter to the __begin_write_begin_int() function.
> This allows to pass flags down the stack.

Still no.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
  2022-02-18 19:59   ` Matthew Wilcox
@ 2022-02-18 20:08     ` Stefan Roesch
  2022-02-18 20:13       ` Matthew Wilcox
  0 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 20:08 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team



On 2/18/22 11:59 AM, Matthew Wilcox wrote:
> On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
>> This adds a flags parameter to the __begin_write_begin_int() function.
>> This allows to pass flags down the stack.
> 
> Still no.

Currently block_begin_write_cache is expecting an aop_flag. Are you asking to
first have a patch that replaces the existing aop_flag parameter with the gfp_t?
and then modify this patch to directly use gfp flags?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
  2022-02-18 20:08     ` Stefan Roesch
@ 2022-02-18 20:13       ` Matthew Wilcox
  2022-02-18 20:14         ` Stefan Roesch
  0 siblings, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2022-02-18 20:13 UTC (permalink / raw)
  To: Stefan Roesch; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team

On Fri, Feb 18, 2022 at 12:08:27PM -0800, Stefan Roesch wrote:
> 
> 
> On 2/18/22 11:59 AM, Matthew Wilcox wrote:
> > On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
> >> This adds a flags parameter to the __begin_write_begin_int() function.
> >> This allows to pass flags down the stack.
> > 
> > Still no.
> 
> Currently block_begin_write_cache is expecting an aop_flag. Are you asking to

There is no function by that name in Linus' tree.

> first have a patch that replaces the existing aop_flag parameter with the gfp_t?
> and then modify this patch to directly use gfp flags?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
  2022-02-18 20:13       ` Matthew Wilcox
@ 2022-02-18 20:14         ` Stefan Roesch
  2022-02-18 20:22           ` Matthew Wilcox
  0 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 20:14 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team



On 2/18/22 12:13 PM, Matthew Wilcox wrote:
> On Fri, Feb 18, 2022 at 12:08:27PM -0800, Stefan Roesch wrote:
>>
>>
>> On 2/18/22 11:59 AM, Matthew Wilcox wrote:
>>> On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
>>>> This adds a flags parameter to the __begin_write_begin_int() function.
>>>> This allows to pass flags down the stack.
>>>
>>> Still no.
>>
>> Currently block_begin_write_cache is expecting an aop_flag. Are you asking to
> 
> There is no function by that name in Linus' tree.
> 
>> first have a patch that replaces the existing aop_flag parameter with the gfp_t?
>> and then modify this patch to directly use gfp flags?

s/block_begin_write_cache/block_write_begin/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
  2022-02-18 20:14         ` Stefan Roesch
@ 2022-02-18 20:22           ` Matthew Wilcox
  2022-02-18 20:25             ` Stefan Roesch
  0 siblings, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2022-02-18 20:22 UTC (permalink / raw)
  To: Stefan Roesch; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team

On Fri, Feb 18, 2022 at 12:14:50PM -0800, Stefan Roesch wrote:
> 
> 
> On 2/18/22 12:13 PM, Matthew Wilcox wrote:
> > On Fri, Feb 18, 2022 at 12:08:27PM -0800, Stefan Roesch wrote:
> >>
> >>
> >> On 2/18/22 11:59 AM, Matthew Wilcox wrote:
> >>> On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
> >>>> This adds a flags parameter to the __begin_write_begin_int() function.
> >>>> This allows to pass flags down the stack.
> >>>
> >>> Still no.
> >>
> >> Currently block_begin_write_cache is expecting an aop_flag. Are you asking to
> > 
> > There is no function by that name in Linus' tree.
> > 
> >> first have a patch that replaces the existing aop_flag parameter with the gfp_t?
> >> and then modify this patch to directly use gfp flags?
> 
> s/block_begin_write_cache/block_write_begin/

I don't think there's any need to change the arguments to
block_write_begin().  That's widely used and I don't think changing
all the users is worth it.  You don't seem to call it anywhere in this
patch set.

But having block_write_begin() translate the aop flags into gfp
and fgp flags, yes.  It can call pagecache_get_page() instead of
grab_cache_page_write_begin().  And then you don't need to change
grab_cache_page_write_begin() at all.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
  2022-02-18 20:22           ` Matthew Wilcox
@ 2022-02-18 20:25             ` Stefan Roesch
  2022-02-18 20:35               ` Matthew Wilcox
  0 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 20:25 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team



On 2/18/22 12:22 PM, Matthew Wilcox wrote:
> On Fri, Feb 18, 2022 at 12:14:50PM -0800, Stefan Roesch wrote:
>>
>>
>> On 2/18/22 12:13 PM, Matthew Wilcox wrote:
>>> On Fri, Feb 18, 2022 at 12:08:27PM -0800, Stefan Roesch wrote:
>>>>
>>>>
>>>> On 2/18/22 11:59 AM, Matthew Wilcox wrote:
>>>>> On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
>>>>>> This adds a flags parameter to the __begin_write_begin_int() function.
>>>>>> This allows to pass flags down the stack.
>>>>>
>>>>> Still no.
>>>>
>>>> Currently block_begin_write_cache is expecting an aop_flag. Are you asking to
>>>
>>> There is no function by that name in Linus' tree.
>>>
>>>> first have a patch that replaces the existing aop_flag parameter with the gfp_t?
>>>> and then modify this patch to directly use gfp flags?
>>
>> s/block_begin_write_cache/block_write_begin/
> 
> I don't think there's any need to change the arguments to
> block_write_begin().  That's widely used and I don't think changing
> all the users is worth it.  You don't seem to call it anywhere in this
> patch set.
> 
> But having block_write_begin() translate the aop flags into gfp
> and fgp flags, yes.  It can call pagecache_get_page() instead of
> grab_cache_page_write_begin().  And then you don't need to change
> grab_cache_page_write_begin() at all.

That would still require adding a new aop flag (AOP_FLAG_NOWAIT).
You are ok with that?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
  2022-02-18 20:25             ` Stefan Roesch
@ 2022-02-18 20:35               ` Matthew Wilcox
  2022-02-18 20:39                 ` Stefan Roesch
  0 siblings, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2022-02-18 20:35 UTC (permalink / raw)
  To: Stefan Roesch; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team

On Fri, Feb 18, 2022 at 12:25:41PM -0800, Stefan Roesch wrote:
> 
> 
> On 2/18/22 12:22 PM, Matthew Wilcox wrote:
> > On Fri, Feb 18, 2022 at 12:14:50PM -0800, Stefan Roesch wrote:
> >>
> >>
> >> On 2/18/22 12:13 PM, Matthew Wilcox wrote:
> >>> On Fri, Feb 18, 2022 at 12:08:27PM -0800, Stefan Roesch wrote:
> >>>>
> >>>>
> >>>> On 2/18/22 11:59 AM, Matthew Wilcox wrote:
> >>>>> On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
> >>>>>> This adds a flags parameter to the __begin_write_begin_int() function.
> >>>>>> This allows to pass flags down the stack.
> >>>>>
> >>>>> Still no.
> >>>>
> >>>> Currently block_begin_write_cache is expecting an aop_flag. Are you asking to
> >>>
> >>> There is no function by that name in Linus' tree.
> >>>
> >>>> first have a patch that replaces the existing aop_flag parameter with the gfp_t?
> >>>> and then modify this patch to directly use gfp flags?
> >>
> >> s/block_begin_write_cache/block_write_begin/
> > 
> > I don't think there's any need to change the arguments to
> > block_write_begin().  That's widely used and I don't think changing
> > all the users is worth it.  You don't seem to call it anywhere in this
> > patch set.
> > 
> > But having block_write_begin() translate the aop flags into gfp
> > and fgp flags, yes.  It can call pagecache_get_page() instead of
> > grab_cache_page_write_begin().  And then you don't need to change
> > grab_cache_page_write_begin() at all.
> 
> That would still require adding a new aop flag (AOP_FLAG_NOWAIT).
> You are ok with that?

No new AOP_FLAG.  block_write_begin() does not get called with
AOP_FLAG_NOWAIT in this series.  You'd want to pass gfp flags to
__block_write_begin_int instead of aop flags.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
  2022-02-18 20:35               ` Matthew Wilcox
@ 2022-02-18 20:39                 ` Stefan Roesch
  0 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 20:39 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team



On 2/18/22 12:35 PM, Matthew Wilcox wrote:
> On Fri, Feb 18, 2022 at 12:25:41PM -0800, Stefan Roesch wrote:
>>
>>
>> On 2/18/22 12:22 PM, Matthew Wilcox wrote:
>>> On Fri, Feb 18, 2022 at 12:14:50PM -0800, Stefan Roesch wrote:
>>>>
>>>>
>>>> On 2/18/22 12:13 PM, Matthew Wilcox wrote:
>>>>> On Fri, Feb 18, 2022 at 12:08:27PM -0800, Stefan Roesch wrote:
>>>>>>
>>>>>>
>>>>>> On 2/18/22 11:59 AM, Matthew Wilcox wrote:
>>>>>>> On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
>>>>>>>> This adds a flags parameter to the __begin_write_begin_int() function.
>>>>>>>> This allows to pass flags down the stack.
>>>>>>>
>>>>>>> Still no.
>>>>>>
>>>>>> Currently block_begin_write_cache is expecting an aop_flag. Are you asking to
>>>>>
>>>>> There is no function by that name in Linus' tree.
>>>>>
>>>>>> first have a patch that replaces the existing aop_flag parameter with the gfp_t?
>>>>>> and then modify this patch to directly use gfp flags?
>>>>
>>>> s/block_begin_write_cache/block_write_begin/
>>>
>>> I don't think there's any need to change the arguments to
>>> block_write_begin().  That's widely used and I don't think changing
>>> all the users is worth it.  You don't seem to call it anywhere in this
>>> patch set.
>>>
>>> But having block_write_begin() translate the aop flags into gfp
>>> and fgp flags, yes.  It can call pagecache_get_page() instead of
>>> grab_cache_page_write_begin().  And then you don't need to change
>>> grab_cache_page_write_begin() at all.
>>
>> That would still require adding a new aop flag (AOP_FLAG_NOWAIT).
>> You are ok with that?
> 
> No new AOP_FLAG.  block_write_begin() does not get called with
> AOP_FLAG_NOWAIT in this series.  You'd want to pass gfp flags to
> __block_write_begin_int instead of aop flags.

v2 of the patch series is using  AOP_FLAG_NOWAIT in block_write_begin().
Without introducing a new aop flag, how would I know in block_write_begin()
that the request is a nowait async buffered write?



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 02/13] mm: Introduce do_generic_perform_write
  2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
  2022-02-18 19:57 ` [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
  2022-02-18 19:57 ` [PATCH v2 03/13] mm: Add support for async buffered writes Stefan Roesch
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
  To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr

This splits off the do generic_perform_write() function, so an
additional flags parameter can be specified. It uses the new flag
parameter to support async buffered writes.

Signed-off-by: Stefan Roesch <[email protected]>
---
 include/linux/fs.h |  1 +
 mm/filemap.c       | 20 +++++++++++++++-----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e2d892b201b0..b7dd5bd701c0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -278,6 +278,7 @@ enum positive_aop_returns {
 #define AOP_FLAG_NOFS			0x0002 /* used by filesystem to direct
 						* helper code (eg buffer layer)
 						* to clear GFP_FS from alloc */
+#define AOP_FLAG_NOWAIT			0x0004 /* async nowait buffered writes */
 
 /*
  * oh the beauties of C type declarations.
diff --git a/mm/filemap.c b/mm/filemap.c
index ad8c39d90bf9..5bd692a327d0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3725,14 +3725,13 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 }
 EXPORT_SYMBOL(generic_file_direct_write);
 
-ssize_t generic_perform_write(struct file *file,
-				struct iov_iter *i, loff_t pos)
+static ssize_t do_generic_perform_write(struct file *file, struct iov_iter *i,
+					loff_t pos, int flags)
 {
 	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
 	long status = 0;
 	ssize_t written = 0;
-	unsigned int flags = 0;
 
 	do {
 		struct page *page;
@@ -3801,6 +3800,12 @@ ssize_t generic_perform_write(struct file *file,
 
 	return written ? written : status;
 }
+
+ssize_t generic_perform_write(struct file *file,
+				struct iov_iter *i, loff_t pos)
+{
+	return do_generic_perform_write(file, i, pos, 0);
+}
 EXPORT_SYMBOL(generic_perform_write);
 
 /**
@@ -3832,6 +3837,10 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t		written = 0;
 	ssize_t		err;
 	ssize_t		status;
+	int		flags = 0;
+
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		flags |= AOP_FLAG_NOWAIT;
 
 	/* We can write back this queue in page reclaim */
 	current->backing_dev_info = inode_to_bdi(inode);
@@ -3857,7 +3866,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
 			goto out;
 
-		status = generic_perform_write(file, from, pos = iocb->ki_pos);
+		status = do_generic_perform_write(file, from, pos = iocb->ki_pos, flags);
+
 		/*
 		 * If generic_perform_write() returned a synchronous error
 		 * then we want to return the number of bytes which were
@@ -3889,7 +3899,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			 */
 		}
 	} else {
-		written = generic_perform_write(file, from, iocb->ki_pos);
+		written = do_generic_perform_write(file, from, iocb->ki_pos, flags);
 		if (likely(written > 0))
 			iocb->ki_pos += written;
 	}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 03/13] mm: Add support for async buffered writes
  2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
  2022-02-18 19:57 ` [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int Stefan Roesch
  2022-02-18 19:57 ` [PATCH v2 02/13] mm: Introduce do_generic_perform_write Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
  2022-02-18 19:57 ` [PATCH v2 04/13] fs: split off __alloc_page_buffers function Stefan Roesch
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
  To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr

This adds support for async buffered writes in the mm layer. When the
AOP_FLAG_NOWAIT flag is set, if the page is not already loaded,
the page gets created without blocking on the allocation.

Signed-off-by: Stefan Roesch <[email protected]>
---
 mm/filemap.c      |  1 +
 mm/folio-compat.c | 12 ++++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 5bd692a327d0..f4e2036c5029 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -42,6 +42,7 @@
 #include <linux/ramfs.h>
 #include <linux/page_idle.h>
 #include <linux/migrate.h>
+#include <linux/sched/mm.h>
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include "internal.h"
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 749555a232a8..8243eeb883c1 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -133,11 +133,19 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 					pgoff_t index, unsigned flags)
 {
 	unsigned fgp_flags = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE;
+	gfp_t gfp = mapping_gfp_mask(mapping);
 
 	if (flags & AOP_FLAG_NOFS)
 		fgp_flags |= FGP_NOFS;
-	return pagecache_get_page(mapping, index, fgp_flags,
-			mapping_gfp_mask(mapping));
+
+	if (flags & AOP_FLAG_NOWAIT) {
+		fgp_flags |= FGP_NOWAIT;
+
+		gfp |= GFP_ATOMIC;
+		gfp &= ~__GFP_DIRECT_RECLAIM;
+	}
+
+	return pagecache_get_page(mapping, index, fgp_flags, gfp);
 }
 EXPORT_SYMBOL(grab_cache_page_write_begin);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 04/13] fs: split off __alloc_page_buffers function
  2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
                   ` (2 preceding siblings ...)
  2022-02-18 19:57 ` [PATCH v2 03/13] mm: Add support for async buffered writes Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
  2022-02-18 20:42   ` Matthew Wilcox
  2022-02-19  7:35   ` Christoph Hellwig
  2022-02-18 19:57 ` [PATCH v2 05/13] fs: split off __create_empty_buffers function Stefan Roesch
                   ` (9 subsequent siblings)
  13 siblings, 2 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
  To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr

This splits off the __alloc_page_buffers() function from the
alloc_page_buffers_function(). In addition it adds a gfp_t parameter, so
the caller can specify the allocation flags.

Signed-off-by: Stefan Roesch <[email protected]>
---
 fs/buffer.c | 37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 6e6a69a12eed..2858eaf433c8 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -802,26 +802,13 @@ int remove_inode_buffers(struct inode *inode)
 	return ret;
 }
 
-/*
- * Create the appropriate buffers when given a page for data area and
- * the size of each buffer.. Use the bh->b_this_page linked list to
- * follow the buffers created.  Return NULL if unable to create more
- * buffers.
- *
- * The retry flag is used to differentiate async IO (paging, swapping)
- * which may not fail from ordinary buffer allocations.
- */
-struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
-		bool retry)
+static struct buffer_head *__alloc_page_buffers(struct page *page,
+						unsigned long size, gfp_t gfp)
 {
 	struct buffer_head *bh, *head;
-	gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
 	long offset;
 	struct mem_cgroup *memcg, *old_memcg;
 
-	if (retry)
-		gfp |= __GFP_NOFAIL;
-
 	/* The page lock pins the memcg */
 	memcg = page_memcg(page);
 	old_memcg = set_active_memcg(memcg);
@@ -859,6 +846,26 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
 
 	goto out;
 }
+
+/*
+ * Create the appropriate buffers when given a page for data area and
+ * the size of each buffer.. Use the bh->b_this_page linked list to
+ * follow the buffers created.  Return NULL if unable to create more
+ * buffers.
+ *
+ * The retry flag is used to differentiate async IO (paging, swapping)
+ * which may not fail from ordinary buffer allocations.
+ */
+struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
+		bool retry)
+{
+	gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
+
+	if (retry)
+		gfp |= __GFP_NOFAIL;
+
+	return __alloc_page_buffers(page, size, gfp);
+}
 EXPORT_SYMBOL_GPL(alloc_page_buffers);
 
 static inline void
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
  2022-02-18 19:57 ` [PATCH v2 04/13] fs: split off __alloc_page_buffers function Stefan Roesch
@ 2022-02-18 20:42   ` Matthew Wilcox
  2022-02-18 20:50     ` Stefan Roesch
  2022-02-19  7:35   ` Christoph Hellwig
  1 sibling, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2022-02-18 20:42 UTC (permalink / raw)
  To: Stefan Roesch; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team

On Fri, Feb 18, 2022 at 11:57:30AM -0800, Stefan Roesch wrote:
> This splits off the __alloc_page_buffers() function from the
> alloc_page_buffers_function(). In addition it adds a gfp_t parameter, so
> the caller can specify the allocation flags.

This one only has six callers, so let's get the API right.  I suggest
making this:

struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
		gfp_t gfp)
{
	gfp |= __GFP_ACCOUNT;

and then all the existing callers specify either GFP_NOFS or
GFP_NOFS | __GFP_NOFAIL.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
  2022-02-18 20:42   ` Matthew Wilcox
@ 2022-02-18 20:50     ` Stefan Roesch
  0 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 20:50 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team



On 2/18/22 12:42 PM, Matthew Wilcox wrote:
> On Fri, Feb 18, 2022 at 11:57:30AM -0800, Stefan Roesch wrote:
>> This splits off the __alloc_page_buffers() function from the
>> alloc_page_buffers_function(). In addition it adds a gfp_t parameter, so
>> the caller can specify the allocation flags.
> 
> This one only has six callers, so let's get the API right.  I suggest
> making this:
> 
> struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
> 		gfp_t gfp)
> {
> 	gfp |= __GFP_ACCOUNT;
> 
> and then all the existing callers specify either GFP_NOFS or
> GFP_NOFS | __GFP_NOFAIL.
> 


I can make that change, but i don't see how i can decide in block_write_begin()
to use different gfp flags when an async buffered write request is processed?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
  2022-02-18 19:57 ` [PATCH v2 04/13] fs: split off __alloc_page_buffers function Stefan Roesch
  2022-02-18 20:42   ` Matthew Wilcox
@ 2022-02-19  7:35   ` Christoph Hellwig
  2022-02-20  4:23     ` Matthew Wilcox
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2022-02-19  7:35 UTC (permalink / raw)
  To: Stefan Roesch; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team

Err, hell no.  Please do not add any new functionality to the legacy
buffer head code.  If you want new features do that on the
non-bufferhead iomap code path only please.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
  2022-02-19  7:35   ` Christoph Hellwig
@ 2022-02-20  4:23     ` Matthew Wilcox
  2022-02-20  4:38       ` Jens Axboe
  2022-02-22  8:18       ` Christoph Hellwig
  0 siblings, 2 replies; 32+ messages in thread
From: Matthew Wilcox @ 2022-02-20  4:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stefan Roesch, io-uring, linux-fsdevel, linux-block, kernel-team

On Fri, Feb 18, 2022 at 11:35:10PM -0800, Christoph Hellwig wrote:
> Err, hell no.  Please do not add any new functionality to the legacy
> buffer head code.  If you want new features do that on the
> non-bufferhead iomap code path only please.

I think "first convert the block device code from buffer_heads to iomap"
might be a bit much of a prerequisite.  I think running ext4 on top of a
block device still requires buffer_heads, for example (I tried to convert
the block device to use mpage in order to avoid creating buffer_heads
when possible, and ext4 stopped working.  I didn't try too hard to debug
it as it was a bit of a distraction at the time).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
  2022-02-20  4:23     ` Matthew Wilcox
@ 2022-02-20  4:38       ` Jens Axboe
  2022-02-20  4:51         ` Jens Axboe
  2022-02-22  8:18       ` Christoph Hellwig
  1 sibling, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-02-20  4:38 UTC (permalink / raw)
  To: Matthew Wilcox, Christoph Hellwig
  Cc: Stefan Roesch, io-uring, linux-fsdevel, linux-block, kernel-team

On 2/19/22 9:23 PM, Matthew Wilcox wrote:
> On Fri, Feb 18, 2022 at 11:35:10PM -0800, Christoph Hellwig wrote:
>> Err, hell no.  Please do not add any new functionality to the legacy
>> buffer head code.  If you want new features do that on the
>> non-bufferhead iomap code path only please.
> 
> I think "first convert the block device code from buffer_heads to
> iomap" might be a bit much of a prerequisite.  I think running ext4 on
> top of a

Yes, that's exactly what Christoph was trying to say, but failing to
state in an appropriate manner. And we did actually discuss that, I'm
not against doing something like that.

> block device still requires buffer_heads, for example (I tried to convert
> the block device to use mpage in order to avoid creating buffer_heads
> when possible, and ext4 stopped working.  I didn't try too hard to debug
> it as it was a bit of a distraction at the time).

That's one of the main reasons why I didn't push this particular path,
as it is a bit fraught with weirdness and legacy buffer_head code which
isn't that easy to tackle...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
  2022-02-20  4:38       ` Jens Axboe
@ 2022-02-20  4:51         ` Jens Axboe
  0 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2022-02-20  4:51 UTC (permalink / raw)
  To: Matthew Wilcox, Christoph Hellwig
  Cc: Stefan Roesch, io-uring, linux-fsdevel, linux-block, kernel-team

On 2/19/22 9:38 PM, Jens Axboe wrote:
> On 2/19/22 9:23 PM, Matthew Wilcox wrote:
>> On Fri, Feb 18, 2022 at 11:35:10PM -0800, Christoph Hellwig wrote:
>>> Err, hell no.  Please do not add any new functionality to the legacy
>>> buffer head code.  If you want new features do that on the
>>> non-bufferhead iomap code path only please.
>>
>> I think "first convert the block device code from buffer_heads to
>> iomap" might be a bit much of a prerequisite.  I think running ext4 on
>> top of a
> 
> Yes, that's exactly what Christoph was trying to say, but failing to
> state in an appropriate manner. And we did actually discuss that, I'm
> not against doing something like that.

Just to be clear, I do agree with you that it's an unfair ask for this
change. And as you mentioned, ext4 would require the buffer_head code
to be touched anyway, just layering on top of the necessary changes
for the bdev code.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
  2022-02-20  4:23     ` Matthew Wilcox
  2022-02-20  4:38       ` Jens Axboe
@ 2022-02-22  8:18       ` Christoph Hellwig
  2022-02-22 23:19         ` Jens Axboe
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2022-02-22  8:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Stefan Roesch, io-uring, linux-fsdevel,
	linux-block, kernel-team

On Sun, Feb 20, 2022 at 04:23:50AM +0000, Matthew Wilcox wrote:
> On Fri, Feb 18, 2022 at 11:35:10PM -0800, Christoph Hellwig wrote:
> > Err, hell no.  Please do not add any new functionality to the legacy
> > buffer head code.  If you want new features do that on the
> > non-bufferhead iomap code path only please.
> 
> I think "first convert the block device code from buffer_heads to iomap"
> might be a bit much of a prerequisite.  I think running ext4 on top of a
> block device still requires buffer_heads, for example (I tried to convert
> the block device to use mpage in order to avoid creating buffer_heads
> when possible, and ext4 stopped working.  I didn't try too hard to debug
> it as it was a bit of a distraction at the time).

Oh, I did not spot the users here is the block device.  Which is really
weird, why would anyone do buffered writes to a block devices?  Doing
so is a bit of a data integrity nightmare.

Can we please develop this feature for iomap based file systems first,
and if by then a use case for block devices arises I'll see what we can
do there.  I've been planning to get the block device code to stop using
buffer_heads by default, but taking them into account if used by a
legacy buffer_head user anyway.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
  2022-02-22  8:18       ` Christoph Hellwig
@ 2022-02-22 23:19         ` Jens Axboe
  0 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2022-02-22 23:19 UTC (permalink / raw)
  To: Christoph Hellwig, Matthew Wilcox
  Cc: Stefan Roesch, io-uring, linux-fsdevel, linux-block, kernel-team

On 2/22/22 1:18 AM, Christoph Hellwig wrote:
> On Sun, Feb 20, 2022 at 04:23:50AM +0000, Matthew Wilcox wrote:
>> On Fri, Feb 18, 2022 at 11:35:10PM -0800, Christoph Hellwig wrote:
>>> Err, hell no.  Please do not add any new functionality to the legacy
>>> buffer head code.  If you want new features do that on the
>>> non-bufferhead iomap code path only please.
>>
>> I think "first convert the block device code from buffer_heads to iomap"
>> might be a bit much of a prerequisite.  I think running ext4 on top of a
>> block device still requires buffer_heads, for example (I tried to convert
>> the block device to use mpage in order to avoid creating buffer_heads
>> when possible, and ext4 stopped working.  I didn't try too hard to debug
>> it as it was a bit of a distraction at the time).
> 
> Oh, I did not spot the users here is the block device.  Which is
> really weird, why would anyone do buffered writes to a block devices?
> Doing so is a bit of a data integrity nightmare.
> 
> Can we please develop this feature for iomap based file systems first,
> and if by then a use case for block devices arises I'll see what we
> can do there.

The original plan wasn't to develop bdev async writes as a separate
useful feature, but rather to do it as a first step to both become
acquainted with the code base and solve some of the common issues for
both.

The fact that we need to touch buffer_heads for the bdev path is
annoying, and something that I'd very much rather just avoid. And
converting bdev to iomap first is a waste of time, exactly because it's
not a separately useful feature.

Hence I think we'll change gears here and start with iomap and XFS
instead.

> I've been planning to get the block device code to stop using
> buffer_heads by default, but taking them into account if used by a
> legacy buffer_head user anyway.

That would indeed be great, and to be honest, the current code for bdev
read/write doesn't make much sense outside of from a historical point of
view.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 05/13] fs: split off __create_empty_buffers function
  2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
                   ` (3 preceding siblings ...)
  2022-02-18 19:57 ` [PATCH v2 04/13] fs: split off __alloc_page_buffers function Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
  2022-02-18 19:57 ` [PATCH v2 06/13] fs: Add gfp_t parameter to create_page_buffers() Stefan Roesch
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
  To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr

This splits off the function __create_empty_buffers() from the function
create_empty_buffers. The __create_empty_buffers has an additional gfp
parameter. This allows the caller to specify the allocation properties.

Signed-off-by: Stefan Roesch <[email protected]>
---
 fs/buffer.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 2858eaf433c8..648e1cba6da3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1554,17 +1554,12 @@ void block_invalidatepage(struct page *page, unsigned int offset,
 EXPORT_SYMBOL(block_invalidatepage);
 
 
-/*
- * We attach and possibly dirty the buffers atomically wrt
- * __set_page_dirty_buffers() via private_lock.  try_to_free_buffers
- * is already excluded via the page lock.
- */
-void create_empty_buffers(struct page *page,
-			unsigned long blocksize, unsigned long b_state)
+static void __create_empty_buffers(struct page *page, unsigned long blocksize,
+				unsigned long b_state, gfp_t gfp)
 {
 	struct buffer_head *bh, *head, *tail;
 
-	head = alloc_page_buffers(page, blocksize, true);
+	head = __alloc_page_buffers(page, blocksize, gfp);
 	bh = head;
 	do {
 		bh->b_state |= b_state;
@@ -1587,6 +1582,17 @@ void create_empty_buffers(struct page *page,
 	attach_page_private(page, head);
 	spin_unlock(&page->mapping->private_lock);
 }
+/*
+ * We attach and possibly dirty the buffers atomically wrt
+ * __set_page_dirty_buffers() via private_lock.  try_to_free_buffers
+ * is already excluded via the page lock.
+ */
+void create_empty_buffers(struct page *page,
+			unsigned long blocksize, unsigned long b_state)
+{
+	return __create_empty_buffers(page, blocksize, b_state,
+				GFP_NOFS | __GFP_ACCOUNT | __GFP_NOFAIL);
+}
 EXPORT_SYMBOL(create_empty_buffers);
 
 /**
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 06/13] fs: Add gfp_t parameter to create_page_buffers()
  2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
                   ` (4 preceding siblings ...)
  2022-02-18 19:57 ` [PATCH v2 05/13] fs: split off __create_empty_buffers function Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
  2022-02-21  0:18   ` kernel test robot
  2022-02-18 19:57 ` [PATCH v2 07/13] fs: add support for async buffered writes Stefan Roesch
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
  To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr

This adds the gfp_t parameter to the create_page_buffers function.
This allows the caller to specify the required parameters.

Signed-off-by: Stefan Roesch <[email protected]>
---
 fs/buffer.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 648e1cba6da3..ae588ae4b1c1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1682,13 +1682,20 @@ static inline int block_size_bits(unsigned int blocksize)
 	return ilog2(blocksize);
 }
 
-static struct buffer_head *create_page_buffers(struct page *page, struct inode *inode, unsigned int b_state)
+static struct buffer_head *create_page_buffers(struct page *page,
+					struct inode *inode,
+					unsigned int b_state,
+					gfp_t flags)
 {
 	BUG_ON(!PageLocked(page));
 
-	if (!page_has_buffers(page))
-		create_empty_buffers(page, 1 << READ_ONCE(inode->i_blkbits),
-				     b_state);
+	if (!page_has_buffers(page)) {
+		gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT | flags;
+
+		__create_empty_buffers(page, 1 << READ_ONCE(inode->i_blkbits),
+				     b_state, gfp);
+	}
+
 	return page_buffers(page);
 }
 
@@ -1734,7 +1741,7 @@ int __block_write_full_page(struct inode *inode, struct page *page,
 	int write_flags = wbc_to_write_flags(wbc);
 
 	head = create_page_buffers(page, inode,
-					(1 << BH_Dirty)|(1 << BH_Uptodate));
+					(1 << BH_Dirty)|(1 << BH_Uptodate), __GFP_NOFAIL);
 
 	/*
 	 * Be very careful.  We have no exclusion from __set_page_dirty_buffers
@@ -2000,7 +2007,7 @@ int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
 	BUG_ON(to > PAGE_SIZE);
 	BUG_ON(from > to);
 
-	head = create_page_buffers(&folio->page, inode, 0);
+	head = create_page_buffers(&folio->page, inode, 0, flags);
 	blocksize = head->b_size;
 	bbits = block_size_bits(blocksize);
 
@@ -2127,12 +2134,17 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
 	pgoff_t index = pos >> PAGE_SHIFT;
 	struct page *page;
 	int status;
+	gfp_t gfp = 0;
+	bool no_wait = (flags & AOP_FLAG_NOWAIT);
+
+	if (no_wait)
+		gfp = GFP_ATOMIC | __GFP_NOWARN;
 
 	page = grab_cache_page_write_begin(mapping, index, flags);
 	if (!page)
 		return -ENOMEM;
 
-	status = __block_write_begin_int(page_folio(page), pos, len, get_block, NULL, flags);
+	status = __block_write_begin_int(page_folio(page), pos, len, get_block, NULL, gfp);
 	if (unlikely(status)) {
 		unlock_page(page);
 		put_page(page);
@@ -2280,7 +2292,7 @@ int block_read_full_page(struct page *page, get_block_t *get_block)
 	int nr, i;
 	int fully_mapped = 1;
 
-	head = create_page_buffers(page, inode, 0);
+	head = create_page_buffers(page, inode, 0, __GFP_NOFAIL);
 	blocksize = head->b_size;
 	bbits = block_size_bits(blocksize);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 06/13] fs: Add gfp_t parameter to create_page_buffers()
  2022-02-18 19:57 ` [PATCH v2 06/13] fs: Add gfp_t parameter to create_page_buffers() Stefan Roesch
@ 2022-02-21  0:18   ` kernel test robot
  0 siblings, 0 replies; 32+ messages in thread
From: kernel test robot @ 2022-02-21  0:18 UTC (permalink / raw)
  To: Stefan Roesch, io-uring, linux-fsdevel, linux-block, kernel-team
  Cc: kbuild-all, shr

Hi Stefan,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on 9195e5e0adbb8a9a5ee9ef0f9dedf6340d827405]

url:    https://github.com/0day-ci/linux/commits/Stefan-Roesch/Support-sync-buffered-writes-for-io-uring/20220220-172629
base:   9195e5e0adbb8a9a5ee9ef0f9dedf6340d827405
config: sparc64-randconfig-s031-20220220 (https://download.01.org/0day-ci/archive/20220221/[email protected]/config)
compiler: sparc64-linux-gcc (GCC) 11.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/0day-ci/linux/commit/e98a7c2a17960f81efc5968cbc386af7c088a8ed
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Stefan-Roesch/Support-sync-buffered-writes-for-io-uring/20220220-172629
        git checkout e98a7c2a17960f81efc5968cbc386af7c088a8ed
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=sparc64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>


sparse warnings: (new ones prefixed by >>)
>> fs/buffer.c:2010:60: sparse: sparse: incorrect type in argument 4 (different base types) @@     expected restricted gfp_t [usertype] flags @@     got unsigned int flags @@
   fs/buffer.c:2010:60: sparse:     expected restricted gfp_t [usertype] flags
   fs/buffer.c:2010:60: sparse:     got unsigned int flags
>> fs/buffer.c:2147:87: sparse: sparse: incorrect type in argument 6 (different base types) @@     expected unsigned int flags @@     got restricted gfp_t [assigned] [usertype] gfp @@
   fs/buffer.c:2147:87: sparse:     expected unsigned int flags
   fs/buffer.c:2147:87: sparse:     got restricted gfp_t [assigned] [usertype] gfp

vim +2010 fs/buffer.c

  1991	
  1992	int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
  1993				get_block_t *get_block, const struct iomap *iomap,
  1994				unsigned int flags)
  1995	{
  1996		unsigned from = pos & (PAGE_SIZE - 1);
  1997		unsigned to = from + len;
  1998		struct inode *inode = folio->mapping->host;
  1999		unsigned block_start, block_end;
  2000		sector_t block;
  2001		int err = 0;
  2002		unsigned blocksize, bbits;
  2003		struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
  2004	
  2005		BUG_ON(!folio_test_locked(folio));
  2006		BUG_ON(from > PAGE_SIZE);
  2007		BUG_ON(to > PAGE_SIZE);
  2008		BUG_ON(from > to);
  2009	
> 2010		head = create_page_buffers(&folio->page, inode, 0, flags);
  2011		blocksize = head->b_size;
  2012		bbits = block_size_bits(blocksize);
  2013	
  2014		block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
  2015	
  2016		for(bh = head, block_start = 0; bh != head || !block_start;
  2017		    block++, block_start=block_end, bh = bh->b_this_page) {
  2018			block_end = block_start + blocksize;
  2019			if (block_end <= from || block_start >= to) {
  2020				if (folio_test_uptodate(folio)) {
  2021					if (!buffer_uptodate(bh))
  2022						set_buffer_uptodate(bh);
  2023				}
  2024				continue;
  2025			}
  2026			if (buffer_new(bh))
  2027				clear_buffer_new(bh);
  2028			if (!buffer_mapped(bh)) {
  2029				WARN_ON(bh->b_size != blocksize);
  2030				if (get_block) {
  2031					err = get_block(inode, block, bh, 1);
  2032					if (err)
  2033						break;
  2034				} else {
  2035					iomap_to_bh(inode, block, bh, iomap);
  2036				}
  2037	
  2038				if (buffer_new(bh)) {
  2039					clean_bdev_bh_alias(bh);
  2040					if (folio_test_uptodate(folio)) {
  2041						clear_buffer_new(bh);
  2042						set_buffer_uptodate(bh);
  2043						mark_buffer_dirty(bh);
  2044						continue;
  2045					}
  2046					if (block_end > to || block_start < from)
  2047						folio_zero_segments(folio,
  2048							to, block_end,
  2049							block_start, from);
  2050					continue;
  2051				}
  2052			}
  2053			if (folio_test_uptodate(folio)) {
  2054				if (!buffer_uptodate(bh))
  2055					set_buffer_uptodate(bh);
  2056				continue; 
  2057			}
  2058			if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
  2059			    !buffer_unwritten(bh) &&
  2060			     (block_start < from || block_end > to)) {
  2061				ll_rw_block(REQ_OP_READ, 0, 1, &bh);
  2062				*wait_bh++=bh;
  2063			}
  2064		}
  2065		/*
  2066		 * If we issued read requests - let them complete.
  2067		 */
  2068		while(wait_bh > wait) {
  2069			wait_on_buffer(*--wait_bh);
  2070			if (!buffer_uptodate(*wait_bh))
  2071				err = -EIO;
  2072		}
  2073		if (unlikely(err))
  2074			page_zero_new_buffers(&folio->page, from, to);
  2075		return err;
  2076	}
  2077	
  2078	int __block_write_begin(struct page *page, loff_t pos, unsigned len,
  2079			get_block_t *get_block)
  2080	{
  2081		return __block_write_begin_int(page_folio(page), pos, len, get_block,
  2082					       NULL, 0);
  2083	}
  2084	EXPORT_SYMBOL(__block_write_begin);
  2085	
  2086	static int __block_commit_write(struct inode *inode, struct page *page,
  2087			unsigned from, unsigned to)
  2088	{
  2089		unsigned block_start, block_end;
  2090		int partial = 0;
  2091		unsigned blocksize;
  2092		struct buffer_head *bh, *head;
  2093	
  2094		bh = head = page_buffers(page);
  2095		blocksize = bh->b_size;
  2096	
  2097		block_start = 0;
  2098		do {
  2099			block_end = block_start + blocksize;
  2100			if (block_end <= from || block_start >= to) {
  2101				if (!buffer_uptodate(bh))
  2102					partial = 1;
  2103			} else {
  2104				set_buffer_uptodate(bh);
  2105				mark_buffer_dirty(bh);
  2106			}
  2107			if (buffer_new(bh))
  2108				clear_buffer_new(bh);
  2109	
  2110			block_start = block_end;
  2111			bh = bh->b_this_page;
  2112		} while (bh != head);
  2113	
  2114		/*
  2115		 * If this is a partial write which happened to make all buffers
  2116		 * uptodate then we can optimize away a bogus readpage() for
  2117		 * the next read(). Here we 'discover' whether the page went
  2118		 * uptodate as a result of this (potentially partial) write.
  2119		 */
  2120		if (!partial)
  2121			SetPageUptodate(page);
  2122		return 0;
  2123	}
  2124	
  2125	/*
  2126	 * block_write_begin takes care of the basic task of block allocation and
  2127	 * bringing partial write blocks uptodate first.
  2128	 *
  2129	 * The filesystem needs to handle block truncation upon failure.
  2130	 */
  2131	int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
  2132			unsigned flags, struct page **pagep, get_block_t *get_block)
  2133	{
  2134		pgoff_t index = pos >> PAGE_SHIFT;
  2135		struct page *page;
  2136		int status;
  2137		gfp_t gfp = 0;
  2138		bool no_wait = (flags & AOP_FLAG_NOWAIT);
  2139	
  2140		if (no_wait)
  2141			gfp = GFP_ATOMIC | __GFP_NOWARN;
  2142	
  2143		page = grab_cache_page_write_begin(mapping, index, flags);
  2144		if (!page)
  2145			return -ENOMEM;
  2146	
> 2147		status = __block_write_begin_int(page_folio(page), pos, len, get_block, NULL, gfp);
  2148		if (unlikely(status)) {
  2149			unlock_page(page);
  2150			put_page(page);
  2151			page = NULL;
  2152		}
  2153	
  2154		*pagep = page;
  2155		return status;
  2156	}
  2157	EXPORT_SYMBOL(block_write_begin);
  2158	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 07/13] fs: add support for async buffered writes
  2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
                   ` (5 preceding siblings ...)
  2022-02-18 19:57 ` [PATCH v2 06/13] fs: Add gfp_t parameter to create_page_buffers() Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
  2022-02-18 19:57 ` [PATCH v2 08/13] io_uring: " Stefan Roesch
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
  To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr

This adds support for the AOP_FLAGS_BUF_WASYNC flag to the fs layer. If
a page that is required for writing is not in the page cache, it returns
EAGAIN instead of ENOMEM.

Signed-off-by: Stefan Roesch <[email protected]>
---
 fs/buffer.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index ae588ae4b1c1..58331ef214b9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2062,6 +2062,7 @@ int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
 			*wait_bh++=bh;
 		}
 	}
+
 	/*
 	 * If we issued read requests - let them complete.
 	 */
@@ -2141,8 +2142,11 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
 		gfp = GFP_ATOMIC | __GFP_NOWARN;
 
 	page = grab_cache_page_write_begin(mapping, index, flags);
-	if (!page)
+	if (!page) {
+		if (no_wait)
+			return -EAGAIN;
 		return -ENOMEM;
+	}
 
 	status = __block_write_begin_int(page_folio(page), pos, len, get_block, NULL, gfp);
 	if (unlikely(status)) {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 08/13] io_uring: add support for async buffered writes
  2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
                   ` (6 preceding siblings ...)
  2022-02-18 19:57 ` [PATCH v2 07/13] fs: add support for async buffered writes Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
  2022-02-18 19:57 ` [PATCH v2 09/13] io_uring: Add tracepoint for short writes Stefan Roesch
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
  To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr

This enables the async buffered writes for block devices in io_uring.
Buffered writes are enabled for blocks that are already in the page
cache or can be acquired with noio.

It is possible that a write request cannot be completely fullfilled
(short write). In that case the request is punted and sent to the io
workers to be completed. Before submitting the request to the io
workers, the request is updated with how much has already been written.

Signed-off-by: Stefan Roesch <[email protected]>
---
 fs/io_uring.c | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 77b9c7e4793b..52bd88908afd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3615,7 +3615,7 @@ static inline int io_iter_do_read(struct io_kiocb *req, struct iov_iter *iter)
 		return -EINVAL;
 }
 
-static bool need_read_all(struct io_kiocb *req)
+static bool need_complete_io(struct io_kiocb *req)
 {
 	return req->flags & REQ_F_ISREG ||
 		S_ISBLK(file_inode(req->file)->i_mode);
@@ -3679,7 +3679,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	} else if (ret == -EIOCBQUEUED) {
 		goto out_free;
 	} else if (ret == req->result || ret <= 0 || !force_nonblock ||
-		   (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) {
+		   (req->flags & REQ_F_NOWAIT) || !need_complete_io(req)) {
 		/* read all, failed, already did sync or don't want to retry */
 		goto done;
 	}
@@ -3777,9 +3777,10 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		if (unlikely(!io_file_supports_nowait(req)))
 			goto copy_iov;
 
-		/* file path doesn't support NOWAIT for non-direct_IO */
-		if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT) &&
-		    (req->flags & REQ_F_ISREG))
+		/* File path supports NOWAIT for non-direct_IO only for block devices. */
+		if (!(kiocb->ki_flags & IOCB_DIRECT) &&
+			!(kiocb->ki_filp->f_mode & FMODE_BUF_WASYNC) &&
+			(req->flags & REQ_F_ISREG))
 			goto copy_iov;
 
 		kiocb->ki_flags |= IOCB_NOWAIT;
@@ -3831,6 +3832,24 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		/* IOPOLL retry should happen for io-wq threads */
 		if (ret2 == -EAGAIN && (req->ctx->flags & IORING_SETUP_IOPOLL))
 			goto copy_iov;
+
+		if (ret2 != req->result && ret2 >= 0 && need_complete_io(req)) {
+			struct io_async_rw *rw;
+
+			/* This is a partial write. The file pos has already been
+			 * updated, setup the async struct to complete the request
+			 * in the worker. Also update bytes_done to account for
+			 * the bytes already written.
+			 */
+			iov_iter_save_state(&s->iter, &s->iter_state);
+			ret = io_setup_async_rw(req, iovec, s, true);
+
+			rw = req->async_data;
+			if (rw)
+				rw->bytes_done += ret2;
+
+			return ret ? ret : -EAGAIN;
+		}
 done:
 		kiocb_done(req, ret2, issue_flags);
 	} else {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 09/13] io_uring: Add tracepoint for short writes
  2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
                   ` (7 preceding siblings ...)
  2022-02-18 19:57 ` [PATCH v2 08/13] io_uring: " Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
  2022-02-18 19:57 ` [PATCH v2 10/13] sched: add new fields to task_struct Stefan Roesch
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
  To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr

This adds the io_uring_short_write tracepoint to io_uring. A short write
is issued if not all pages that are required for a write are in the page
cache and the async buffered writes have to return EAGAIN.

Signed-off-by: Stefan Roesch <[email protected]>
---
 fs/io_uring.c                   |  3 +++
 include/trace/events/io_uring.h | 25 +++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 52bd88908afd..792ca4b6834d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3836,6 +3836,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		if (ret2 != req->result && ret2 >= 0 && need_complete_io(req)) {
 			struct io_async_rw *rw;
 
+			trace_io_uring_short_write(req->ctx, kiocb->ki_pos - ret2,
+						req->result, ret2);
+
 			/* This is a partial write. The file pos has already been
 			 * updated, setup the async struct to complete the request
 			 * in the worker. Also update bytes_done to account for
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 7346f0164cf4..ce1cfdf4b015 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -558,6 +558,31 @@ TRACE_EVENT(io_uring_req_failed,
 		  (unsigned long long) __entry->pad2, __entry->error)
 );
 
+TRACE_EVENT(io_uring_short_write,
+
+	TP_PROTO(void *ctx, u64 fpos, u64 wanted, u64 got),
+
+	TP_ARGS(ctx, fpos, wanted, got),
+
+	TP_STRUCT__entry(
+		__field(void *,	ctx)
+		__field(u64,	fpos)
+		__field(u64,	wanted)
+		__field(u64,	got)
+	),
+
+	TP_fast_assign(
+		__entry->ctx	= ctx;
+		__entry->fpos	= fpos;
+		__entry->wanted	= wanted;
+		__entry->got	= got;
+	),
+
+	TP_printk("ring %p, fpos %lld, wanted %lld, got %lld",
+			  __entry->ctx, __entry->fpos,
+			  __entry->wanted, __entry->got)
+);
+
 #endif /* _TRACE_IO_URING_H */
 
 /* This part must be outside protection */
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 10/13] sched: add new fields to task_struct
  2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
                   ` (8 preceding siblings ...)
  2022-02-18 19:57 ` [PATCH v2 09/13] io_uring: Add tracepoint for short writes Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
  2022-02-18 19:57 ` [PATCH v2 11/13] mm: support write throttling for async buffered writes Stefan Roesch
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
  To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr

Add two new fields to the task_struct to support async
write throttling.

  - One field to store how long writes are throttled: bdp_pause
  - The other field to store the number of dirtied pages:
    bdp_nr_dirtied_pause

Signed-off-by: Stefan Roesch <[email protected]>
---
 include/linux/sched.h | 3 +++
 kernel/fork.c         | 1 +
 2 files changed, 4 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75ba8aa60248..97146b7539c5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1324,6 +1324,9 @@ struct task_struct {
 	/* Start of a write-and-pause period: */
 	unsigned long			dirty_paused_when;
 
+	unsigned long			bdp_pause;
+	int				bdp_nr_dirtied_pause;
+
 #ifdef CONFIG_LATENCYTOP
 	int				latency_record_count;
 	struct latency_record		latency_record[LT_SAVECOUNT];
diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528f7b21..d34c9c00baea 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2246,6 +2246,7 @@ static __latent_entropy struct task_struct *copy_process(
 	p->nr_dirtied = 0;
 	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
 	p->dirty_paused_when = 0;
+	p->bdp_nr_dirtied_pause = -1;
 
 	p->pdeath_signal = 0;
 	INIT_LIST_HEAD(&p->thread_group);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 11/13] mm: support write throttling for async buffered writes
  2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
                   ` (9 preceding siblings ...)
  2022-02-18 19:57 ` [PATCH v2 10/13] sched: add new fields to task_struct Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
  2022-02-18 19:57 ` [PATCH v2 12/13] io_uring: " Stefan Roesch
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
  To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr

This change adds support for async write throttling in the function
balance_dirty_pages(). So far if throttling was required, the code
was waiting synchronously as long as the writes were throttled. This
change introduces asynchronous throttling. Instead of waiting in the
function balance_dirty_pages(), the timeout is set in the task_struct
field bdp_pause. Once the timeout has expired, the writes are no
longer throttled.

- Add a new parameter to the balance_dirty_pages() function
  - This allows the caller to pass in the nowait flag
  - When the nowait flag is specified, the code does not wait in
    balance_dirty_pages(), but instead stores the wait expiration in the
    new task_struct field bdp_pause.

- The function balance_dirty_pages_ratelimited() resets the new values
  in the task_struct, once the timeout has expired

This change is required to support write throttling for the async
buffered writes. While the writes are throttled, io_uring still can make
progress with processing other requests.

Signed-off-by: Stefan Roesch <[email protected]>
---
 include/linux/writeback.h |  1 +
 mm/filemap.c              |  2 +-
 mm/page-writeback.c       | 54 ++++++++++++++++++++++++++++-----------
 3 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index fec248ab1fec..48176a8047db 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -373,6 +373,7 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
 
 void wb_update_bandwidth(struct bdi_writeback *wb);
 void balance_dirty_pages_ratelimited(struct address_space *mapping);
+void  balance_dirty_pages_ratelimited_flags(struct address_space *mapping, bool is_async);
 bool wb_over_bg_thresh(struct bdi_writeback *wb);
 
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
diff --git a/mm/filemap.c b/mm/filemap.c
index f4e2036c5029..642a4e814869 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3796,7 +3796,7 @@ static ssize_t do_generic_perform_write(struct file *file, struct iov_iter *i,
 		pos += status;
 		written += status;
 
-		balance_dirty_pages_ratelimited(mapping);
+		balance_dirty_pages_ratelimited_flags(mapping, flags & AOP_FLAG_NOWAIT);
 	} while (iov_iter_count(i));
 
 	return written ? written : status;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 91d163f8d36b..767d0b997da5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1558,7 +1558,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
  * perform some writeout.
  */
 static void balance_dirty_pages(struct bdi_writeback *wb,
-				unsigned long pages_dirtied)
+				unsigned long pages_dirtied, bool is_async)
 {
 	struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
 	struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) };
@@ -1792,6 +1792,14 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
 					  period,
 					  pause,
 					  start_time);
+		if (is_async) {
+			if (current->bdp_nr_dirtied_pause == -1) {
+				current->bdp_pause = now + pause;
+				current->bdp_nr_dirtied_pause = nr_dirtied_pause;
+			}
+			break;
+		}
+
 		__set_current_state(TASK_KILLABLE);
 		wb->dirty_sleep = now;
 		io_schedule_timeout(pause);
@@ -1799,6 +1807,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
 		current->dirty_paused_when = now + pause;
 		current->nr_dirtied = 0;
 		current->nr_dirtied_pause = nr_dirtied_pause;
+		current->bdp_nr_dirtied_pause = -1;
+		current->bdp_pause = 0;
 
 		/*
 		 * This is typically equal to (dirty < thresh) and can also
@@ -1863,19 +1873,7 @@ static DEFINE_PER_CPU(int, bdp_ratelimits);
  */
 DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
 
-/**
- * balance_dirty_pages_ratelimited - balance dirty memory state
- * @mapping: address_space which was dirtied
- *
- * Processes which are dirtying memory should call in here once for each page
- * which was newly dirtied.  The function will periodically check the system's
- * dirty state and will initiate writeback if needed.
- *
- * Once we're over the dirty memory limit we decrease the ratelimiting
- * by a lot, to prevent individual processes from overshooting the limit
- * by (ratelimit_pages) each.
- */
-void balance_dirty_pages_ratelimited(struct address_space *mapping)
+void balance_dirty_pages_ratelimited_flags(struct address_space *mapping, bool is_async)
 {
 	struct inode *inode = mapping->host;
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
@@ -1886,6 +1884,15 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 	if (!(bdi->capabilities & BDI_CAP_WRITEBACK))
 		return;
 
+	if (current->bdp_nr_dirtied_pause != -1 && time_after(jiffies, current->bdp_pause)) {
+		current->dirty_paused_when = current->bdp_pause;
+		current->nr_dirtied = 0;
+		current->nr_dirtied_pause = current->bdp_nr_dirtied_pause;
+
+		current->bdp_nr_dirtied_pause = -1;
+		current->bdp_pause = 0;
+	}
+
 	if (inode_cgwb_enabled(inode))
 		wb = wb_get_create_current(bdi, GFP_KERNEL);
 	if (!wb)
@@ -1924,10 +1931,27 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 	preempt_enable();
 
 	if (unlikely(current->nr_dirtied >= ratelimit))
-		balance_dirty_pages(wb, current->nr_dirtied);
+		balance_dirty_pages(wb, current->nr_dirtied, is_async);
 
 	wb_put(wb);
 }
+
+/**
+ * balance_dirty_pages_ratelimited - balance dirty memory state
+ * @mapping: address_space which was dirtied
+ *
+ * Processes which are dirtying memory should call in here once for each page
+ * which was newly dirtied.  The function will periodically check the system's
+ * dirty state and will initiate writeback if needed.
+ *
+ * Once we're over the dirty memory limit we decrease the ratelimiting
+ * by a lot, to prevent individual processes from overshooting the limit
+ * by (ratelimit_pages) each.
+ */
+void balance_dirty_pages_ratelimited(struct address_space *mapping)
+{
+	balance_dirty_pages_ratelimited_flags(mapping, false);
+}
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
 
 /**
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 12/13] io_uring: support write throttling for async buffered writes
  2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
                   ` (10 preceding siblings ...)
  2022-02-18 19:57 ` [PATCH v2 11/13] mm: support write throttling for async buffered writes Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
  2022-02-18 19:57 ` [PATCH v2 13/13] block: enable async buffered writes for block devices Stefan Roesch
  2022-02-20 22:38 ` [PATCH v2 00/13] Support sync buffered writes for io-uring Dave Chinner
  13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
  To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr

This adds the process-level throttling for the block layer for async
buffered writes to io-uring.In io_write the code now checks if the write
needs to be throttled. If this is required, it adds the request to the
list of pending io requests and starts a timer. After the timer expires,
it submits the list of pending writes.

- Add new list called pending_ios for delayed writes (throttled writes)
  to struct io_uring_task. The list is protected by the task_lock spin
  lock.
- Add new timer to struct io_uring_task.

Signed-off-by: Stefan Roesch <[email protected]>
---
 fs/io_uring.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 91 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 792ca4b6834d..8a48e5ee4e5e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -461,6 +461,11 @@ struct io_ring_ctx {
 	};
 };
 
+struct pending_list {
+	struct list_head list;
+	struct io_kiocb *req;
+};
+
 struct io_uring_task {
 	/* submission side */
 	int			cached_refs;
@@ -477,6 +482,9 @@ struct io_uring_task {
 	struct io_wq_work_list	prior_task_list;
 	struct callback_head	task_work;
 	bool			task_running;
+
+	struct pending_list	pending_ios;
+	struct timer_list	timer;
 };
 
 /*
@@ -1134,13 +1142,14 @@ static void io_rsrc_put_work(struct work_struct *work);
 
 static void io_req_task_queue(struct io_kiocb *req);
 static void __io_submit_flush_completions(struct io_ring_ctx *ctx);
-static int io_req_prep_async(struct io_kiocb *req);
+static int io_req_prep_async(struct io_kiocb *req, bool force);
 
 static int io_install_fixed_file(struct io_kiocb *req, struct file *file,
 				 unsigned int issue_flags, u32 slot_index);
 static int io_close_fixed(struct io_kiocb *req, unsigned int issue_flags);
 
 static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer);
+static void delayed_write_fn(struct timer_list *tmr);
 
 static struct kmem_cache *req_cachep;
 
@@ -2462,6 +2471,31 @@ static void io_req_task_queue_reissue(struct io_kiocb *req)
 	io_req_task_work_add(req, false);
 }
 
+static int io_req_task_queue_reissue_delayed(struct io_kiocb *req)
+{
+	struct io_uring_task *tctx = req->task->io_uring;
+	struct pending_list *pending = kmalloc(sizeof(struct pending_list), GFP_KERNEL);
+	bool empty;
+
+	if (!pending)
+		return -ENOMEM;
+	pending->req = req;
+
+	spin_lock_irq(&tctx->task_lock);
+	empty = list_empty(&tctx->pending_ios.list);
+	list_add_tail(&pending->list, &tctx->pending_ios.list);
+
+	if (empty) {
+		timer_setup(&tctx->timer, delayed_write_fn, 0);
+
+		tctx->timer.expires = current->bdp_pause;
+		add_timer(&tctx->timer);
+	}
+	spin_unlock_irq(&tctx->task_lock);
+
+	return 0;
+}
+
 static inline void io_queue_next(struct io_kiocb *req)
 {
 	struct io_kiocb *nxt = io_req_find_next(req);
@@ -2770,7 +2804,7 @@ static bool io_resubmit_prep(struct io_kiocb *req)
 	struct io_async_rw *rw = req->async_data;
 
 	if (!req_has_async_data(req))
-		return !io_req_prep_async(req);
+		return !io_req_prep_async(req, false);
 	iov_iter_restore(&rw->s.iter, &rw->s.iter_state);
 	return true;
 }
@@ -3751,6 +3785,38 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	return io_prep_rw(req, sqe);
 }
 
+static inline unsigned long write_delay(void)
+{
+	if (likely(current->bdp_nr_dirtied_pause == -1 ||
+			!time_before(jiffies, current->bdp_pause)))
+		return 0;
+
+	return current->bdp_pause;
+}
+
+static void delayed_write_fn(struct timer_list *tmr)
+{
+	struct io_uring_task *tctx = from_timer(tctx, tmr, timer);
+	struct list_head *curr;
+	struct list_head *next;
+	LIST_HEAD(pending_ios);
+
+	/* Move list to temporary list. */
+	spin_lock_irq(&tctx->task_lock);
+	list_splice_init(&tctx->pending_ios.list, &pending_ios);
+	spin_unlock_irq(&tctx->task_lock);
+
+	list_for_each_safe(curr, next, &pending_ios) {
+		struct pending_list *io;
+
+		io = list_entry(curr, struct pending_list, list);
+		io_req_task_queue_reissue(io->req);
+
+		list_del(curr);
+		kfree(io);
+	}
+}
+
 static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_rw_state __s, *s = &__s;
@@ -3759,6 +3825,18 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 	ssize_t ret, ret2;
 
+	/* Write throttling active? */
+	if (unlikely(write_delay()) && !(kiocb->ki_flags & IOCB_DIRECT)) {
+		int ret = io_req_prep_async(req, true);
+
+		if (unlikely(ret))
+			io_req_complete_failed(req, ret);
+		else
+			ret = io_req_task_queue_reissue_delayed(req);
+
+		return ret;
+	}
+
 	if (!req_has_async_data(req)) {
 		ret = io_import_iovec(WRITE, req, &iovec, s, issue_flags);
 		if (unlikely(ret < 0))
@@ -6596,9 +6674,9 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	return -EINVAL;
 }
 
-static int io_req_prep_async(struct io_kiocb *req)
+static int io_req_prep_async(struct io_kiocb *req, bool force)
 {
-	if (!io_op_defs[req->opcode].needs_async_setup)
+	if (!force && !io_op_defs[req->opcode].needs_async_setup)
 		return 0;
 	if (WARN_ON_ONCE(req_has_async_data(req)))
 		return -EFAULT;
@@ -6608,6 +6686,10 @@ static int io_req_prep_async(struct io_kiocb *req)
 	switch (req->opcode) {
 	case IORING_OP_READV:
 		return io_rw_prep_async(req, READ);
+	case IORING_OP_WRITE:
+		if (!force)
+			break;
+		fallthrough;
 	case IORING_OP_WRITEV:
 		return io_rw_prep_async(req, WRITE);
 	case IORING_OP_SENDMSG:
@@ -6617,6 +6699,7 @@ static int io_req_prep_async(struct io_kiocb *req)
 	case IORING_OP_CONNECT:
 		return io_connect_prep_async(req);
 	}
+
 	printk_once(KERN_WARNING "io_uring: prep_async() bad opcode %d\n",
 		    req->opcode);
 	return -EFAULT;
@@ -6650,7 +6733,7 @@ static __cold void io_drain_req(struct io_kiocb *req)
 	}
 	spin_unlock(&ctx->completion_lock);
 
-	ret = io_req_prep_async(req);
+	ret = io_req_prep_async(req, false);
 	if (ret) {
 fail:
 		io_req_complete_failed(req, ret);
@@ -7145,7 +7228,7 @@ static void io_queue_sqe_fallback(struct io_kiocb *req)
 	} else if (unlikely(req->ctx->drain_active)) {
 		io_drain_req(req);
 	} else {
-		int ret = io_req_prep_async(req);
+		int ret = io_req_prep_async(req, false);
 
 		if (unlikely(ret))
 			io_req_complete_failed(req, ret);
@@ -7344,7 +7427,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		struct io_kiocb *head = link->head;
 
 		if (!(req->flags & REQ_F_FAIL)) {
-			ret = io_req_prep_async(req);
+			ret = io_req_prep_async(req, false);
 			if (unlikely(ret)) {
 				req_fail_link_node(req, ret);
 				if (!(head->flags & REQ_F_FAIL))
@@ -8784,6 +8867,7 @@ static __cold int io_uring_alloc_task_context(struct task_struct *task,
 	INIT_WQ_LIST(&tctx->task_list);
 	INIT_WQ_LIST(&tctx->prior_task_list);
 	init_task_work(&tctx->task_work, tctx_task_work);
+	INIT_LIST_HEAD(&tctx->pending_ios.list);
 	return 0;
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 13/13] block: enable async buffered writes for block devices.
  2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
                   ` (11 preceding siblings ...)
  2022-02-18 19:57 ` [PATCH v2 12/13] io_uring: " Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
  2022-02-20 22:38 ` [PATCH v2 00/13] Support sync buffered writes for io-uring Dave Chinner
  13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
  To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr

This introduces the flag FMODE_BUF_WASYNC. If devices support async
buffered writes, this flag can be set. It also enables async buffered
writes for block devices.

Signed-off-by: Stefan Roesch <[email protected]>
---
 block/fops.c       | 5 +----
 fs/read_write.c    | 3 ++-
 include/linux/fs.h | 3 +++
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 4f59e0f5bf30..75b36f8b5e71 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -489,7 +489,7 @@ static int blkdev_open(struct inode *inode, struct file *filp)
 	 * during an unstable branch.
 	 */
 	filp->f_flags |= O_LARGEFILE;
-	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
+	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC;
 
 	if (filp->f_flags & O_NDELAY)
 		filp->f_mode |= FMODE_NDELAY;
@@ -544,9 +544,6 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_pos >= size)
 		return -ENOSPC;
 
-	if ((iocb->ki_flags & (IOCB_NOWAIT | IOCB_DIRECT)) == IOCB_NOWAIT)
-		return -EOPNOTSUPP;
-
 	size -= iocb->ki_pos;
 	if (iov_iter_count(from) > size) {
 		shorted = iov_iter_count(from) - size;
diff --git a/fs/read_write.c b/fs/read_write.c
index 0074afa7ecb3..58233844a9d8 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1641,7 +1641,8 @@ ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_flags & IOCB_APPEND)
 		iocb->ki_pos = i_size_read(inode);
 
-	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+	if ((iocb->ki_flags & IOCB_NOWAIT) &&
+		(!(iocb->ki_flags & IOCB_DIRECT) && !(file->f_mode & FMODE_BUF_WASYNC)))
 		return -EINVAL;
 
 	count = iov_iter_count(from);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b7dd5bd701c0..a526de71b932 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -176,6 +176,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File supports async buffered reads */
 #define FMODE_BUF_RASYNC	((__force fmode_t)0x40000000)
 
+/* File supports async nowait buffered writes */
+#define FMODE_BUF_WASYNC	((__force fmode_t)0x80000000)
+
 /*
  * Attribute flags.  These should be or-ed together to figure out what
  * has been changed!
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/13] Support sync buffered writes for io-uring
  2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
                   ` (12 preceding siblings ...)
  2022-02-18 19:57 ` [PATCH v2 13/13] block: enable async buffered writes for block devices Stefan Roesch
@ 2022-02-20 22:38 ` Dave Chinner
  13 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2022-02-20 22:38 UTC (permalink / raw)
  To: Stefan Roesch; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team

On Fri, Feb 18, 2022 at 11:57:26AM -0800, Stefan Roesch wrote:
> This patch series adds support for async buffered writes. Currently
> io-uring only supports buffered writes in the slow path, by processing
> them in the io workers. With this patch series it is now possible to
> support buffered writes in the fast path. To be able to use the fast
> path the required pages must be in the page cache or they can be loaded
> with noio. Otherwise they still get punted to the slow path.

Where's the filesystem support? You need to plumb in ext4 to this
bufferhead support, and add iomap/xfs support as well so we can
shake out all the problems with APIs and fallback paths that are
needed for full support of buffered writes via io_uring.

> If a buffered write request requires more than one page, it is possible
> that only part of the request can use the fast path, the resst will be
> completed by the io workers.

That's ugly, especially at the filesystem/iomap layer where we are
doing delayed allocation and so partial writes like this could have
significant extra impact. It opens up the possibility of things like
ENOSPC/EDQUOT mid-way through the write instead of being an up-front
error, and so there's lots more complexity in the failure/fallback
paths that the io_uring infrastructure will have to handle
correctly...

Also, it breaks the "atomic buffered write" design of iomap/XFS
where other readers and writers will only see whole completed writes
and not intermediate partial writes. This is where a lot of the bugs
in the DIO io_uring support were found (deadlocks, data corruptions,
etc), so there's a bunch of semantic and API issues that filesystems
require from io_uring that need to be sorted out before we think
about merge buffered write support...

Cheers,

Dave.
-- 
Dave Chinner
[email protected]

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2022-02-22 23:19 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int Stefan Roesch
2022-02-18 19:59   ` Matthew Wilcox
2022-02-18 20:08     ` Stefan Roesch
2022-02-18 20:13       ` Matthew Wilcox
2022-02-18 20:14         ` Stefan Roesch
2022-02-18 20:22           ` Matthew Wilcox
2022-02-18 20:25             ` Stefan Roesch
2022-02-18 20:35               ` Matthew Wilcox
2022-02-18 20:39                 ` Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 02/13] mm: Introduce do_generic_perform_write Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 03/13] mm: Add support for async buffered writes Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 04/13] fs: split off __alloc_page_buffers function Stefan Roesch
2022-02-18 20:42   ` Matthew Wilcox
2022-02-18 20:50     ` Stefan Roesch
2022-02-19  7:35   ` Christoph Hellwig
2022-02-20  4:23     ` Matthew Wilcox
2022-02-20  4:38       ` Jens Axboe
2022-02-20  4:51         ` Jens Axboe
2022-02-22  8:18       ` Christoph Hellwig
2022-02-22 23:19         ` Jens Axboe
2022-02-18 19:57 ` [PATCH v2 05/13] fs: split off __create_empty_buffers function Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 06/13] fs: Add gfp_t parameter to create_page_buffers() Stefan Roesch
2022-02-21  0:18   ` kernel test robot
2022-02-18 19:57 ` [PATCH v2 07/13] fs: add support for async buffered writes Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 08/13] io_uring: " Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 09/13] io_uring: Add tracepoint for short writes Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 10/13] sched: add new fields to task_struct Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 11/13] mm: support write throttling for async buffered writes Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 12/13] io_uring: " Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 13/13] block: enable async buffered writes for block devices Stefan Roesch
2022-02-20 22:38 ` [PATCH v2 00/13] Support sync buffered writes for io-uring Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox