* [PATCH v2 00/13] Support sync buffered writes for io-uring
@ 2022-02-18 19:57 Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int Stefan Roesch
` (13 more replies)
0 siblings, 14 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr
This patch series adds support for async buffered writes. Currently
io-uring only supports buffered writes in the slow path, by processing
them in the io workers. With this patch series it is now possible to
support buffered writes in the fast path. To use the fast path, the
required pages must already be in the page cache or must be obtainable
without blocking (noio). Otherwise the request is still punted to the
slow path.
If a buffered write request requires more than one page, it is possible
that only part of the request can use the fast path; the rest will be
completed by the io workers.
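For context, below is a minimal user-space sketch of the kind of request
this series targets: a plain buffered (non-O_DIRECT) write submitted
through io_uring. It is not part of this series, it uses liburing, and the
file name and error handling are illustrative only; the series enables the
inline path for block devices, a regular file is used here just to keep
the example self-contained.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	static char buf[4096];
	int fd, ret;

	memset(buf, 'a', sizeof(buf));

	/* Buffered write: note there is no O_DIRECT here. */
	fd = open("/tmp/io_uring_bufwrite.dat", O_WRONLY | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %s\n", strerror(-ret));
		return 1;
	}

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
	io_uring_submit(&ring);

	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret == 0) {
		/* cqe->res is the number of bytes written or -errno. */
		printf("write result: %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}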
Support for async buffered writes:
Patch 1: fs: Add flags parameter to __block_write_begin_int
Add a flags parameter to the function __block_write_begin_int()
to allow specifying the nowait flag.
Patch 2: mm: Introduce do_generic_perform_write
Introduce a new do_generic_perform_write() function. The function
is split off from the existing generic_perform_write() function
and takes an additional flags parameter, which is used to pass
the nowait flag.
Patch 3: mm: Add support for async buffered writes
For async buffered writes allocate pages without blocking on the
allocation.
Patch 4: fs: split off __alloc_page_buffers function
Split off __alloc_page_buffers() function with new gfp_t parameter.
Patch 5: fs: split off __create_empty_buffers function
Split off __create_empty_buffers() function with new gfp_t parameter.
Patch 6: fs: Add gfp_t parameter to create_page_buffers()
Add gfp_t parameter to create_page_buffers() function. Use atomic
allocation for async buffered writes.
Patch 7: fs: add support for async buffered writes
Return -EAGAIN instead of -ENOMEM for async buffered writes. This
will cause the write request to be processed by an io worker.
Patch 8: io_uring: add support for async buffered writes
This enables the async buffered writes for block devices in io_uring.
Buffered writes are enabled for blocks that are already in the page
cache or can be acquired with noio.
Patch 9: io_uring: Add tracepoint for short writes
Support for write throttling of async buffered writes:
Patch 10: sched: add new fields to task_struct
Add two new fields to the task_struct. These fields store the
deadline after which writes are no longer throttled.
Patch 11: mm: support write throttling for async buffered writes
This changes the balance_dirty_pages() function to take an additional
parameter. When nowait is specified the write throttling code no
longer waits synchronously for the deadline to expire. Instead
it sets the fields in task_struct. Once the deadline expires the
fields are reset.
Patch 12: io_uring: support write throttling for async buffered writes
Adds support to io_uring for write throttling. When the writes
are throttled, the write requests are added to the pending io list.
Once the write throttling deadline expires, the writes are submitted.
Enable async buffered write support:
Patch 13: fs: add flag to support async buffered writes
This sets the flag that enables async buffered writes for block
devices.
Testing:
This patch has been tested with xfstests and fio.
Performance results:
For fio the following results have been obtained with a queue depth of
1 and 4k block size (runtime 600 secs):
sequential writes:
                      without patch    with patch
    throughput:       329 MiB/s        1032 MiB/s
    iops:             82k              264k
    slat (nsec):      2332             3340
    clat (nsec):      9017             60
    CPU util%:        37%              78%
random writes:
                      without patch    with patch
    throughput:       307 MiB/s        909 MiB/s
    iops:             76k              227k
    slat (nsec):      2419             3780
    clat (nsec):      9934             59
    CPU util%:        57%              88%
For an io depth of 1, the new patch improves throughput by close to 3x
and also considerably reduces latency. To achieve the same or better
performance with the existing code an io depth of 4 is required.
Especially for mixed workloads this is a considerable improvement.
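For reference, results of this kind can be gathered with fio invocations
along the following lines (the exact options and the target device are
assumptions, not the actual job file that was used):

fio --name=seqwrite  --ioengine=io_uring --rw=write     --bs=4k --iodepth=1 \
    --direct=0 --time_based --runtime=600 --filename=/dev/<testdev>
fio --name=randwrite --ioengine=io_uring --rw=randwrite --bs=4k --iodepth=1 \
    --direct=0 --time_based --runtime=600 --filename=/dev/<testdev>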
Changes:
V2: - removed patch 3 from patch series 1
- replaced parameter aop_flags with gfp_t in create_page_buffers()
- Moved gfp flags to callers of create_page_buffers()
- Removed changing of FGP_NOWAIT in __filemap_get_folio() and moved gfp
flags to caller of __filemap_get_folio()
- Renamed AOP_FLAGS_NOWAIT to AOP_FLAG_NOWAIT
Stefan Roesch (13):
fs: Add flags parameter to __block_write_begin_int
mm: Introduce do_generic_perform_write
mm: Add support for async buffered writes
fs: split off __alloc_page_buffers function
fs: split off __create_empty_buffers function
fs: Add gfp_t parameter to create_page_buffers()
fs: add support for async buffered writes
io_uring: add support for async buffered writes
io_uring: Add tracepoint for short writes
sched: add new fields to task_struct
mm: support write throttling for async buffered writes
io_uring: support write throttling for async buffered writes
block: enable async buffered writes for block devices.
block/fops.c | 5 +-
fs/buffer.c | 98 +++++++++++++++---------
fs/internal.h | 3 +-
fs/io_uring.c | 130 +++++++++++++++++++++++++++++---
fs/iomap/buffered-io.c | 4 +-
fs/read_write.c | 3 +-
include/linux/fs.h | 4 +
include/linux/sched.h | 3 +
include/linux/writeback.h | 1 +
include/trace/events/io_uring.h | 25 ++++++
kernel/fork.c | 1 +
mm/filemap.c | 23 ++++--
mm/folio-compat.c | 12 ++-
mm/page-writeback.c | 54 +++++++++----
14 files changed, 289 insertions(+), 77 deletions(-)
base-commit: 9195e5e0adbb8a9a5ee9ef0f9dedf6340d827405
--
2.30.2
* [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
2022-02-18 19:59 ` Matthew Wilcox
2022-02-18 19:57 ` [PATCH v2 02/13] mm: Introduce do_generic_perform_write Stefan Roesch
` (12 subsequent siblings)
13 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr
This adds a flags parameter to the __block_write_begin_int() function,
so that flags can be passed down the stack.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/buffer.c | 7 ++++---
fs/internal.h | 3 ++-
fs/iomap/buffered-io.c | 4 ++--
3 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 8e112b6bd371..6e6a69a12eed 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1970,7 +1970,8 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
}
int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
- get_block_t *get_block, const struct iomap *iomap)
+ get_block_t *get_block, const struct iomap *iomap,
+ unsigned int flags)
{
unsigned from = pos & (PAGE_SIZE - 1);
unsigned to = from + len;
@@ -2058,7 +2059,7 @@ int __block_write_begin(struct page *page, loff_t pos, unsigned len,
get_block_t *get_block)
{
return __block_write_begin_int(page_folio(page), pos, len, get_block,
- NULL);
+ NULL, 0);
}
EXPORT_SYMBOL(__block_write_begin);
@@ -2118,7 +2119,7 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
if (!page)
return -ENOMEM;
- status = __block_write_begin(page, pos, len, get_block);
+ status = __block_write_begin_int(page_folio(page), pos, len, get_block, NULL, flags);
if (unlikely(status)) {
unlock_page(page);
put_page(page);
diff --git a/fs/internal.h b/fs/internal.h
index 8590c973c2f4..7432df23f3ce 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -38,7 +38,8 @@ static inline int emergency_thaw_bdev(struct super_block *sb)
* buffer.c
*/
int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
- get_block_t *get_block, const struct iomap *iomap);
+ get_block_t *get_block, const struct iomap *iomap,
+ unsigned int flags);
/*
* char_dev.c
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6c51a75d0be6..47c519952725 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -646,7 +646,7 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
if (srcmap->type == IOMAP_INLINE)
status = iomap_write_begin_inline(iter, folio);
else if (srcmap->flags & IOMAP_F_BUFFER_HEAD)
- status = __block_write_begin_int(folio, pos, len, NULL, srcmap);
+ status = __block_write_begin_int(folio, pos, len, NULL, srcmap, 0);
else
status = __iomap_write_begin(iter, pos, len, folio);
@@ -979,7 +979,7 @@ static loff_t iomap_folio_mkwrite_iter(struct iomap_iter *iter,
if (iter->iomap.flags & IOMAP_F_BUFFER_HEAD) {
ret = __block_write_begin_int(folio, iter->pos, length, NULL,
- &iter->iomap);
+ &iter->iomap, 0);
if (ret)
return ret;
block_commit_write(&folio->page, 0, length);
--
2.30.2
* [PATCH v2 02/13] mm: Introduce do_generic_perform_write
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 03/13] mm: Add support for async buffered writes Stefan Roesch
` (11 subsequent siblings)
13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr
This splits off a new do_generic_perform_write() function from
generic_perform_write(), so an additional flags parameter can be
specified. The new flags parameter is used to support async buffered
writes.
Signed-off-by: Stefan Roesch <[email protected]>
---
include/linux/fs.h | 1 +
mm/filemap.c | 20 +++++++++++++++-----
2 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e2d892b201b0..b7dd5bd701c0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -278,6 +278,7 @@ enum positive_aop_returns {
#define AOP_FLAG_NOFS 0x0002 /* used by filesystem to direct
* helper code (eg buffer layer)
* to clear GFP_FS from alloc */
+#define AOP_FLAG_NOWAIT 0x0004 /* async nowait buffered writes */
/*
* oh the beauties of C type declarations.
diff --git a/mm/filemap.c b/mm/filemap.c
index ad8c39d90bf9..5bd692a327d0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3725,14 +3725,13 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
}
EXPORT_SYMBOL(generic_file_direct_write);
-ssize_t generic_perform_write(struct file *file,
- struct iov_iter *i, loff_t pos)
+static ssize_t do_generic_perform_write(struct file *file, struct iov_iter *i,
+ loff_t pos, int flags)
{
struct address_space *mapping = file->f_mapping;
const struct address_space_operations *a_ops = mapping->a_ops;
long status = 0;
ssize_t written = 0;
- unsigned int flags = 0;
do {
struct page *page;
@@ -3801,6 +3800,12 @@ ssize_t generic_perform_write(struct file *file,
return written ? written : status;
}
+
+ssize_t generic_perform_write(struct file *file,
+ struct iov_iter *i, loff_t pos)
+{
+ return do_generic_perform_write(file, i, pos, 0);
+}
EXPORT_SYMBOL(generic_perform_write);
/**
@@ -3832,6 +3837,10 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
ssize_t written = 0;
ssize_t err;
ssize_t status;
+ int flags = 0;
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ flags |= AOP_FLAG_NOWAIT;
/* We can write back this queue in page reclaim */
current->backing_dev_info = inode_to_bdi(inode);
@@ -3857,7 +3866,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
goto out;
- status = generic_perform_write(file, from, pos = iocb->ki_pos);
+ status = do_generic_perform_write(file, from, pos = iocb->ki_pos, flags);
+
/*
* If generic_perform_write() returned a synchronous error
* then we want to return the number of bytes which were
@@ -3889,7 +3899,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
*/
}
} else {
- written = generic_perform_write(file, from, iocb->ki_pos);
+ written = do_generic_perform_write(file, from, iocb->ki_pos, flags);
if (likely(written > 0))
iocb->ki_pos += written;
}
--
2.30.2
* [PATCH v2 03/13] mm: Add support for async buffered writes
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 02/13] mm: Introduce do_generic_perform_write Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 04/13] fs: split off __alloc_page_buffers function Stefan Roesch
` (10 subsequent siblings)
13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr
This adds support for async buffered writes in the mm layer. When the
AOP_FLAG_NOWAIT flag is set and the page is not already in the page
cache, the page is allocated without blocking on the allocation.
Signed-off-by: Stefan Roesch <[email protected]>
---
mm/filemap.c | 1 +
mm/folio-compat.c | 12 ++++++++++--
2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 5bd692a327d0..f4e2036c5029 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -42,6 +42,7 @@
#include <linux/ramfs.h>
#include <linux/page_idle.h>
#include <linux/migrate.h>
+#include <linux/sched/mm.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include "internal.h"
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 749555a232a8..8243eeb883c1 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -133,11 +133,19 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
pgoff_t index, unsigned flags)
{
unsigned fgp_flags = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE;
+ gfp_t gfp = mapping_gfp_mask(mapping);
if (flags & AOP_FLAG_NOFS)
fgp_flags |= FGP_NOFS;
- return pagecache_get_page(mapping, index, fgp_flags,
- mapping_gfp_mask(mapping));
+
+ if (flags & AOP_FLAG_NOWAIT) {
+ fgp_flags |= FGP_NOWAIT;
+
+ gfp |= GFP_ATOMIC;
+ gfp &= ~__GFP_DIRECT_RECLAIM;
+ }
+
+ return pagecache_get_page(mapping, index, fgp_flags, gfp);
}
EXPORT_SYMBOL(grab_cache_page_write_begin);
--
2.30.2
* [PATCH v2 04/13] fs: split off __alloc_page_buffers function
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
` (2 preceding siblings ...)
2022-02-18 19:57 ` [PATCH v2 03/13] mm: Add support for async buffered writes Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
2022-02-18 20:42 ` Matthew Wilcox
2022-02-19 7:35 ` Christoph Hellwig
2022-02-18 19:57 ` [PATCH v2 05/13] fs: split off __create_empty_buffers function Stefan Roesch
` (9 subsequent siblings)
13 siblings, 2 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr
This splits off the __alloc_page_buffers() function from the
alloc_page_buffers() function. In addition it adds a gfp_t parameter,
so the caller can specify the allocation flags.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/buffer.c | 37 ++++++++++++++++++++++---------------
1 file changed, 22 insertions(+), 15 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 6e6a69a12eed..2858eaf433c8 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -802,26 +802,13 @@ int remove_inode_buffers(struct inode *inode)
return ret;
}
-/*
- * Create the appropriate buffers when given a page for data area and
- * the size of each buffer.. Use the bh->b_this_page linked list to
- * follow the buffers created. Return NULL if unable to create more
- * buffers.
- *
- * The retry flag is used to differentiate async IO (paging, swapping)
- * which may not fail from ordinary buffer allocations.
- */
-struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
- bool retry)
+static struct buffer_head *__alloc_page_buffers(struct page *page,
+ unsigned long size, gfp_t gfp)
{
struct buffer_head *bh, *head;
- gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
long offset;
struct mem_cgroup *memcg, *old_memcg;
- if (retry)
- gfp |= __GFP_NOFAIL;
-
/* The page lock pins the memcg */
memcg = page_memcg(page);
old_memcg = set_active_memcg(memcg);
@@ -859,6 +846,26 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
goto out;
}
+
+/*
+ * Create the appropriate buffers when given a page for data area and
+ * the size of each buffer.. Use the bh->b_this_page linked list to
+ * follow the buffers created. Return NULL if unable to create more
+ * buffers.
+ *
+ * The retry flag is used to differentiate async IO (paging, swapping)
+ * which may not fail from ordinary buffer allocations.
+ */
+struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
+ bool retry)
+{
+ gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
+
+ if (retry)
+ gfp |= __GFP_NOFAIL;
+
+ return __alloc_page_buffers(page, size, gfp);
+}
EXPORT_SYMBOL_GPL(alloc_page_buffers);
static inline void
--
2.30.2
* [PATCH v2 05/13] fs: split off __create_empty_buffers function
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
` (3 preceding siblings ...)
2022-02-18 19:57 ` [PATCH v2 04/13] fs: split off __alloc_page_buffers function Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 06/13] fs: Add gfp_t parameter to create_page_buffers() Stefan Roesch
` (8 subsequent siblings)
13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr
This splits off the function __create_empty_buffers() from the function
create_empty_buffers(). The new __create_empty_buffers() function takes
an additional gfp parameter, which allows the caller to specify the
allocation flags.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/buffer.c | 22 ++++++++++++++--------
1 file changed, 14 insertions(+), 8 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 2858eaf433c8..648e1cba6da3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1554,17 +1554,12 @@ void block_invalidatepage(struct page *page, unsigned int offset,
EXPORT_SYMBOL(block_invalidatepage);
-/*
- * We attach and possibly dirty the buffers atomically wrt
- * __set_page_dirty_buffers() via private_lock. try_to_free_buffers
- * is already excluded via the page lock.
- */
-void create_empty_buffers(struct page *page,
- unsigned long blocksize, unsigned long b_state)
+static void __create_empty_buffers(struct page *page, unsigned long blocksize,
+ unsigned long b_state, gfp_t gfp)
{
struct buffer_head *bh, *head, *tail;
- head = alloc_page_buffers(page, blocksize, true);
+ head = __alloc_page_buffers(page, blocksize, gfp);
bh = head;
do {
bh->b_state |= b_state;
@@ -1587,6 +1582,17 @@ void create_empty_buffers(struct page *page,
attach_page_private(page, head);
spin_unlock(&page->mapping->private_lock);
}
+/*
+ * We attach and possibly dirty the buffers atomically wrt
+ * __set_page_dirty_buffers() via private_lock. try_to_free_buffers
+ * is already excluded via the page lock.
+ */
+void create_empty_buffers(struct page *page,
+ unsigned long blocksize, unsigned long b_state)
+{
+ return __create_empty_buffers(page, blocksize, b_state,
+ GFP_NOFS | __GFP_ACCOUNT | __GFP_NOFAIL);
+}
EXPORT_SYMBOL(create_empty_buffers);
/**
--
2.30.2
* [PATCH v2 06/13] fs: Add gfp_t parameter to create_page_buffers()
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
` (4 preceding siblings ...)
2022-02-18 19:57 ` [PATCH v2 05/13] fs: split off __create_empty_buffers function Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
2022-02-21 0:18 ` kernel test robot
2022-02-18 19:57 ` [PATCH v2 07/13] fs: add support for async buffered writes Stefan Roesch
` (7 subsequent siblings)
13 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr
This adds a gfp_t parameter to the create_page_buffers() function.
This allows the caller to specify the required allocation flags.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/buffer.c | 28 ++++++++++++++++++++--------
1 file changed, 20 insertions(+), 8 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 648e1cba6da3..ae588ae4b1c1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1682,13 +1682,20 @@ static inline int block_size_bits(unsigned int blocksize)
return ilog2(blocksize);
}
-static struct buffer_head *create_page_buffers(struct page *page, struct inode *inode, unsigned int b_state)
+static struct buffer_head *create_page_buffers(struct page *page,
+ struct inode *inode,
+ unsigned int b_state,
+ gfp_t flags)
{
BUG_ON(!PageLocked(page));
- if (!page_has_buffers(page))
- create_empty_buffers(page, 1 << READ_ONCE(inode->i_blkbits),
- b_state);
+ if (!page_has_buffers(page)) {
+ gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT | flags;
+
+ __create_empty_buffers(page, 1 << READ_ONCE(inode->i_blkbits),
+ b_state, gfp);
+ }
+
return page_buffers(page);
}
@@ -1734,7 +1741,7 @@ int __block_write_full_page(struct inode *inode, struct page *page,
int write_flags = wbc_to_write_flags(wbc);
head = create_page_buffers(page, inode,
- (1 << BH_Dirty)|(1 << BH_Uptodate));
+ (1 << BH_Dirty)|(1 << BH_Uptodate), __GFP_NOFAIL);
/*
* Be very careful. We have no exclusion from __set_page_dirty_buffers
@@ -2000,7 +2007,7 @@ int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
BUG_ON(to > PAGE_SIZE);
BUG_ON(from > to);
- head = create_page_buffers(&folio->page, inode, 0);
+ head = create_page_buffers(&folio->page, inode, 0, flags);
blocksize = head->b_size;
bbits = block_size_bits(blocksize);
@@ -2127,12 +2134,17 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
pgoff_t index = pos >> PAGE_SHIFT;
struct page *page;
int status;
+ gfp_t gfp = 0;
+ bool no_wait = (flags & AOP_FLAG_NOWAIT);
+
+ if (no_wait)
+ gfp = GFP_ATOMIC | __GFP_NOWARN;
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page)
return -ENOMEM;
- status = __block_write_begin_int(page_folio(page), pos, len, get_block, NULL, flags);
+ status = __block_write_begin_int(page_folio(page), pos, len, get_block, NULL, gfp);
if (unlikely(status)) {
unlock_page(page);
put_page(page);
@@ -2280,7 +2292,7 @@ int block_read_full_page(struct page *page, get_block_t *get_block)
int nr, i;
int fully_mapped = 1;
- head = create_page_buffers(page, inode, 0);
+ head = create_page_buffers(page, inode, 0, __GFP_NOFAIL);
blocksize = head->b_size;
bbits = block_size_bits(blocksize);
--
2.30.2
* [PATCH v2 07/13] fs: add support for async buffered writes
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
` (5 preceding siblings ...)
2022-02-18 19:57 ` [PATCH v2 06/13] fs: Add gfp_t parameter to create_page_buffers() Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 08/13] io_uring: " Stefan Roesch
` (6 subsequent siblings)
13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr
This adds support for the AOP_FLAG_NOWAIT flag to the fs layer. If
a page that is required for writing is not in the page cache, it returns
-EAGAIN instead of -ENOMEM.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/buffer.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index ae588ae4b1c1..58331ef214b9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2062,6 +2062,7 @@ int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
*wait_bh++=bh;
}
}
+
/*
* If we issued read requests - let them complete.
*/
@@ -2141,8 +2142,11 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
gfp = GFP_ATOMIC | __GFP_NOWARN;
page = grab_cache_page_write_begin(mapping, index, flags);
- if (!page)
+ if (!page) {
+ if (no_wait)
+ return -EAGAIN;
return -ENOMEM;
+ }
status = __block_write_begin_int(page_folio(page), pos, len, get_block, NULL, gfp);
if (unlikely(status)) {
--
2.30.2
* [PATCH v2 08/13] io_uring: add support for async buffered writes
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
` (6 preceding siblings ...)
2022-02-18 19:57 ` [PATCH v2 07/13] fs: add support for async buffered writes Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 09/13] io_uring: Add tracepoint for short writes Stefan Roesch
` (5 subsequent siblings)
13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr
This enables the async buffered writes for block devices in io_uring.
Buffered writes are enabled for blocks that are already in the page
cache or can be acquired with noio.
It is possible that a write request cannot be completely fulfilled
(short write). In that case the request is punted and sent to the io
workers to be completed. Before submitting the request to the io
workers, the request is updated with how much has already been written.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/io_uring.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 77b9c7e4793b..52bd88908afd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3615,7 +3615,7 @@ static inline int io_iter_do_read(struct io_kiocb *req, struct iov_iter *iter)
return -EINVAL;
}
-static bool need_read_all(struct io_kiocb *req)
+static bool need_complete_io(struct io_kiocb *req)
{
return req->flags & REQ_F_ISREG ||
S_ISBLK(file_inode(req->file)->i_mode);
@@ -3679,7 +3679,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
} else if (ret == -EIOCBQUEUED) {
goto out_free;
} else if (ret == req->result || ret <= 0 || !force_nonblock ||
- (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) {
+ (req->flags & REQ_F_NOWAIT) || !need_complete_io(req)) {
/* read all, failed, already did sync or don't want to retry */
goto done;
}
@@ -3777,9 +3777,10 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
if (unlikely(!io_file_supports_nowait(req)))
goto copy_iov;
- /* file path doesn't support NOWAIT for non-direct_IO */
- if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT) &&
- (req->flags & REQ_F_ISREG))
+ /* File path supports NOWAIT for non-direct_IO only for block devices. */
+ if (!(kiocb->ki_flags & IOCB_DIRECT) &&
+ !(kiocb->ki_filp->f_mode & FMODE_BUF_WASYNC) &&
+ (req->flags & REQ_F_ISREG))
goto copy_iov;
kiocb->ki_flags |= IOCB_NOWAIT;
@@ -3831,6 +3832,24 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
/* IOPOLL retry should happen for io-wq threads */
if (ret2 == -EAGAIN && (req->ctx->flags & IORING_SETUP_IOPOLL))
goto copy_iov;
+
+ if (ret2 != req->result && ret2 >= 0 && need_complete_io(req)) {
+ struct io_async_rw *rw;
+
+ /* This is a partial write. The file pos has already been
+ * updated, setup the async struct to complete the request
+ * in the worker. Also update bytes_done to account for
+ * the bytes already written.
+ */
+ iov_iter_save_state(&s->iter, &s->iter_state);
+ ret = io_setup_async_rw(req, iovec, s, true);
+
+ rw = req->async_data;
+ if (rw)
+ rw->bytes_done += ret2;
+
+ return ret ? ret : -EAGAIN;
+ }
done:
kiocb_done(req, ret2, issue_flags);
} else {
--
2.30.2
* [PATCH v2 09/13] io_uring: Add tracepoint for short writes
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
` (7 preceding siblings ...)
2022-02-18 19:57 ` [PATCH v2 08/13] io_uring: " Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 10/13] sched: add new fields to task_struct Stefan Roesch
` (4 subsequent siblings)
13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr
This adds the io_uring_short_write tracepoint to io_uring. A short write
occurs when not all pages required for a write are in the page cache,
so the async buffered write has to return -EAGAIN for the remainder.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/io_uring.c | 3 +++
include/trace/events/io_uring.h | 25 +++++++++++++++++++++++++
2 files changed, 28 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 52bd88908afd..792ca4b6834d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3836,6 +3836,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
if (ret2 != req->result && ret2 >= 0 && need_complete_io(req)) {
struct io_async_rw *rw;
+ trace_io_uring_short_write(req->ctx, kiocb->ki_pos - ret2,
+ req->result, ret2);
+
/* This is a partial write. The file pos has already been
* updated, setup the async struct to complete the request
* in the worker. Also update bytes_done to account for
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 7346f0164cf4..ce1cfdf4b015 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -558,6 +558,31 @@ TRACE_EVENT(io_uring_req_failed,
(unsigned long long) __entry->pad2, __entry->error)
);
+TRACE_EVENT(io_uring_short_write,
+
+ TP_PROTO(void *ctx, u64 fpos, u64 wanted, u64 got),
+
+ TP_ARGS(ctx, fpos, wanted, got),
+
+ TP_STRUCT__entry(
+ __field(void *, ctx)
+ __field(u64, fpos)
+ __field(u64, wanted)
+ __field(u64, got)
+ ),
+
+ TP_fast_assign(
+ __entry->ctx = ctx;
+ __entry->fpos = fpos;
+ __entry->wanted = wanted;
+ __entry->got = got;
+ ),
+
+ TP_printk("ring %p, fpos %lld, wanted %lld, got %lld",
+ __entry->ctx, __entry->fpos,
+ __entry->wanted, __entry->got)
+);
+
#endif /* _TRACE_IO_URING_H */
/* This part must be outside protection */
--
2.30.2
* [PATCH v2 10/13] sched: add new fields to task_struct
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
` (8 preceding siblings ...)
2022-02-18 19:57 ` [PATCH v2 09/13] io_uring: Add tracepoint for short writes Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 11/13] mm: support write throttling for async buffered writes Stefan Roesch
` (3 subsequent siblings)
13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr
Add two new fields to the task_struct to support async
write throttling.
- One field to store how long writes are throttled: bdp_pause
- The other field to store the number of dirtied pages:
bdp_nr_dirtied_pause
Signed-off-by: Stefan Roesch <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/fork.c | 1 +
2 files changed, 4 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75ba8aa60248..97146b7539c5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1324,6 +1324,9 @@ struct task_struct {
/* Start of a write-and-pause period: */
unsigned long dirty_paused_when;
+ unsigned long bdp_pause;
+ int bdp_nr_dirtied_pause;
+
#ifdef CONFIG_LATENCYTOP
int latency_record_count;
struct latency_record latency_record[LT_SAVECOUNT];
diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528f7b21..d34c9c00baea 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2246,6 +2246,7 @@ static __latent_entropy struct task_struct *copy_process(
p->nr_dirtied = 0;
p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
p->dirty_paused_when = 0;
+ p->bdp_nr_dirtied_pause = -1;
p->pdeath_signal = 0;
INIT_LIST_HEAD(&p->thread_group);
--
2.30.2
* [PATCH v2 11/13] mm: support write throttling for async buffered writes
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
` (9 preceding siblings ...)
2022-02-18 19:57 ` [PATCH v2 10/13] sched: add new fields to task_struct Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 12/13] io_uring: " Stefan Roesch
` (2 subsequent siblings)
13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr
This change adds support for async write throttling in the function
balance_dirty_pages(). So far if throttling was required, the code
was waiting synchronously as long as the writes were throttled. This
change introduces asynchronous throttling. Instead of waiting in the
function balance_dirty_pages(), the timeout is set in the task_struct
field bdp_pause. Once the timeout has expired, the writes are no
longer throttled.
- Add a new parameter to the balance_dirty_pages() function
- This allows the caller to pass in the nowait flag
- When the nowait flag is specified, the code does not wait in
balance_dirty_pages(), but instead stores the wait expiration in the
new task_struct field bdp_pause.
- The function balance_dirty_pages_ratelimited() resets the new values
in the task_struct, once the timeout has expired
This change is required to support write throttling for async
buffered writes. While the writes are throttled, io_uring can still
make progress by processing other requests.
Signed-off-by: Stefan Roesch <[email protected]>
---
include/linux/writeback.h | 1 +
mm/filemap.c | 2 +-
mm/page-writeback.c | 54 ++++++++++++++++++++++++++++-----------
3 files changed, 41 insertions(+), 16 deletions(-)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index fec248ab1fec..48176a8047db 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -373,6 +373,7 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
void wb_update_bandwidth(struct bdi_writeback *wb);
void balance_dirty_pages_ratelimited(struct address_space *mapping);
+void balance_dirty_pages_ratelimited_flags(struct address_space *mapping, bool is_async);
bool wb_over_bg_thresh(struct bdi_writeback *wb);
typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
diff --git a/mm/filemap.c b/mm/filemap.c
index f4e2036c5029..642a4e814869 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3796,7 +3796,7 @@ static ssize_t do_generic_perform_write(struct file *file, struct iov_iter *i,
pos += status;
written += status;
- balance_dirty_pages_ratelimited(mapping);
+ balance_dirty_pages_ratelimited_flags(mapping, flags & AOP_FLAG_NOWAIT);
} while (iov_iter_count(i));
return written ? written : status;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 91d163f8d36b..767d0b997da5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1558,7 +1558,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
* perform some writeout.
*/
static void balance_dirty_pages(struct bdi_writeback *wb,
- unsigned long pages_dirtied)
+ unsigned long pages_dirtied, bool is_async)
{
struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) };
@@ -1792,6 +1792,14 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
period,
pause,
start_time);
+ if (is_async) {
+ if (current->bdp_nr_dirtied_pause == -1) {
+ current->bdp_pause = now + pause;
+ current->bdp_nr_dirtied_pause = nr_dirtied_pause;
+ }
+ break;
+ }
+
__set_current_state(TASK_KILLABLE);
wb->dirty_sleep = now;
io_schedule_timeout(pause);
@@ -1799,6 +1807,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
current->dirty_paused_when = now + pause;
current->nr_dirtied = 0;
current->nr_dirtied_pause = nr_dirtied_pause;
+ current->bdp_nr_dirtied_pause = -1;
+ current->bdp_pause = 0;
/*
* This is typically equal to (dirty < thresh) and can also
@@ -1863,19 +1873,7 @@ static DEFINE_PER_CPU(int, bdp_ratelimits);
*/
DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
-/**
- * balance_dirty_pages_ratelimited - balance dirty memory state
- * @mapping: address_space which was dirtied
- *
- * Processes which are dirtying memory should call in here once for each page
- * which was newly dirtied. The function will periodically check the system's
- * dirty state and will initiate writeback if needed.
- *
- * Once we're over the dirty memory limit we decrease the ratelimiting
- * by a lot, to prevent individual processes from overshooting the limit
- * by (ratelimit_pages) each.
- */
-void balance_dirty_pages_ratelimited(struct address_space *mapping)
+void balance_dirty_pages_ratelimited_flags(struct address_space *mapping, bool is_async)
{
struct inode *inode = mapping->host;
struct backing_dev_info *bdi = inode_to_bdi(inode);
@@ -1886,6 +1884,15 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
if (!(bdi->capabilities & BDI_CAP_WRITEBACK))
return;
+ if (current->bdp_nr_dirtied_pause != -1 && time_after(jiffies, current->bdp_pause)) {
+ current->dirty_paused_when = current->bdp_pause;
+ current->nr_dirtied = 0;
+ current->nr_dirtied_pause = current->bdp_nr_dirtied_pause;
+
+ current->bdp_nr_dirtied_pause = -1;
+ current->bdp_pause = 0;
+ }
+
if (inode_cgwb_enabled(inode))
wb = wb_get_create_current(bdi, GFP_KERNEL);
if (!wb)
@@ -1924,10 +1931,27 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
preempt_enable();
if (unlikely(current->nr_dirtied >= ratelimit))
- balance_dirty_pages(wb, current->nr_dirtied);
+ balance_dirty_pages(wb, current->nr_dirtied, is_async);
wb_put(wb);
}
+
+/**
+ * balance_dirty_pages_ratelimited - balance dirty memory state
+ * @mapping: address_space which was dirtied
+ *
+ * Processes which are dirtying memory should call in here once for each page
+ * which was newly dirtied. The function will periodically check the system's
+ * dirty state and will initiate writeback if needed.
+ *
+ * Once we're over the dirty memory limit we decrease the ratelimiting
+ * by a lot, to prevent individual processes from overshooting the limit
+ * by (ratelimit_pages) each.
+ */
+void balance_dirty_pages_ratelimited(struct address_space *mapping)
+{
+ balance_dirty_pages_ratelimited_flags(mapping, false);
+}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
/**
--
2.30.2
* [PATCH v2 12/13] io_uring: support write throttling for async buffered writes
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
` (10 preceding siblings ...)
2022-02-18 19:57 ` [PATCH v2 11/13] mm: support write throttling for async buffered writes Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 13/13] block: enable async buffered writes for block devices Stefan Roesch
2022-02-20 22:38 ` [PATCH v2 00/13] Support sync buffered writes for io-uring Dave Chinner
13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr
This adds process-level write throttling for async buffered writes
to io_uring. In io_write() the code now checks whether the write needs
to be throttled. If so, it adds the request to the list of pending io
requests and starts a timer. After the timer expires, the list of
pending writes is submitted.
- Add new list called pending_ios for delayed writes (throttled writes)
to struct io_uring_task. The list is protected by the task_lock spin
lock.
- Add new timer to struct io_uring_task.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/io_uring.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 91 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 792ca4b6834d..8a48e5ee4e5e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -461,6 +461,11 @@ struct io_ring_ctx {
};
};
+struct pending_list {
+ struct list_head list;
+ struct io_kiocb *req;
+};
+
struct io_uring_task {
/* submission side */
int cached_refs;
@@ -477,6 +482,9 @@ struct io_uring_task {
struct io_wq_work_list prior_task_list;
struct callback_head task_work;
bool task_running;
+
+ struct pending_list pending_ios;
+ struct timer_list timer;
};
/*
@@ -1134,13 +1142,14 @@ static void io_rsrc_put_work(struct work_struct *work);
static void io_req_task_queue(struct io_kiocb *req);
static void __io_submit_flush_completions(struct io_ring_ctx *ctx);
-static int io_req_prep_async(struct io_kiocb *req);
+static int io_req_prep_async(struct io_kiocb *req, bool force);
static int io_install_fixed_file(struct io_kiocb *req, struct file *file,
unsigned int issue_flags, u32 slot_index);
static int io_close_fixed(struct io_kiocb *req, unsigned int issue_flags);
static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer);
+static void delayed_write_fn(struct timer_list *tmr);
static struct kmem_cache *req_cachep;
@@ -2462,6 +2471,31 @@ static void io_req_task_queue_reissue(struct io_kiocb *req)
io_req_task_work_add(req, false);
}
+static int io_req_task_queue_reissue_delayed(struct io_kiocb *req)
+{
+ struct io_uring_task *tctx = req->task->io_uring;
+ struct pending_list *pending = kmalloc(sizeof(struct pending_list), GFP_KERNEL);
+ bool empty;
+
+ if (!pending)
+ return -ENOMEM;
+ pending->req = req;
+
+ spin_lock_irq(&tctx->task_lock);
+ empty = list_empty(&tctx->pending_ios.list);
+ list_add_tail(&pending->list, &tctx->pending_ios.list);
+
+ if (empty) {
+ timer_setup(&tctx->timer, delayed_write_fn, 0);
+
+ tctx->timer.expires = current->bdp_pause;
+ add_timer(&tctx->timer);
+ }
+ spin_unlock_irq(&tctx->task_lock);
+
+ return 0;
+}
+
static inline void io_queue_next(struct io_kiocb *req)
{
struct io_kiocb *nxt = io_req_find_next(req);
@@ -2770,7 +2804,7 @@ static bool io_resubmit_prep(struct io_kiocb *req)
struct io_async_rw *rw = req->async_data;
if (!req_has_async_data(req))
- return !io_req_prep_async(req);
+ return !io_req_prep_async(req, false);
iov_iter_restore(&rw->s.iter, &rw->s.iter_state);
return true;
}
@@ -3751,6 +3785,38 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return io_prep_rw(req, sqe);
}
+static inline unsigned long write_delay(void)
+{
+ if (likely(current->bdp_nr_dirtied_pause == -1 ||
+ !time_before(jiffies, current->bdp_pause)))
+ return 0;
+
+ return current->bdp_pause;
+}
+
+static void delayed_write_fn(struct timer_list *tmr)
+{
+ struct io_uring_task *tctx = from_timer(tctx, tmr, timer);
+ struct list_head *curr;
+ struct list_head *next;
+ LIST_HEAD(pending_ios);
+
+ /* Move list to temporary list. */
+ spin_lock_irq(&tctx->task_lock);
+ list_splice_init(&tctx->pending_ios.list, &pending_ios);
+ spin_unlock_irq(&tctx->task_lock);
+
+ list_for_each_safe(curr, next, &pending_ios) {
+ struct pending_list *io;
+
+ io = list_entry(curr, struct pending_list, list);
+ io_req_task_queue_reissue(io->req);
+
+ list_del(curr);
+ kfree(io);
+ }
+}
+
static int io_write(struct io_kiocb *req, unsigned int issue_flags)
{
struct io_rw_state __s, *s = &__s;
@@ -3759,6 +3825,18 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
ssize_t ret, ret2;
+ /* Write throttling active? */
+ if (unlikely(write_delay()) && !(kiocb->ki_flags & IOCB_DIRECT)) {
+ int ret = io_req_prep_async(req, true);
+
+ if (unlikely(ret))
+ io_req_complete_failed(req, ret);
+ else
+ ret = io_req_task_queue_reissue_delayed(req);
+
+ return ret;
+ }
+
if (!req_has_async_data(req)) {
ret = io_import_iovec(WRITE, req, &iovec, s, issue_flags);
if (unlikely(ret < 0))
@@ -6596,9 +6674,9 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return -EINVAL;
}
-static int io_req_prep_async(struct io_kiocb *req)
+static int io_req_prep_async(struct io_kiocb *req, bool force)
{
- if (!io_op_defs[req->opcode].needs_async_setup)
+ if (!force && !io_op_defs[req->opcode].needs_async_setup)
return 0;
if (WARN_ON_ONCE(req_has_async_data(req)))
return -EFAULT;
@@ -6608,6 +6686,10 @@ static int io_req_prep_async(struct io_kiocb *req)
switch (req->opcode) {
case IORING_OP_READV:
return io_rw_prep_async(req, READ);
+ case IORING_OP_WRITE:
+ if (!force)
+ break;
+ fallthrough;
case IORING_OP_WRITEV:
return io_rw_prep_async(req, WRITE);
case IORING_OP_SENDMSG:
@@ -6617,6 +6699,7 @@ static int io_req_prep_async(struct io_kiocb *req)
case IORING_OP_CONNECT:
return io_connect_prep_async(req);
}
+
printk_once(KERN_WARNING "io_uring: prep_async() bad opcode %d\n",
req->opcode);
return -EFAULT;
@@ -6650,7 +6733,7 @@ static __cold void io_drain_req(struct io_kiocb *req)
}
spin_unlock(&ctx->completion_lock);
- ret = io_req_prep_async(req);
+ ret = io_req_prep_async(req, false);
if (ret) {
fail:
io_req_complete_failed(req, ret);
@@ -7145,7 +7228,7 @@ static void io_queue_sqe_fallback(struct io_kiocb *req)
} else if (unlikely(req->ctx->drain_active)) {
io_drain_req(req);
} else {
- int ret = io_req_prep_async(req);
+ int ret = io_req_prep_async(req, false);
if (unlikely(ret))
io_req_complete_failed(req, ret);
@@ -7344,7 +7427,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
struct io_kiocb *head = link->head;
if (!(req->flags & REQ_F_FAIL)) {
- ret = io_req_prep_async(req);
+ ret = io_req_prep_async(req, false);
if (unlikely(ret)) {
req_fail_link_node(req, ret);
if (!(head->flags & REQ_F_FAIL))
@@ -8784,6 +8867,7 @@ static __cold int io_uring_alloc_task_context(struct task_struct *task,
INIT_WQ_LIST(&tctx->task_list);
INIT_WQ_LIST(&tctx->prior_task_list);
init_task_work(&tctx->task_work, tctx_task_work);
+ INIT_LIST_HEAD(&tctx->pending_ios.list);
return 0;
}
--
2.30.2
* [PATCH v2 13/13] block: enable async buffered writes for block devices.
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
` (11 preceding siblings ...)
2022-02-18 19:57 ` [PATCH v2 12/13] io_uring: " Stefan Roesch
@ 2022-02-18 19:57 ` Stefan Roesch
2022-02-20 22:38 ` [PATCH v2 00/13] Support sync buffered writes for io-uring Dave Chinner
13 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 19:57 UTC (permalink / raw)
To: io-uring, linux-fsdevel, linux-block, kernel-team; +Cc: shr
This introduces the flag FMODE_BUF_WASYNC. A device that supports async
buffered writes can set this flag. The patch also enables async buffered
writes for block devices.
Signed-off-by: Stefan Roesch <[email protected]>
---
block/fops.c | 5 +----
fs/read_write.c | 3 ++-
include/linux/fs.h | 3 +++
3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/block/fops.c b/block/fops.c
index 4f59e0f5bf30..75b36f8b5e71 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -489,7 +489,7 @@ static int blkdev_open(struct inode *inode, struct file *filp)
* during an unstable branch.
*/
filp->f_flags |= O_LARGEFILE;
- filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
+ filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC;
if (filp->f_flags & O_NDELAY)
filp->f_mode |= FMODE_NDELAY;
@@ -544,9 +544,6 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (iocb->ki_pos >= size)
return -ENOSPC;
- if ((iocb->ki_flags & (IOCB_NOWAIT | IOCB_DIRECT)) == IOCB_NOWAIT)
- return -EOPNOTSUPP;
-
size -= iocb->ki_pos;
if (iov_iter_count(from) > size) {
shorted = iov_iter_count(from) - size;
diff --git a/fs/read_write.c b/fs/read_write.c
index 0074afa7ecb3..58233844a9d8 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1641,7 +1641,8 @@ ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
if (iocb->ki_flags & IOCB_APPEND)
iocb->ki_pos = i_size_read(inode);
- if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+ if ((iocb->ki_flags & IOCB_NOWAIT) &&
+ (!(iocb->ki_flags & IOCB_DIRECT) && !(file->f_mode & FMODE_BUF_WASYNC)))
return -EINVAL;
count = iov_iter_count(from);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b7dd5bd701c0..a526de71b932 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -176,6 +176,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File supports async buffered reads */
#define FMODE_BUF_RASYNC ((__force fmode_t)0x40000000)
+/* File supports async nowait buffered writes */
+#define FMODE_BUF_WASYNC ((__force fmode_t)0x80000000)
+
/*
* Attribute flags. These should be or-ed together to figure out what
* has been changed!
--
2.30.2
* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
2022-02-18 19:57 ` [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int Stefan Roesch
@ 2022-02-18 19:59 ` Matthew Wilcox
2022-02-18 20:08 ` Stefan Roesch
0 siblings, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2022-02-18 19:59 UTC (permalink / raw)
To: Stefan Roesch; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team
On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
> This adds a flags parameter to the __begin_write_begin_int() function.
> This allows to pass flags down the stack.
Still no.
* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
2022-02-18 19:59 ` Matthew Wilcox
@ 2022-02-18 20:08 ` Stefan Roesch
2022-02-18 20:13 ` Matthew Wilcox
0 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 20:08 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team
On 2/18/22 11:59 AM, Matthew Wilcox wrote:
> On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
>> This adds a flags parameter to the __begin_write_begin_int() function.
>> This allows to pass flags down the stack.
>
> Still no.
Currently block_begin_write_cache is expecting an aop_flag. Are you asking to
first have a patch that replaces the existing aop_flag parameter with the gfp_t?
and then modify this patch to directly use gfp flags?
* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
2022-02-18 20:08 ` Stefan Roesch
@ 2022-02-18 20:13 ` Matthew Wilcox
2022-02-18 20:14 ` Stefan Roesch
0 siblings, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2022-02-18 20:13 UTC (permalink / raw)
To: Stefan Roesch; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team
On Fri, Feb 18, 2022 at 12:08:27PM -0800, Stefan Roesch wrote:
>
>
> On 2/18/22 11:59 AM, Matthew Wilcox wrote:
> > On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
> >> This adds a flags parameter to the __begin_write_begin_int() function.
> >> This allows to pass flags down the stack.
> >
> > Still no.
>
> Currently block_begin_write_cache is expecting an aop_flag. Are you asking to
There is no function by that name in Linus' tree.
> first have a patch that replaces the existing aop_flag parameter with the gfp_t?
> and then modify this patch to directly use gfp flags?
* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
2022-02-18 20:13 ` Matthew Wilcox
@ 2022-02-18 20:14 ` Stefan Roesch
2022-02-18 20:22 ` Matthew Wilcox
0 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 20:14 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team
On 2/18/22 12:13 PM, Matthew Wilcox wrote:
> On Fri, Feb 18, 2022 at 12:08:27PM -0800, Stefan Roesch wrote:
>>
>>
>> On 2/18/22 11:59 AM, Matthew Wilcox wrote:
>>> On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
>>>> This adds a flags parameter to the __begin_write_begin_int() function.
>>>> This allows to pass flags down the stack.
>>>
>>> Still no.
>>
>> Currently block_begin_write_cache is expecting an aop_flag. Are you asking to
>
> There is no function by that name in Linus' tree.
>
>> first have a patch that replaces the existing aop_flag parameter with the gfp_t?
>> and then modify this patch to directly use gfp flags?
s/block_begin_write_cache/block_write_begin/
* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
2022-02-18 20:14 ` Stefan Roesch
@ 2022-02-18 20:22 ` Matthew Wilcox
2022-02-18 20:25 ` Stefan Roesch
0 siblings, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2022-02-18 20:22 UTC (permalink / raw)
To: Stefan Roesch; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team
On Fri, Feb 18, 2022 at 12:14:50PM -0800, Stefan Roesch wrote:
>
>
> On 2/18/22 12:13 PM, Matthew Wilcox wrote:
> > On Fri, Feb 18, 2022 at 12:08:27PM -0800, Stefan Roesch wrote:
> >>
> >>
> >> On 2/18/22 11:59 AM, Matthew Wilcox wrote:
> >>> On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
> >>>> This adds a flags parameter to the __begin_write_begin_int() function.
> >>>> This allows to pass flags down the stack.
> >>>
> >>> Still no.
> >>
> >> Currently block_begin_write_cache is expecting an aop_flag. Are you asking to
> >
> > There is no function by that name in Linus' tree.
> >
> >> first have a patch that replaces the existing aop_flag parameter with the gfp_t?
> >> and then modify this patch to directly use gfp flags?
>
> s/block_begin_write_cache/block_write_begin/
I don't think there's any need to change the arguments to
block_write_begin(). That's widely used and I don't think changing
all the users is worth it. You don't seem to call it anywhere in this
patch set.
But having block_write_begin() translate the aop flags into gfp
and fgp flags, yes. It can call pagecache_get_page() instead of
grab_cache_page_write_begin(). And then you don't need to change
grab_cache_page_write_begin() at all.
* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
2022-02-18 20:22 ` Matthew Wilcox
@ 2022-02-18 20:25 ` Stefan Roesch
2022-02-18 20:35 ` Matthew Wilcox
0 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 20:25 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team
On 2/18/22 12:22 PM, Matthew Wilcox wrote:
> On Fri, Feb 18, 2022 at 12:14:50PM -0800, Stefan Roesch wrote:
>>
>>
>> On 2/18/22 12:13 PM, Matthew Wilcox wrote:
>>> On Fri, Feb 18, 2022 at 12:08:27PM -0800, Stefan Roesch wrote:
>>>>
>>>>
>>>> On 2/18/22 11:59 AM, Matthew Wilcox wrote:
>>>>> On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
>>>>>> This adds a flags parameter to the __begin_write_begin_int() function.
>>>>>> This allows to pass flags down the stack.
>>>>>
>>>>> Still no.
>>>>
>>>> Currently block_begin_write_cache is expecting an aop_flag. Are you asking to
>>>
>>> There is no function by that name in Linus' tree.
>>>
>>>> first have a patch that replaces the existing aop_flag parameter with the gfp_t?
>>>> and then modify this patch to directly use gfp flags?
>>
>> s/block_begin_write_cache/block_write_begin/
>
> I don't think there's any need to change the arguments to
> block_write_begin(). That's widely used and I don't think changing
> all the users is worth it. You don't seem to call it anywhere in this
> patch set.
>
> But having block_write_begin() translate the aop flags into gfp
> and fgp flags, yes. It can call pagecache_get_page() instead of
> grab_cache_page_write_begin(). And then you don't need to change
> grab_cache_page_write_begin() at all.
That would still require adding a new aop flag (AOP_FLAG_NOWAIT).
You are ok with that?
* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
2022-02-18 20:25 ` Stefan Roesch
@ 2022-02-18 20:35 ` Matthew Wilcox
2022-02-18 20:39 ` Stefan Roesch
0 siblings, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2022-02-18 20:35 UTC (permalink / raw)
To: Stefan Roesch; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team
On Fri, Feb 18, 2022 at 12:25:41PM -0800, Stefan Roesch wrote:
>
>
> On 2/18/22 12:22 PM, Matthew Wilcox wrote:
> > On Fri, Feb 18, 2022 at 12:14:50PM -0800, Stefan Roesch wrote:
> >>
> >>
> >> On 2/18/22 12:13 PM, Matthew Wilcox wrote:
> >>> On Fri, Feb 18, 2022 at 12:08:27PM -0800, Stefan Roesch wrote:
> >>>>
> >>>>
> >>>> On 2/18/22 11:59 AM, Matthew Wilcox wrote:
> >>>>> On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
> >>>>>> This adds a flags parameter to the __block_write_begin_int() function.
> >>>>>> This allows flags to be passed down the stack.
> >>>>>
> >>>>> Still no.
> >>>>
> >>>> Currently block_begin_write_cache is expecting an aop_flag. Are you asking to
> >>>
> >>> There is no function by that name in Linus' tree.
> >>>
> >>>> first have a patch that replaces the existing aop_flag parameter with the gfp_t?
> >>>> and then modify this patch to directly use gfp flags?
> >>
> >> s/block_begin_write_cache/block_write_begin/
> >
> > I don't think there's any need to change the arguments to
> > block_write_begin(). That's widely used and I don't think changing
> > all the users is worth it. You don't seem to call it anywhere in this
> > patch set.
> >
> > But having block_write_begin() translate the aop flags into gfp
> > and fgp flags, yes. It can call pagecache_get_page() instead of
> > grab_cache_page_write_begin(). And then you don't need to change
> > grab_cache_page_write_begin() at all.
>
> That would still require adding a new aop flag (AOP_FLAG_NOWAIT).
> You are ok with that?
No new AOP_FLAG. block_write_begin() does not get called with
AOP_FLAG_NOWAIT in this series. You'd want to pass gfp flags to
__block_write_begin_int instead of aop flags.
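[For illustration only, not from the original mail: the prototype change being
suggested would look roughly like this.]

int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
                get_block_t *get_block, const struct iomap *iomap,
                gfp_t gfp);

[An async buffered write caller could then pass something like
GFP_ATOMIC | __GFP_NOWARN, while every other caller keeps its current
allocation behaviour.]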
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int
2022-02-18 20:35 ` Matthew Wilcox
@ 2022-02-18 20:39 ` Stefan Roesch
0 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 20:39 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team
On 2/18/22 12:35 PM, Matthew Wilcox wrote:
> On Fri, Feb 18, 2022 at 12:25:41PM -0800, Stefan Roesch wrote:
>>
>>
>> On 2/18/22 12:22 PM, Matthew Wilcox wrote:
>>> On Fri, Feb 18, 2022 at 12:14:50PM -0800, Stefan Roesch wrote:
>>>>
>>>>
>>>> On 2/18/22 12:13 PM, Matthew Wilcox wrote:
>>>>> On Fri, Feb 18, 2022 at 12:08:27PM -0800, Stefan Roesch wrote:
>>>>>>
>>>>>>
>>>>>> On 2/18/22 11:59 AM, Matthew Wilcox wrote:
>>>>>>> On Fri, Feb 18, 2022 at 11:57:27AM -0800, Stefan Roesch wrote:
>>>>>>>> This adds a flags parameter to the __block_write_begin_int() function.
>>>>>>>> This allows flags to be passed down the stack.
>>>>>>>
>>>>>>> Still no.
>>>>>>
>>>>>> Currently block_begin_write_cache is expecting an aop_flag. Are you asking to
>>>>>
>>>>> There is no function by that name in Linus' tree.
>>>>>
>>>>>> first have a patch that replaces the existing aop_flag parameter with the gfp_t?
>>>>>> and then modify this patch to directly use gfp flags?
>>>>
>>>> s/block_begin_write_cache/block_write_begin/
>>>
>>> I don't think there's any need to change the arguments to
>>> block_write_begin(). That's widely used and I don't think changing
>>> all the users is worth it. You don't seem to call it anywhere in this
>>> patch set.
>>>
>>> But having block_write_begin() translate the aop flags into gfp
>>> and fgp flags, yes. It can call pagecache_get_page() instead of
>>> grab_cache_page_write_begin(). And then you don't need to change
>>> grab_cache_page_write_begin() at all.
>>
>> That would still require adding a new aop flag (AOP_FLAG_NOWAIT).
>> You are ok with that?
>
> No new AOP_FLAG. block_write_begin() does not get called with
> AOP_FLAG_NOWAIT in this series. You'd want to pass gfp flags to
> __block_write_begin_int instead of aop flags.
v2 of the patch series is using AOP_FLAG_NOWAIT in block_write_begin().
Without introducing a new aop flag, how would I know in block_write_begin()
that the request is a nowait async buffered write?
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
2022-02-18 19:57 ` [PATCH v2 04/13] fs: split off __alloc_page_buffers function Stefan Roesch
@ 2022-02-18 20:42 ` Matthew Wilcox
2022-02-18 20:50 ` Stefan Roesch
2022-02-19 7:35 ` Christoph Hellwig
1 sibling, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2022-02-18 20:42 UTC (permalink / raw)
To: Stefan Roesch; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team
On Fri, Feb 18, 2022 at 11:57:30AM -0800, Stefan Roesch wrote:
> This splits off the __alloc_page_buffers() function from the
> alloc_page_buffers() function. In addition it adds a gfp_t parameter, so
> the caller can specify the allocation flags.
This one only has six callers, so let's get the API right. I suggest
making this:
struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
                gfp_t gfp)
{
        gfp |= __GFP_ACCOUNT;
and then all the existing callers specify either GFP_NOFS or
GFP_NOFS | __GFP_NOFAIL.
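[For illustration only, not from the original mail: with that signature, a
call site that relies on the retrying behaviour today would read roughly:]

        bh = alloc_page_buffers(page, size, GFP_NOFS | __GFP_NOFAIL);

[A hypothetical nowait path could instead pass something like
GFP_NOWAIT | __GFP_NOWARN.]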
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
2022-02-18 20:42 ` Matthew Wilcox
@ 2022-02-18 20:50 ` Stefan Roesch
0 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-02-18 20:50 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team
On 2/18/22 12:42 PM, Matthew Wilcox wrote:
> On Fri, Feb 18, 2022 at 11:57:30AM -0800, Stefan Roesch wrote:
>> This splits off the __alloc_page_buffers() function from the
>> alloc_page_buffers() function. In addition it adds a gfp_t parameter, so
>> the caller can specify the allocation flags.
>
> This one only has six callers, so let's get the API right. I suggest
> making this:
>
> struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
>                 gfp_t gfp)
> {
>         gfp |= __GFP_ACCOUNT;
>
> and then all the existing callers specify either GFP_NOFS or
> GFP_NOFS | __GFP_NOFAIL.
>
I can make that change, but I don't see how I can decide in block_write_begin()
to use different gfp flags when an async buffered write request is processed.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
2022-02-18 19:57 ` [PATCH v2 04/13] fs: split off __alloc_page_buffers function Stefan Roesch
2022-02-18 20:42 ` Matthew Wilcox
@ 2022-02-19 7:35 ` Christoph Hellwig
2022-02-20 4:23 ` Matthew Wilcox
1 sibling, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2022-02-19 7:35 UTC (permalink / raw)
To: Stefan Roesch; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team
Err, hell no. Please do not add any new functionality to the legacy
buffer head code. If you want new features do that on the
non-bufferhead iomap code path only please.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
2022-02-19 7:35 ` Christoph Hellwig
@ 2022-02-20 4:23 ` Matthew Wilcox
2022-02-20 4:38 ` Jens Axboe
2022-02-22 8:18 ` Christoph Hellwig
0 siblings, 2 replies; 32+ messages in thread
From: Matthew Wilcox @ 2022-02-20 4:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Stefan Roesch, io-uring, linux-fsdevel, linux-block, kernel-team
On Fri, Feb 18, 2022 at 11:35:10PM -0800, Christoph Hellwig wrote:
> Err, hell no. Please do not add any new functionality to the legacy
> buffer head code. If you want new features do that on the
> non-bufferhead iomap code path only please.
I think "first convert the block device code from buffer_heads to iomap"
might be a bit much of a prerequisite. I think running ext4 on top of a
block device still requires buffer_heads, for example (I tried to convert
the block device to use mpage in order to avoid creating buffer_heads
when possible, and ext4 stopped working. I didn't try too hard to debug
it as it was a bit of a distraction at the time).
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
2022-02-20 4:23 ` Matthew Wilcox
@ 2022-02-20 4:38 ` Jens Axboe
2022-02-20 4:51 ` Jens Axboe
2022-02-22 8:18 ` Christoph Hellwig
1 sibling, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-02-20 4:38 UTC (permalink / raw)
To: Matthew Wilcox, Christoph Hellwig
Cc: Stefan Roesch, io-uring, linux-fsdevel, linux-block, kernel-team
On 2/19/22 9:23 PM, Matthew Wilcox wrote:
> On Fri, Feb 18, 2022 at 11:35:10PM -0800, Christoph Hellwig wrote:
>> Err, hell no. Please do not add any new functionality to the legacy
>> buffer head code. If you want new features do that on the
>> non-bufferhead iomap code path only please.
>
> I think "first convert the block device code from buffer_heads to
> iomap" might be a bit much of a prerequisite. I think running ext4 on
> top of a
Yes, that's exactly what Christoph was trying to say, but failing to
state in an appropriate manner. And we did actually discuss that, I'm
not against doing something like that.
> block device still requires buffer_heads, for example (I tried to convert
> the block device to use mpage in order to avoid creating buffer_heads
> when possible, and ext4 stopped working. I didn't try too hard to debug
> it as it was a bit of a distraction at the time).
That's one of the main reasons why I didn't push this particular path,
as it is a bit fraught with weirdness and legacy buffer_head code which
isn't that easy to tackle...
--
Jens Axboe
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
2022-02-20 4:38 ` Jens Axboe
@ 2022-02-20 4:51 ` Jens Axboe
0 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2022-02-20 4:51 UTC (permalink / raw)
To: Matthew Wilcox, Christoph Hellwig
Cc: Stefan Roesch, io-uring, linux-fsdevel, linux-block, kernel-team
On 2/19/22 9:38 PM, Jens Axboe wrote:
> On 2/19/22 9:23 PM, Matthew Wilcox wrote:
>> On Fri, Feb 18, 2022 at 11:35:10PM -0800, Christoph Hellwig wrote:
>>> Err, hell no. Please do not add any new functionality to the legacy
>>> buffer head code. If you want new features do that on the
>>> non-bufferhead iomap code path only please.
>>
>> I think "first convert the block device code from buffer_heads to
>> iomap" might be a bit much of a prerequisite. I think running ext4 on
>> top of a
>
> Yes, that's exactly what Christoph was trying to say, but failing to
> state in an appropriate manner. And we did actually discuss that, I'm
> not against doing something like that.
Just to be clear, I do agree with you that it's an unfair ask for this
change. And as you mentioned, ext4 would require the buffer_head code
to be touched anyway, just layering on top of the necessary changes
for the bdev code.
--
Jens Axboe
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 00/13] Support sync buffered writes for io-uring
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
` (12 preceding siblings ...)
2022-02-18 19:57 ` [PATCH v2 13/13] block: enable async buffered writes for block devices Stefan Roesch
@ 2022-02-20 22:38 ` Dave Chinner
13 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2022-02-20 22:38 UTC (permalink / raw)
To: Stefan Roesch; +Cc: io-uring, linux-fsdevel, linux-block, kernel-team
On Fri, Feb 18, 2022 at 11:57:26AM -0800, Stefan Roesch wrote:
> This patch series adds support for async buffered writes. Currently
> io-uring only supports buffered writes in the slow path, by processing
> them in the io workers. With this patch series it is now possible to
> support buffered writes in the fast path. To be able to use the fast
> path the required pages must be in the page cache or they can be loaded
> with noio. Otherwise they still get punted to the slow path.
Where's the filesystem support? You need to plumb in ext4 to this
bufferhead support, and add iomap/xfs support as well so we can
shake out all the problems with APIs and fallback paths that are
needed for full support of buffered writes via io_uring.
> If a buffered write request requires more than one page, it is possible
> that only part of the request can use the fast path, the resst will be
> completed by the io workers.
That's ugly, especially at the filesystem/iomap layer where we are
doing delayed allocation and so partial writes like this could have
significant extra impact. It opens up the possibility of things like
ENOSPC/EDQUOT mid-way through the write instead of being an up-front
error, and so there's lots more complexity in the failure/fallback
paths that the io_uring infrastructure will have to handle
correctly...
Also, it breaks the "atomic buffered write" design of iomap/XFS
where other readers and writers will only see whole completed writes
and not intermediate partial writes. This is where a lot of the bugs
in the DIO io_uring support were found (deadlocks, data corruptions,
etc), so there's a bunch of semantic and API issues that filesystems
require from io_uring that need to be sorted out before we think
about merging buffered write support...
Cheers,
Dave.
--
Dave Chinner
[email protected]
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 06/13] fs: Add gfp_t parameter to create_page_buffers()
2022-02-18 19:57 ` [PATCH v2 06/13] fs: Add gfp_t parameter to create_page_buffers() Stefan Roesch
@ 2022-02-21 0:18 ` kernel test robot
0 siblings, 0 replies; 32+ messages in thread
From: kernel test robot @ 2022-02-21 0:18 UTC (permalink / raw)
To: Stefan Roesch, io-uring, linux-fsdevel, linux-block, kernel-team
Cc: kbuild-all, shr
Hi Stefan,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on 9195e5e0adbb8a9a5ee9ef0f9dedf6340d827405]
url: https://github.com/0day-ci/linux/commits/Stefan-Roesch/Support-sync-buffered-writes-for-io-uring/20220220-172629
base: 9195e5e0adbb8a9a5ee9ef0f9dedf6340d827405
config: sparc64-randconfig-s031-20220220 (https://download.01.org/0day-ci/archive/20220221/[email protected]/config)
compiler: sparc64-linux-gcc (GCC) 11.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# apt-get install sparse
# sparse version: v0.6.4-dirty
# https://github.com/0day-ci/linux/commit/e98a7c2a17960f81efc5968cbc386af7c088a8ed
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Stefan-Roesch/Support-sync-buffered-writes-for-io-uring/20220220-172629
git checkout e98a7c2a17960f81efc5968cbc386af7c088a8ed
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=sparc64 SHELL=/bin/bash
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
sparse warnings: (new ones prefixed by >>)
>> fs/buffer.c:2010:60: sparse: sparse: incorrect type in argument 4 (different base types) @@ expected restricted gfp_t [usertype] flags @@ got unsigned int flags @@
fs/buffer.c:2010:60: sparse: expected restricted gfp_t [usertype] flags
fs/buffer.c:2010:60: sparse: got unsigned int flags
>> fs/buffer.c:2147:87: sparse: sparse: incorrect type in argument 6 (different base types) @@ expected unsigned int flags @@ got restricted gfp_t [assigned] [usertype] gfp @@
fs/buffer.c:2147:87: sparse: expected unsigned int flags
fs/buffer.c:2147:87: sparse: got restricted gfp_t [assigned] [usertype] gfp
vim +2010 fs/buffer.c
1991
1992  int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
1993                  get_block_t *get_block, const struct iomap *iomap,
1994                  unsigned int flags)
1995  {
1996          unsigned from = pos & (PAGE_SIZE - 1);
1997          unsigned to = from + len;
1998          struct inode *inode = folio->mapping->host;
1999          unsigned block_start, block_end;
2000          sector_t block;
2001          int err = 0;
2002          unsigned blocksize, bbits;
2003          struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
2004
2005          BUG_ON(!folio_test_locked(folio));
2006          BUG_ON(from > PAGE_SIZE);
2007          BUG_ON(to > PAGE_SIZE);
2008          BUG_ON(from > to);
2009
> 2010        head = create_page_buffers(&folio->page, inode, 0, flags);
2011          blocksize = head->b_size;
2012          bbits = block_size_bits(blocksize);
2013
2014          block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
2015
2016          for(bh = head, block_start = 0; bh != head || !block_start;
2017              block++, block_start=block_end, bh = bh->b_this_page) {
2018                  block_end = block_start + blocksize;
2019                  if (block_end <= from || block_start >= to) {
2020                          if (folio_test_uptodate(folio)) {
2021                                  if (!buffer_uptodate(bh))
2022                                          set_buffer_uptodate(bh);
2023                          }
2024                          continue;
2025                  }
2026                  if (buffer_new(bh))
2027                          clear_buffer_new(bh);
2028                  if (!buffer_mapped(bh)) {
2029                          WARN_ON(bh->b_size != blocksize);
2030                          if (get_block) {
2031                                  err = get_block(inode, block, bh, 1);
2032                                  if (err)
2033                                          break;
2034                          } else {
2035                                  iomap_to_bh(inode, block, bh, iomap);
2036                          }
2037
2038                          if (buffer_new(bh)) {
2039                                  clean_bdev_bh_alias(bh);
2040                                  if (folio_test_uptodate(folio)) {
2041                                          clear_buffer_new(bh);
2042                                          set_buffer_uptodate(bh);
2043                                          mark_buffer_dirty(bh);
2044                                          continue;
2045                                  }
2046                                  if (block_end > to || block_start < from)
2047                                          folio_zero_segments(folio,
2048                                                  to, block_end,
2049                                                  block_start, from);
2050                                  continue;
2051                          }
2052                  }
2053                  if (folio_test_uptodate(folio)) {
2054                          if (!buffer_uptodate(bh))
2055                                  set_buffer_uptodate(bh);
2056                          continue;
2057                  }
2058                  if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
2059                      !buffer_unwritten(bh) &&
2060                      (block_start < from || block_end > to)) {
2061                          ll_rw_block(REQ_OP_READ, 0, 1, &bh);
2062                          *wait_bh++=bh;
2063                  }
2064          }
2065          /*
2066           * If we issued read requests - let them complete.
2067           */
2068          while(wait_bh > wait) {
2069                  wait_on_buffer(*--wait_bh);
2070                  if (!buffer_uptodate(*wait_bh))
2071                          err = -EIO;
2072          }
2073          if (unlikely(err))
2074                  page_zero_new_buffers(&folio->page, from, to);
2075          return err;
2076  }
2077
2078  int __block_write_begin(struct page *page, loff_t pos, unsigned len,
2079                  get_block_t *get_block)
2080  {
2081          return __block_write_begin_int(page_folio(page), pos, len, get_block,
2082                                         NULL, 0);
2083  }
2084  EXPORT_SYMBOL(__block_write_begin);
2085
2086  static int __block_commit_write(struct inode *inode, struct page *page,
2087                  unsigned from, unsigned to)
2088  {
2089          unsigned block_start, block_end;
2090          int partial = 0;
2091          unsigned blocksize;
2092          struct buffer_head *bh, *head;
2093
2094          bh = head = page_buffers(page);
2095          blocksize = bh->b_size;
2096
2097          block_start = 0;
2098          do {
2099                  block_end = block_start + blocksize;
2100                  if (block_end <= from || block_start >= to) {
2101                          if (!buffer_uptodate(bh))
2102                                  partial = 1;
2103                  } else {
2104                          set_buffer_uptodate(bh);
2105                          mark_buffer_dirty(bh);
2106                  }
2107                  if (buffer_new(bh))
2108                          clear_buffer_new(bh);
2109
2110                  block_start = block_end;
2111                  bh = bh->b_this_page;
2112          } while (bh != head);
2113
2114          /*
2115           * If this is a partial write which happened to make all buffers
2116           * uptodate then we can optimize away a bogus readpage() for
2117           * the next read(). Here we 'discover' whether the page went
2118           * uptodate as a result of this (potentially partial) write.
2119           */
2120          if (!partial)
2121                  SetPageUptodate(page);
2122          return 0;
2123  }
2124
2125  /*
2126   * block_write_begin takes care of the basic task of block allocation and
2127   * bringing partial write blocks uptodate first.
2128   *
2129   * The filesystem needs to handle block truncation upon failure.
2130   */
2131  int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
2132                  unsigned flags, struct page **pagep, get_block_t *get_block)
2133  {
2134          pgoff_t index = pos >> PAGE_SHIFT;
2135          struct page *page;
2136          int status;
2137          gfp_t gfp = 0;
2138          bool no_wait = (flags & AOP_FLAG_NOWAIT);
2139
2140          if (no_wait)
2141                  gfp = GFP_ATOMIC | __GFP_NOWARN;
2142
2143          page = grab_cache_page_write_begin(mapping, index, flags);
2144          if (!page)
2145                  return -ENOMEM;
2146
> 2147        status = __block_write_begin_int(page_folio(page), pos, len, get_block, NULL, gfp);
2148          if (unlikely(status)) {
2149                  unlock_page(page);
2150                  put_page(page);
2151                  page = NULL;
2152          }
2153
2154          *pagep = page;
2155          return status;
2156  }
2157  EXPORT_SYMBOL(block_write_begin);
2158
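[For illustration only, not from the robot report: both warnings stem from
__block_write_begin_int() declaring its new parameter as unsigned int while
the modified create_page_buffers() and block_write_begin() treat the value as
gfp_t. A type-consistent sketch of the flagged lines would be:]

        int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
                        get_block_t *get_block, const struct iomap *iomap,
                        gfp_t gfp);

        /* fs/buffer.c:2010 */
        head = create_page_buffers(&folio->page, inode, 0, gfp);

        /* fs/buffer.c:2147 */
        status = __block_write_begin_int(page_folio(page), pos, len,
                                         get_block, NULL, gfp);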
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
2022-02-20 4:23 ` Matthew Wilcox
2022-02-20 4:38 ` Jens Axboe
@ 2022-02-22 8:18 ` Christoph Hellwig
2022-02-22 23:19 ` Jens Axboe
1 sibling, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2022-02-22 8:18 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Christoph Hellwig, Stefan Roesch, io-uring, linux-fsdevel,
linux-block, kernel-team
On Sun, Feb 20, 2022 at 04:23:50AM +0000, Matthew Wilcox wrote:
> On Fri, Feb 18, 2022 at 11:35:10PM -0800, Christoph Hellwig wrote:
> > Err, hell no. Please do not add any new functionality to the legacy
> > buffer head code. If you want new features do that on the
> > non-bufferhead iomap code path only please.
>
> I think "first convert the block device code from buffer_heads to iomap"
> might be a bit much of a prerequisite. I think running ext4 on top of a
> block device still requires buffer_heads, for example (I tried to convert
> the block device to use mpage in order to avoid creating buffer_heads
> when possible, and ext4 stopped working. I didn't try too hard to debug
> it as it was a bit of a distraction at the time).
Oh, I did not spot that the user here is the block device. Which is really
weird: why would anyone do buffered writes to a block device? Doing
so is a bit of a data integrity nightmare.
Can we please develop this feature for iomap based file systems first,
and if by then a use case for block devices arises I'll see what we can
do there. I've been planning to get the block device code to stop using
buffer_heads by default, but taking them into account if used by a
legacy buffer_head user anyway.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 04/13] fs: split off __alloc_page_buffers function
2022-02-22 8:18 ` Christoph Hellwig
@ 2022-02-22 23:19 ` Jens Axboe
0 siblings, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2022-02-22 23:19 UTC (permalink / raw)
To: Christoph Hellwig, Matthew Wilcox
Cc: Stefan Roesch, io-uring, linux-fsdevel, linux-block, kernel-team
On 2/22/22 1:18 AM, Christoph Hellwig wrote:
> On Sun, Feb 20, 2022 at 04:23:50AM +0000, Matthew Wilcox wrote:
>> On Fri, Feb 18, 2022 at 11:35:10PM -0800, Christoph Hellwig wrote:
>>> Err, hell no. Please do not add any new functionality to the legacy
>>> buffer head code. If you want new features do that on the
>>> non-bufferhead iomap code path only please.
>>
>> I think "first convert the block device code from buffer_heads to iomap"
>> might be a bit much of a prerequisite. I think running ext4 on top of a
>> block device still requires buffer_heads, for example (I tried to convert
>> the block device to use mpage in order to avoid creating buffer_heads
>> when possible, and ext4 stopped working. I didn't try too hard to debug
>> it as it was a bit of a distraction at the time).
>
> Oh, I did not spot that the user here is the block device. Which is
> really weird: why would anyone do buffered writes to a block device?
> Doing so is a bit of a data integrity nightmare.
>
> Can we please develop this feature for iomap based file systems first,
> and if by then a use case for block devices arises I'll see what we
> can do there.
The original plan wasn't to develop bdev async writes as a separate
useful feature, but rather to do it as a first step to both become
acquainted with the code base and solve some of the common issues for
both.
The fact that we need to touch buffer_heads for the bdev path is
annoying, and something that I'd very much rather just avoid. And
converting bdev to iomap first is a waste of time, exactly because it's
not a separately useful feature.
Hence I think we'll change gears here and start with iomap and XFS
instead.
> I've been planning to get the block device code to stop using
> buffer_heads by default, but taking them into account if used by a
> legacy buffer_head user anyway.
That would indeed be great, and to be honest, the current code for bdev
read/write doesn't make much sense except from a historical point of
view.
--
Jens Axboe
^ permalink raw reply [flat|nested] 32+ messages in thread
Thread overview: 32+ messages
2022-02-18 19:57 [PATCH v2 00/13] Support sync buffered writes for io-uring Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 01/13] fs: Add flags parameter to __block_write_begin_int Stefan Roesch
2022-02-18 19:59 ` Matthew Wilcox
2022-02-18 20:08 ` Stefan Roesch
2022-02-18 20:13 ` Matthew Wilcox
2022-02-18 20:14 ` Stefan Roesch
2022-02-18 20:22 ` Matthew Wilcox
2022-02-18 20:25 ` Stefan Roesch
2022-02-18 20:35 ` Matthew Wilcox
2022-02-18 20:39 ` Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 02/13] mm: Introduce do_generic_perform_write Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 03/13] mm: Add support for async buffered writes Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 04/13] fs: split off __alloc_page_buffers function Stefan Roesch
2022-02-18 20:42 ` Matthew Wilcox
2022-02-18 20:50 ` Stefan Roesch
2022-02-19 7:35 ` Christoph Hellwig
2022-02-20 4:23 ` Matthew Wilcox
2022-02-20 4:38 ` Jens Axboe
2022-02-20 4:51 ` Jens Axboe
2022-02-22 8:18 ` Christoph Hellwig
2022-02-22 23:19 ` Jens Axboe
2022-02-18 19:57 ` [PATCH v2 05/13] fs: split off __create_empty_buffers function Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 06/13] fs: Add gfp_t parameter to create_page_buffers() Stefan Roesch
2022-02-21 0:18 ` kernel test robot
2022-02-18 19:57 ` [PATCH v2 07/13] fs: add support for async buffered writes Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 08/13] io_uring: " Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 09/13] io_uring: Add tracepoint for short writes Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 10/13] sched: add new fields to task_struct Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 11/13] mm: support write throttling for async buffered writes Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 12/13] io_uring: " Stefan Roesch
2022-02-18 19:57 ` [PATCH v2 13/13] block: enable async buffered writes for block devices Stefan Roesch
2022-02-20 22:38 ` [PATCH v2 00/13] Support sync buffered writes for io-uring Dave Chinner