* [RFC PATCH v1 01/18] block: add check for async buffered writes to generic_write_checks
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
This introduces the FMODE_BUF_WASYNC flag. File systems that support async
buffered writes can set this flag. It also modifies the check in
generic_write_checks() to take async buffered writes into consideration.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/read_write.c | 3 ++-
include/linux/fs.h | 3 +++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index e643aec2b0ef..f75d75f7bc84 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1633,7 +1633,8 @@ int generic_write_checks_count(struct kiocb *iocb, loff_t *count)
if (iocb->ki_flags & IOCB_APPEND)
iocb->ki_pos = i_size_read(inode);
- if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+ if ((iocb->ki_flags & IOCB_NOWAIT) &&
+ !((iocb->ki_flags & IOCB_DIRECT) || (file->f_mode & FMODE_BUF_WASYNC)))
return -EINVAL;
return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bbde95387a23..3b479d02e210 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -177,6 +177,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File supports async buffered reads */
#define FMODE_BUF_RASYNC ((__force fmode_t)0x40000000)
+/* File supports async nowait buffered writes */
+#define FMODE_BUF_WASYNC ((__force fmode_t)0x80000000)
+
/*
* Attribute flags. These should be or-ed together to figure out what
* has been changed!
--
2.30.2
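For illustration, a filesystem opts in by setting the new flag on the struct file it hands out, typically at open time. The hunk that does this for XFS is not part of this excerpt, so the open helper below is only a sketch with a made-up name:

static int example_file_open(struct inode *inode, struct file *file)
{
	/* Tell the VFS that ->write_iter() can handle IOCB_NOWAIT buffered writes. */
	file->f_mode |= FMODE_BUF_WASYNC;
	return generic_file_open(inode, file);
}

With the flag set, the generic write checks no longer reject an IOCB_NOWAIT write that is not IOCB_DIRECT, so io_uring can attempt the buffered write inline.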
* [RFC PATCH v1 02/18] mm: add FGP_ATOMIC flag to __filemap_get_folio()
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
Define FGP_ATOMIC flag and add support for the new flag in the function
__filemap_get_folio().
Signed-off-by: Stefan Roesch <[email protected]>
---
include/linux/pagemap.h | 1 +
mm/filemap.c | 3 +++
2 files changed, 4 insertions(+)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 993994cd943a..b8d839e10780 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -509,6 +509,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
#define FGP_HEAD 0x00000080
#define FGP_ENTRY 0x00000100
#define FGP_STABLE 0x00000200
+#define FGP_ATOMIC 0x00000400
struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
int fgp_flags, gfp_t gfp);
diff --git a/mm/filemap.c b/mm/filemap.c
index 9a1eef6c5d35..dd3c5682276e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1929,6 +1929,7 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
* * %FGP_NOFS - __GFP_FS will get cleared in gfp.
* * %FGP_NOWAIT - Don't get blocked by page lock.
* * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
+ * * %FGP_ATOMIC - Use atomic allocations
*
* If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
* if the %GFP flags specified for %FGP_CREAT are atomic.
@@ -1988,6 +1989,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
gfp |= __GFP_WRITE;
if (fgp_flags & FGP_NOFS)
gfp &= ~__GFP_FS;
+ if (fgp_flags & FGP_ATOMIC)
+ gfp |= GFP_ATOMIC;
folio = filemap_alloc_folio(gfp, 0);
if (!folio)
--
2.30.2
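For illustration only, a nowait caller would combine the new flag with FGP_NOWAIT roughly as below (a sketch of the lookup that patch 05 performs; note the objection in the reply that follows, the flag is dropped again in the next revision):

	struct folio *folio;

	folio = __filemap_get_folio(mapping, index,
			FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE |
			FGP_NOWAIT | FGP_ATOMIC,
			mapping_gfp_mask(mapping));
	if (!folio)
		return -EAGAIN;		/* retry the write from the io-worker */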
* Re: [RFC PATCH v1 02/18] mm: add FGP_ATOMIC flag to __filemap_get_folio()
From: Matthew Wilcox @ 2022-04-26 19:06 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david
On Tue, Apr 26, 2022 at 10:43:19AM -0700, Stefan Roesch wrote:
> Define FGP_ATOMIC flag and add support for the new flag in the function
> __filemap_get_folio().
No. You don't get to use the emergency memory pools for doing writes.
That completely screws up the MM.
* Re: [RFC PATCH v1 02/18] mm: add FGP_ATOMIC flag to __filemap_get_folio()
From: Stefan Roesch @ 2022-04-28 19:54 UTC (permalink / raw)
To: Matthew Wilcox
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david
On 4/26/22 12:06 PM, Matthew Wilcox wrote:
> On Tue, Apr 26, 2022 at 10:43:19AM -0700, Stefan Roesch wrote:
>> Define FGP_ATOMIC flag and add support for the new flag in the function
>> __filemap_get_folio().
>
> No. You don't get to use the emergency memory pools for doing writes.
> That completely screws up the MM.
The next version of the patch series will drop this patch and remove the
need for the atomic flag.
* [RFC PATCH v1 03/18] iomap: add iomap_page_create_gfp to allocate iomap_pages
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
Add the function iomap_page_create_gfp(), which allows the caller to
specify the gfp allocation flags and to pass in the number of blocks per
folio.
No intended functional changes in this patch.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/iomap/buffered-io.c | 34 ++++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8ce8720093b9..85aa32f50db0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -43,17 +43,27 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
static struct bio_set iomap_ioend_bioset;
+/**
+ * iomap_page_create_gfp : Create and initialize iomap_page for folio.
+ * @inode : Pointer to inode
+ * @folio : Pointer to folio
+ * @nr_blocks : Number of blocks in the folio
+ * @gfp : gfp allocation flags
+ *
+ * This function returns a newly allocated iomap for the folio with the settings
+ * specified in the gfp parameter.
+ *
+ **/
static struct iomap_page *
-iomap_page_create(struct inode *inode, struct folio *folio)
+iomap_page_create_gfp(struct inode *inode, struct folio *folio,
+ unsigned int nr_blocks, gfp_t gfp)
{
- struct iomap_page *iop = to_iomap_page(folio);
- unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+ struct iomap_page *iop;
- if (iop || nr_blocks <= 1)
+ iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)), gfp);
+ if (!iop)
return iop;
- iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
- GFP_NOFS | __GFP_NOFAIL);
spin_lock_init(&iop->uptodate_lock);
if (folio_test_uptodate(folio))
bitmap_fill(iop->uptodate, nr_blocks);
@@ -61,6 +71,18 @@ iomap_page_create(struct inode *inode, struct folio *folio)
return iop;
}
+static struct iomap_page *
+iomap_page_create(struct inode *inode, struct folio *folio)
+{
+ struct iomap_page *iop = to_iomap_page(folio);
+ unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+
+ if (iop || nr_blocks <= 1)
+ return iop;
+
+ return iomap_page_create_gfp(inode, folio, nr_blocks, GFP_NOFS | __GFP_NOFAIL);
+}
+
static void iomap_page_release(struct folio *folio)
{
struct iomap_page *iop = folio_detach_private(folio);
--
2.30.2
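The split keeps existing callers unchanged while letting a nowait caller pick its own gfp flags and handle allocation failure itself; roughly (sketch only):

	/* Existing callers keep the blocking, no-fail behaviour: */
	iop = iomap_page_create(inode, folio);

	/* A nowait caller (see patch 05) can ask for a non-sleeping allocation: */
	iop = iomap_page_create_gfp(inode, folio,
			i_blocks_per_folio(inode, folio),
			GFP_NOIO | GFP_ATOMIC);
	if (!iop)
		return -EAGAIN;		/* retry from the io-worker instead */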
* [RFC PATCH v1 04/18] iomap: use iomap_page_create_gfp() in __iomap_write_begin
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
This change uses the new iomap_page_create_gfp() function in
__iomap_write_begin().
No intended functional changes in this patch.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/iomap/buffered-io.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 85aa32f50db0..1ffdc7078e7d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -572,17 +572,22 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
size_t len, struct folio *folio)
{
const struct iomap *srcmap = iomap_iter_srcmap(iter);
- struct iomap_page *iop = iomap_page_create(iter->inode, folio);
+ struct iomap_page *iop = to_iomap_page(folio);
loff_t block_size = i_blocksize(iter->inode);
loff_t block_start = round_down(pos, block_size);
loff_t block_end = round_up(pos + len, block_size);
+ unsigned int nr_blocks = i_blocks_per_folio(iter->inode, folio);
size_t from = offset_in_folio(folio, pos), to = from + len;
size_t poff, plen;
+ gfp_t gfp = GFP_NOFS | __GFP_NOFAIL;
if (folio_test_uptodate(folio))
return 0;
folio_clear_error(folio);
+ if (!iop && nr_blocks > 1)
+ iop = iomap_page_create_gfp(iter->inode, folio, nr_blocks, gfp);
+
do {
iomap_adjust_read_range(iter->inode, folio, &block_start,
block_end - block_start, &poff, &plen);
--
2.30.2
* [RFC PATCH v1 05/18] iomap: add async buffered write support
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
This adds async buffered write support to iomap. The support is focused
on the changes necessary to support XFS with iomap.
Support for other filesystems might require additional changes.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/iomap/buffered-io.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 1ffdc7078e7d..5c53a5715c85 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -580,13 +580,20 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
size_t from = offset_in_folio(folio, pos), to = from + len;
size_t poff, plen;
gfp_t gfp = GFP_NOFS | __GFP_NOFAIL;
+ bool no_wait = (iter->flags & IOMAP_NOWAIT);
+
+ if (no_wait)
+ gfp = GFP_NOIO | GFP_ATOMIC;
if (folio_test_uptodate(folio))
return 0;
folio_clear_error(folio);
- if (!iop && nr_blocks > 1)
+ if (!iop && nr_blocks > 1) {
iop = iomap_page_create_gfp(iter->inode, folio, nr_blocks, gfp);
+ if (no_wait && !iop)
+ return -EAGAIN;
+ }
do {
iomap_adjust_read_range(iter->inode, folio, &block_start,
@@ -603,6 +610,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE))
return -EIO;
folio_zero_segments(folio, poff, from, to, poff + plen);
+ } else if (no_wait) {
+ return -EAGAIN;
} else {
int status = iomap_read_folio_sync(block_start, folio,
poff, plen, srcmap);
@@ -633,6 +642,9 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
int status = 0;
+ if (iter->flags & IOMAP_NOWAIT)
+ fgp |= FGP_NOWAIT | FGP_ATOMIC;
+
BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
if (srcmap != &iter->iomap)
BUG_ON(pos + len > srcmap->offset + srcmap->length);
@@ -790,6 +802,10 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
* Otherwise there's a nasty deadlock on copying from the
* same page as we're writing to, without it being marked
* up-to-date.
+ *
+ * For async buffered writes the assumption is that the user
+ * page has already been faulted in. This can be optimized by
+ * faulting the user page in the prepare phase of io-uring.
*/
if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
status = -EFAULT;
@@ -845,6 +861,9 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
};
int ret;
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ iter.flags |= IOMAP_NOWAIT;
+
while ((ret = iomap_iter(&iter, ops)) > 0)
iter.processed = iomap_write_iter(&iter, i);
if (iter.pos == iocb->ki_pos)
--
2.30.2
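Summarising the nowait path this patch creates (illustrative comment only, not new code):

/*
 * iomap_file_buffered_write()
 *     IOCB_NOWAIT            ->  iter.flags |= IOMAP_NOWAIT
 * iomap_write_begin()
 *     IOMAP_NOWAIT           ->  fgp |= FGP_NOWAIT | FGP_ATOMIC
 * __iomap_write_begin()
 *     IOMAP_NOWAIT           ->  gfp = GFP_NOIO | GFP_ATOMIC
 *     iomap_page allocation fails          ->  -EAGAIN
 *     a sub-folio read would be required   ->  -EAGAIN
 *
 * The -EAGAIN is propagated back to io_uring, which completes the write
 * from an io-worker where blocking is allowed.
 */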
* [RFC PATCH v1 06/18] xfs: add iomap async buffered write support
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
This adds the async buffered write support to the iomap layer of XFS. If
a lock cannot be acquired or additional reads need to be performed, the
request will return -EAGAIN in case this is an async buffered write request.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/xfs/xfs_iomap.c | 33 +++++++++++++++++++++++++++++++--
1 file changed, 31 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index e552ce541ec2..80b6c48e88af 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -881,18 +881,28 @@ xfs_buffered_write_iomap_begin(
bool eof = false, cow_eof = false, shared = false;
int allocfork = XFS_DATA_FORK;
int error = 0;
+ bool no_wait = (flags & IOMAP_NOWAIT);
if (xfs_is_shutdown(mp))
return -EIO;
/* we can't use delayed allocations when using extent size hints */
- if (xfs_get_extsz_hint(ip))
+ if (xfs_get_extsz_hint(ip)) {
+ if (no_wait)
+ return -EAGAIN;
+
return xfs_direct_write_iomap_begin(inode, offset, count,
flags, iomap, srcmap);
+ }
ASSERT(!XFS_IS_REALTIME_INODE(ip));
- xfs_ilock(ip, XFS_ILOCK_EXCL);
+ if (no_wait) {
+ if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
+ return -EAGAIN;
+ } else {
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+ }
if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(&ip->i_df)) ||
XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
@@ -902,6 +912,11 @@ xfs_buffered_write_iomap_begin(
XFS_STATS_INC(mp, xs_blk_mapw);
+ if (no_wait && xfs_need_iread_extents(XFS_IFORK_PTR(ip, XFS_DATA_FORK))) {
+ error = -EAGAIN;
+ goto out_unlock;
+ }
+
error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
if (error)
goto out_unlock;
@@ -933,6 +948,10 @@ xfs_buffered_write_iomap_begin(
if (xfs_is_cow_inode(ip)) {
if (!ip->i_cowfp) {
ASSERT(!xfs_is_reflink_inode(ip));
+ if (no_wait) {
+ error = -EAGAIN;
+ goto out_unlock;
+ }
xfs_ifork_init_cow(ip);
}
cow_eof = !xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb,
@@ -956,6 +975,11 @@ xfs_buffered_write_iomap_begin(
goto found_imap;
}
+ if (no_wait) {
+ error = -EAGAIN;
+ goto out_unlock;
+ }
+
xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
/* Trim the mapping to the nearest shared extent boundary. */
@@ -993,6 +1017,11 @@ xfs_buffered_write_iomap_begin(
allocfork = XFS_COW_FORK;
}
+ if (no_wait) {
+ error = -EAGAIN;
+ goto out_unlock;
+ }
+
error = xfs_qm_dqattach_locked(ip, false);
if (error)
goto out_unlock;
--
2.30.2
* Re: [RFC PATCH v1 06/18] xfs: add iomap async buffered write support
From: Dave Chinner @ 2022-04-26 22:54 UTC (permalink / raw)
To: Stefan Roesch; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
On Tue, Apr 26, 2022 at 10:43:23AM -0700, Stefan Roesch wrote:
> This adds the async buffered write support to the iomap layer of XFS. If
> a lock cannot be acquired or additional reads need to be performed, the
> request will return -EAGAIN in case this is an async buffered write request.
>
> Signed-off-by: Stefan Roesch <[email protected]>
> ---
> fs/xfs/xfs_iomap.c | 33 +++++++++++++++++++++++++++++++--
> 1 file changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index e552ce541ec2..80b6c48e88af 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -881,18 +881,28 @@ xfs_buffered_write_iomap_begin(
> bool eof = false, cow_eof = false, shared = false;
> int allocfork = XFS_DATA_FORK;
> int error = 0;
> + bool no_wait = (flags & IOMAP_NOWAIT);
>
> if (xfs_is_shutdown(mp))
> return -EIO;
>
> /* we can't use delayed allocations when using extent size hints */
> - if (xfs_get_extsz_hint(ip))
> + if (xfs_get_extsz_hint(ip)) {
> + if (no_wait)
> + return -EAGAIN;
> +
> return xfs_direct_write_iomap_begin(inode, offset, count,
> flags, iomap, srcmap);
Why can't we do IOMAP_NOWAIT allocation here?
xfs_direct_write_iomap_begin() supports IOMAP_NOWAIT semantics just
fine - that's what the direct IO path uses...
> + }
>
> ASSERT(!XFS_IS_REALTIME_INODE(ip));
>
> - xfs_ilock(ip, XFS_ILOCK_EXCL);
> + if (no_wait) {
> + if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> + return -EAGAIN;
> + } else {
> + xfs_ilock(ip, XFS_ILOCK_EXCL);
> + }
handled by xfs_ilock_for_iomap().
>
> if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(&ip->i_df)) ||
> XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
> @@ -902,6 +912,11 @@ xfs_buffered_write_iomap_begin(
>
> XFS_STATS_INC(mp, xs_blk_mapw);
>
> + if (no_wait && xfs_need_iread_extents(XFS_IFORK_PTR(ip, XFS_DATA_FORK))) {
> + error = -EAGAIN;
> + goto out_unlock;
> + }
> +
Also handled by xfs_ilock_for_iomap().
> error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
> if (error)
> goto out_unlock;
> @@ -933,6 +948,10 @@ xfs_buffered_write_iomap_begin(
> if (xfs_is_cow_inode(ip)) {
> if (!ip->i_cowfp) {
> ASSERT(!xfs_is_reflink_inode(ip));
> + if (no_wait) {
> + error = -EAGAIN;
> + goto out_unlock;
> + }
> xfs_ifork_init_cow(ip);
> }
Also handled by xfs_ilock_for_iomap().
> cow_eof = !xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb,
> @@ -956,6 +975,11 @@ xfs_buffered_write_iomap_begin(
> goto found_imap;
> }
>
> + if (no_wait) {
> + error = -EAGAIN;
> + goto out_unlock;
> + }
> +
Won't get here because this is COW path, and xfs_ilock_for_iomap()
handles returning -EAGAIN for that case.
> xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
>
> /* Trim the mapping to the nearest shared extent boundary. */
> @@ -993,6 +1017,11 @@ xfs_buffered_write_iomap_begin(
> allocfork = XFS_COW_FORK;
> }
>
> + if (no_wait) {
> + error = -EAGAIN;
> + goto out_unlock;
> + }
Why? Delayed allocation doesn't block...
Cheers,
Dave.
--
Dave Chinner
[email protected]
* Re: [RFC PATCH v1 06/18] xfs: add iomap async buffered write support
From: Stefan Roesch @ 2022-04-28 20:03 UTC (permalink / raw)
To: Dave Chinner; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
On 4/26/22 3:54 PM, Dave Chinner wrote:
> On Tue, Apr 26, 2022 at 10:43:23AM -0700, Stefan Roesch wrote:
>> This adds the async buffered write support to the iomap layer of XFS. If
>> a lock cannot be acquired or additional reads need to be performed, the
>> request will return -EAGAIN in case this is an async buffered write request.
>>
>> Signed-off-by: Stefan Roesch <[email protected]>
>> ---
>> fs/xfs/xfs_iomap.c | 33 +++++++++++++++++++++++++++++++--
>> 1 file changed, 31 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index e552ce541ec2..80b6c48e88af 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -881,18 +881,28 @@ xfs_buffered_write_iomap_begin(
>> bool eof = false, cow_eof = false, shared = false;
>> int allocfork = XFS_DATA_FORK;
>> int error = 0;
>> + bool no_wait = (flags & IOMAP_NOWAIT);
>>
>> if (xfs_is_shutdown(mp))
>> return -EIO;
>>
>> /* we can't use delayed allocations when using extent size hints */
>> - if (xfs_get_extsz_hint(ip))
>> + if (xfs_get_extsz_hint(ip)) {
>> + if (no_wait)
>> + return -EAGAIN;
>> +
>> return xfs_direct_write_iomap_begin(inode, offset, count,
>> flags, iomap, srcmap);
>
> Why can't we do IOMAP_NOWAIT allocation here?
> xfs_direct_write_iomap_begin() supports IOMAP_NOWAIT semantics just
> fine - that's what the direct IO path uses...
I'll make that change in the next version.
>
>> + }
>>
>> ASSERT(!XFS_IS_REALTIME_INODE(ip));
>>
>> - xfs_ilock(ip, XFS_ILOCK_EXCL);
>> + if (no_wait) {
>> + if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
>> + return -EAGAIN;
>> + } else {
>> + xfs_ilock(ip, XFS_ILOCK_EXCL);
>> + }
>
> handled by xfs_ilock_for_iomap().
The helper xfs_ilock_for_iomap() cannot be used for this purpose. It tries
to deduce the lock mode, but for xfs_file_buffered_write() the lock mode
must be exclusive. That is not always the case when xfs_ilock_for_iomap()
is used, for example when xfs_is_cow_inode() is not true.
I'll add a new helper that will be used here.
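For illustration, such a helper could look roughly like this (sketch only; the name and final form in the next revision may differ):

static int
xfs_ilock_excl_for_write(struct xfs_inode *ip, unsigned int flags)
{
	if (flags & IOMAP_NOWAIT) {
		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
			return -EAGAIN;
	} else {
		xfs_ilock(ip, XFS_ILOCK_EXCL);
	}
	return 0;
}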
>>
>> if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(&ip->i_df)) ||
>> XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
>> @@ -902,6 +912,11 @@ xfs_buffered_write_iomap_begin(
>>
>> XFS_STATS_INC(mp, xs_blk_mapw);
>>
>> + if (no_wait && xfs_need_iread_extents(XFS_IFORK_PTR(ip, XFS_DATA_FORK))) {
>> + error = -EAGAIN;
>> + goto out_unlock;
>> + }
>> +
>
> Also handled by xfs_ilock_for_iomap().
I'll remove this check with the next version.
>
>> error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
>> if (error)
>> goto out_unlock;
>> @@ -933,6 +948,10 @@ xfs_buffered_write_iomap_begin(
>> if (xfs_is_cow_inode(ip)) {
>> if (!ip->i_cowfp) {
>> ASSERT(!xfs_is_reflink_inode(ip));
>> + if (no_wait) {
>> + error = -EAGAIN;
>> + goto out_unlock;
>> + }
>> xfs_ifork_init_cow(ip);
>> }
>
> Also handled by xfs_ilock_for_iomap().
>
I'll remove this check with the next version.
>> cow_eof = !xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb,
>> @@ -956,6 +975,11 @@ xfs_buffered_write_iomap_begin(
>> goto found_imap;
>> }
>>
>> + if (no_wait) {
>> + error = -EAGAIN;
>> + goto out_unlock;
>> + }
>> +
>
> Won't get here because this is COW path, and xfs_ilock_for_iomap()
> handles returning -EAGAIN for that case.
I'll remove this check with the next version.
>
>> xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
>>
>> /* Trim the mapping to the nearest shared extent boundary. */
>> @@ -993,6 +1017,11 @@ xfs_buffered_write_iomap_begin(
>> allocfork = XFS_COW_FORK;
>> }
>>
>> + if (no_wait) {
>> + error = -EAGAIN;
>> + goto out_unlock;
>> + }
>
> Why? Delayed allocation doesn't block...
I'll remove this check with the next version.
>
> Cheers,
>
> Dave.
>
* [RFC PATCH v1 07/18] fs: split off need_remove_file_privs() do_remove_file_privs()
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
This splits off the functions need_file_remove_privs() and
do_file_remove_privs() from the function file_remove_privs().
No intended functional changes in this patch.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/inode.c | 56 +++++++++++++++++++++++++++++++---------------
include/linux/fs.h | 3 +++
2 files changed, 41 insertions(+), 18 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 9d9b422504d1..79c5702ef7c3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2010,17 +2010,8 @@ static int __remove_privs(struct user_namespace *mnt_userns,
return notify_change(mnt_userns, dentry, &newattrs, NULL);
}
-/*
- * Remove special file priviledges (suid, capabilities) when file is written
- * to or truncated.
- */
-int file_remove_privs(struct file *file)
+int need_file_remove_privs(struct inode *inode, struct dentry *dentry)
{
- struct dentry *dentry = file_dentry(file);
- struct inode *inode = file_inode(file);
- int kill;
- int error = 0;
-
/*
* Fast path for nothing security related.
* As well for non-regular files, e.g. blkdev inodes.
@@ -2030,16 +2021,37 @@ int file_remove_privs(struct file *file)
if (IS_NOSEC(inode) || !S_ISREG(inode->i_mode))
return 0;
- kill = dentry_needs_remove_privs(dentry);
- if (kill < 0)
- return kill;
- if (kill)
- error = __remove_privs(file_mnt_user_ns(file), dentry, kill);
+ return dentry_needs_remove_privs(dentry);
+}
+
+int do_file_remove_privs(struct file *file, struct inode *inode,
+ struct dentry *dentry, int kill)
+{
+ int error = 0;
+
+ error = __remove_privs(file_mnt_user_ns(file), dentry, kill);
if (!error)
inode_has_no_xattr(inode);
return error;
}
+
+/*
+ * Remove special file privileges (suid, capabilities) when file is written
+ * to or truncated.
+ */
+int file_remove_privs(struct file *file)
+{
+ struct dentry *dentry = file_dentry(file);
+ struct inode *inode = file_inode(file);
+ int kill;
+
+ kill = need_file_remove_privs(inode, dentry);
+ if (kill <= 0)
+ return kill;
+
+ return do_file_remove_privs(file, inode, dentry, kill);
+}
EXPORT_SYMBOL(file_remove_privs);
/**
@@ -2094,14 +2106,22 @@ EXPORT_SYMBOL(file_update_time);
int file_modified(struct file *file)
{
int err;
+ int ret;
+ struct dentry *dentry = file_dentry(file);
+ struct inode *inode = file_inode(file);
/*
* Clear the security bits if the process is not being run by root.
* This keeps people from modifying setuid and setgid binaries.
*/
- err = file_remove_privs(file);
- if (err)
- return err;
+ ret = need_file_remove_privs(inode, dentry);
+ if (ret < 0) {
+ return ret;
+ } else if (ret > 0) {
+ ret = do_file_remove_privs(file, inode, dentry, ret);
+ if (ret)
+ return ret;
+ }
if (unlikely(file->f_mode & FMODE_NOCMTIME))
return 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3b479d02e210..c2992d12601f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2381,6 +2381,9 @@ static inline void file_accessed(struct file *file)
touch_atime(&file->f_path);
}
+extern int need_file_remove_privs(struct inode *inode, struct dentry *dentry);
+extern int do_file_remove_privs(struct file *file, struct inode *inode,
+ struct dentry *dentry, int kill);
extern int file_modified(struct file *file);
int sync_inode_metadata(struct inode *inode, int wait);
--
2.30.2
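For illustration, a nowait-aware caller uses the split roughly as follows (this is essentially what patch 10 in this series does):

	ret = need_file_remove_privs(inode, dentry);
	if (ret < 0)
		return ret;
	if (ret > 0) {
		if (flags & IOCB_NOWAIT)
			return -EAGAIN;		/* punt to the io-worker */
		ret = do_file_remove_privs(file, inode, dentry, ret);
		if (ret)
			return ret;
	}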
* [RFC PATCH v1 08/18] fs: split off need_file_update_time and do_file_update_time
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
This splits off the functions need_file_update_time() and
do_file_update_time() from the function file_update_time().
This is required to support async buffered writes.
No intended functional changes in this patch.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/inode.c | 72 ++++++++++++++++++++++++++++++----------------
include/linux/fs.h | 5 ++++
2 files changed, 52 insertions(+), 25 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 79c5702ef7c3..64047bb0b9f8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2054,35 +2054,22 @@ int file_remove_privs(struct file *file)
}
EXPORT_SYMBOL(file_remove_privs);
-/**
- * file_update_time - update mtime and ctime time
- * @file: file accessed
- *
- * Update the mtime and ctime members of an inode and mark the inode
- * for writeback. Note that this function is meant exclusively for
- * usage in the file write path of filesystems, and filesystems may
- * choose to explicitly ignore update via this function with the
- * S_NOCMTIME inode flag, e.g. for network filesystem where these
- * timestamps are handled by the server. This can return an error for
- * file systems who need to allocate space in order to update an inode.
- */
-
-int file_update_time(struct file *file)
+int need_file_update_time(struct inode *inode, struct file *file,
+ struct timespec64 *now)
{
- struct inode *inode = file_inode(file);
- struct timespec64 now;
int sync_it = 0;
- int ret;
+
+ if (unlikely(file->f_mode & FMODE_NOCMTIME))
+ return 0;
/* First try to exhaust all avenues to not sync */
if (IS_NOCMTIME(inode))
return 0;
- now = current_time(inode);
- if (!timespec64_equal(&inode->i_mtime, &now))
+ if (!timespec64_equal(&inode->i_mtime, now))
sync_it = S_MTIME;
- if (!timespec64_equal(&inode->i_ctime, &now))
+ if (!timespec64_equal(&inode->i_ctime, now))
sync_it |= S_CTIME;
if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
@@ -2091,24 +2078,58 @@ int file_update_time(struct file *file)
if (!sync_it)
return 0;
+ return sync_it;
+}
+
+int do_file_update_time(struct inode *inode, struct file *file,
+ struct timespec64 *now, int sync_mode)
+{
+ int ret;
+
/* Finally allowed to write? Takes lock. */
if (__mnt_want_write_file(file))
return 0;
- ret = inode_update_time(inode, &now, sync_it);
+ ret = inode_update_time(inode, now, sync_mode);
__mnt_drop_write_file(file);
return ret;
}
+
+/**
+ * file_update_time - update mtime and ctime time
+ * @file: file accessed
+ *
+ * Update the mtime and ctime members of an inode and mark the inode
+ * for writeback. Note that this function is meant exclusively for
+ * usage in the file write path of filesystems, and filesystems may
+ * choose to explicitly ignore update via this function with the
+ * S_NOCMTIME inode flag, e.g. for network filesystem where these
+ * timestamps are handled by the server. This can return an error for
+ * file systems who need to allocate space in order to update an inode.
+ */
+
+int file_update_time(struct file *file)
+{
+ int err;
+ struct inode *inode = file_inode(file);
+ struct timespec64 now = current_time(inode);
+
+ err = need_file_update_time(inode, file, &now);
+ if (err < 0)
+ return err;
+
+ return do_file_update_time(inode, file, &now, err);
+}
EXPORT_SYMBOL(file_update_time);
/* Caller must hold the file's inode lock */
int file_modified(struct file *file)
{
- int err;
int ret;
struct dentry *dentry = file_dentry(file);
struct inode *inode = file_inode(file);
+ struct timespec64 now = current_time(inode);
/*
* Clear the security bits if the process is not being run by root.
@@ -2123,10 +2144,11 @@ int file_modified(struct file *file)
return ret;
}
- if (unlikely(file->f_mode & FMODE_NOCMTIME))
- return 0;
+ ret = need_file_update_time(inode, file, &now);
+ if (ret <= 0)
+ return ret;
- return file_update_time(file);
+ return do_file_update_time(inode, file, &now, ret);
}
EXPORT_SYMBOL(file_modified);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c2992d12601f..e268a1a50357 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2384,6 +2384,11 @@ static inline void file_accessed(struct file *file)
extern int need_file_remove_privs(struct inode *inode, struct dentry *dentry);
extern int do_file_remove_privs(struct file *file, struct inode *inode,
struct dentry *dentry, int kill);
+extern int need_file_update_time(struct inode *inode, struct file *file,
+ struct timespec64 *now);
+extern int do_file_update_time(struct inode *inode, struct file *file,
+ struct timespec64 *now, int sync_mode);
+
extern int file_modified(struct file *file);
int sync_inode_metadata(struct inode *inode, int wait);
--
2.30.2
* [RFC PATCH v1 09/18] fs: add pending file update time flag.
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
This introduces an optimization for the update time flag and async
buffered writes. While an update of the file modification time is
pending and is handled by the workers, concurrent writes do not need
to wait for this time update to complete.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/inode.c | 1 +
include/linux/fs.h | 3 +++
2 files changed, 4 insertions(+)
diff --git a/fs/inode.c b/fs/inode.c
index 64047bb0b9f8..f6d9877c2bb8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2091,6 +2091,7 @@ int do_file_update_time(struct inode *inode, struct file *file,
return 0;
ret = inode_update_time(inode, now, sync_mode);
+ inode->i_flags &= ~S_PENDING_TIME;
__mnt_drop_write_file(file);
return ret;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e268a1a50357..dc9060c0d629 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2141,6 +2141,8 @@ struct super_operations {
#define S_CASEFOLD (1 << 15) /* Casefolded file */
#define S_VERITY (1 << 16) /* Verity file (using fs/verity/) */
#define S_KERNEL_FILE (1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */
+#define S_PENDING_TIME (1 << 18) /* File update time is pending */
+
/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -2183,6 +2185,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
#define IS_ENCRYPTED(inode) ((inode)->i_flags & S_ENCRYPTED)
#define IS_CASEFOLDED(inode) ((inode)->i_flags & S_CASEFOLD)
#define IS_VERITY(inode) ((inode)->i_flags & S_VERITY)
+#define IS_PENDING_TIME(inode) ((inode)->i_flags & S_PENDING_TIME)
#define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \
(inode)->i_rdev == WHITEOUT_DEV)
--
2.30.2
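The intended use (see patch 10) is that only the first nowait writer punts to the io-worker to perform the timestamp update, while concurrent nowait writers observe the pending flag and continue; roughly:

	if (flags & IOCB_NOWAIT) {
		if (IS_PENDING_TIME(inode))
			return 0;		/* an update is already on its way */
		inode->i_flags |= S_PENDING_TIME;
		return -EAGAIN;			/* the io-worker performs the update */
	}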
* [RFC PATCH v1 10/18] xfs: Enable async write file modification handling.
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
This modifies xfs write checks to return -EAGAIN if the request either
requires to remove privileges or needs to update the file modification
time. This is required for async buffered writes, so the request gets
handled in the io worker.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/xfs/xfs_file.c | 39 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 38 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5bddb1e9e0b3..6f9da1059e8b 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -299,6 +299,43 @@ xfs_file_read_iter(
return ret;
}
+static int xfs_file_modified(struct file *file, int flags)
+{
+ int ret;
+ struct dentry *dentry = file_dentry(file);
+ struct inode *inode = file_inode(file);
+ struct timespec64 now = current_time(inode);
+
+ /*
+ * Clear the security bits if the process is not being run by root.
+ * This keeps people from modifying setuid and setgid binaries.
+ */
+ ret = need_file_remove_privs(inode, dentry);
+ if (ret < 0) {
+ return ret;
+ } else if (ret > 0) {
+ if (flags & IOCB_NOWAIT)
+ return -EAGAIN;
+
+ ret = do_file_remove_privs(file, inode, dentry, ret);
+ if (ret)
+ return ret;
+ }
+
+ ret = need_file_update_time(inode, file, &now);
+ if (ret <= 0)
+ return ret;
+ if (flags & IOCB_NOWAIT) {
+ if (IS_PENDING_TIME(inode))
+ return 0;
+
+ inode->i_flags |= S_PENDING_TIME;
+ return -EAGAIN;
+ }
+
+ return do_file_update_time(inode, file, &now, ret);
+}
+
/*
* Common pre-write limit and setup checks.
*
@@ -410,7 +447,7 @@ xfs_file_write_checks(
spin_unlock(&ip->i_flags_lock);
out:
- return file_modified(file);
+ return xfs_file_modified(file, iocb->ki_flags);
}
static int
--
2.30.2
* Re: [RFC PATCH v1 10/18] xfs: Enable async write file modification handling.
From: Dave Chinner @ 2022-04-26 22:55 UTC (permalink / raw)
To: Stefan Roesch; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
On Tue, Apr 26, 2022 at 10:43:27AM -0700, Stefan Roesch wrote:
> This modifies xfs write checks to return -EAGAIN if the request either
> requires to remove privileges or needs to update the file modification
> time. This is required for async buffered writes, so the request gets
> handled in the io worker.
>
> Signed-off-by: Stefan Roesch <[email protected]>
> ---
> fs/xfs/xfs_file.c | 39 ++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 38 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5bddb1e9e0b3..6f9da1059e8b 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -299,6 +299,43 @@ xfs_file_read_iter(
> return ret;
> }
>
> +static int xfs_file_modified(struct file *file, int flags)
> +{
> + int ret;
> + struct dentry *dentry = file_dentry(file);
> + struct inode *inode = file_inode(file);
> + struct timespec64 now = current_time(inode);
> +
> + /*
> + * Clear the security bits if the process is not being run by root.
> + * This keeps people from modifying setuid and setgid binaries.
> + */
> + ret = need_file_remove_privs(inode, dentry);
> + if (ret < 0) {
> + return ret;
> + } else if (ret > 0) {
> + if (flags & IOCB_NOWAIT)
> + return -EAGAIN;
> +
> + ret = do_file_remove_privs(file, inode, dentry, ret);
> + if (ret)
> + return ret;
> + }
> +
> + ret = need_file_update_time(inode, file, &now);
> + if (ret <= 0)
> + return ret;
> + if (flags & IOCB_NOWAIT) {
> + if (IS_PENDING_TIME(inode))
> + return 0;
> +
> + inode->i_flags |= S_PENDING_TIME;
> + return -EAGAIN;
> + }
> +
> + return do_file_update_time(inode, file, &now, ret);
> +}
This has nothing XFS specific in it. It belongs in the VFS code.
-Dave.
--
Dave Chinner
[email protected]
* Re: [RFC PATCH v1 10/18] xfs: Enable async write file modification handling.
From: Christian Brauner @ 2022-04-27 12:07 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david
On Tue, Apr 26, 2022 at 10:43:27AM -0700, Stefan Roesch wrote:
> This modifies xfs write checks to return -EAGAIN if the request either
> requires to remove privileges or needs to update the file modification
> time. This is required for async buffered writes, so the request gets
> handled in the io worker.
>
> Signed-off-by: Stefan Roesch <[email protected]>
> ---
> fs/xfs/xfs_file.c | 39 ++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 38 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5bddb1e9e0b3..6f9da1059e8b 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -299,6 +299,43 @@ xfs_file_read_iter(
> return ret;
> }
>
> +static int xfs_file_modified(struct file *file, int flags)
This should probably be in fs/inode.c as:
int file_modified_async(struct file *file, int flags)
and then file_modified() can simply become:
int file_modified(struct file *file)
{
return file_modified_async(file, 0);
}
or even:
int file_modified_async(struct file *file, bool async)
int file_modified(struct file *file)
{
return file_modified_async(file, false);
}
to avoid piecing this together specifically in xfs (as Dave mentioned
this is all pretty generic).
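Putting the two review comments together, the generic helper could look roughly like the sketch below, assembled from the xfs_file_modified() body above; the names follow the proposal and may change in the next revision:

int file_modified_async(struct file *file, int flags)
{
	struct dentry *dentry = file_dentry(file);
	struct inode *inode = file_inode(file);
	struct timespec64 now = current_time(inode);
	int ret;

	/* Clear suid/sgid if needed; nowait callers punt to the io-worker. */
	ret = need_file_remove_privs(inode, dentry);
	if (ret < 0)
		return ret;
	if (ret > 0) {
		if (flags & IOCB_NOWAIT)
			return -EAGAIN;
		ret = do_file_remove_privs(file, inode, dentry, ret);
		if (ret)
			return ret;
	}

	/* Timestamp update: only the first nowait writer punts. */
	ret = need_file_update_time(inode, file, &now);
	if (ret <= 0)
		return ret;
	if (flags & IOCB_NOWAIT) {
		if (IS_PENDING_TIME(inode))
			return 0;
		inode->i_flags |= S_PENDING_TIME;
		return -EAGAIN;
	}

	return do_file_update_time(inode, file, &now, ret);
}

int file_modified(struct file *file)
{
	return file_modified_async(file, 0);
}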
* [RFC PATCH v1 11/18] xfs: add async buffered write support
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
This adds async buffered write support to XFS. For async buffered write
requests, the request returns -EAGAIN if the iolock cannot be obtained
immediately.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/xfs/xfs_file.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 6f9da1059e8b..49d54b939502 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -739,12 +739,14 @@ xfs_file_buffered_write(
bool cleared_space = false;
int iolock;
- if (iocb->ki_flags & IOCB_NOWAIT)
- return -EOPNOTSUPP;
-
write_retry:
iolock = XFS_IOLOCK_EXCL;
- xfs_ilock(ip, iolock);
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!xfs_ilock_nowait(ip, iolock))
+ return -EAGAIN;
+ } else {
+ xfs_ilock(ip, iolock);
+ }
ret = xfs_file_write_checks(iocb, from, &iolock);
if (ret)
--
2.30.2
* [RFC PATCH v1 12/18] io_uring: add support for async buffered writes
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
This enables async buffered writes in io-uring for the filesystems that
support them. Buffered writes are attempted inline for blocks that are
already in the page cache or that can be acquired without blocking on I/O.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/io_uring.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7625b29153b9..2ba615374ba4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3746,7 +3746,7 @@ static inline int io_iter_do_read(struct io_kiocb *req, struct iov_iter *iter)
return -EINVAL;
}
-static bool need_read_all(struct io_kiocb *req)
+static bool need_complete_io(struct io_kiocb *req)
{
return req->flags & REQ_F_ISREG ||
S_ISBLK(file_inode(req->file)->i_mode);
@@ -3874,7 +3874,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
} else if (ret == -EIOCBQUEUED) {
goto out_free;
} else if (ret == req->result || ret <= 0 || !force_nonblock ||
- (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) {
+ (req->flags & REQ_F_NOWAIT) || !need_complete_io(req)) {
/* read all, failed, already did sync or don't want to retry */
goto done;
}
@@ -3970,9 +3970,10 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
if (unlikely(!io_file_supports_nowait(req)))
goto copy_iov;
- /* file path doesn't support NOWAIT for non-direct_IO */
- if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT) &&
- (req->flags & REQ_F_ISREG))
+ /* File path supports NOWAIT for non-direct_IO only for block devices. */
+ if (!(kiocb->ki_flags & IOCB_DIRECT) &&
+ !(kiocb->ki_filp->f_mode & FMODE_BUF_WASYNC) &&
+ (req->flags & REQ_F_ISREG))
goto copy_iov;
kiocb->ki_flags |= IOCB_NOWAIT;
@@ -4026,6 +4027,24 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
/* IOPOLL retry should happen for io-wq threads */
if (ret2 == -EAGAIN && (req->ctx->flags & IORING_SETUP_IOPOLL))
goto copy_iov;
+
+ if (ret2 != req->result && ret2 >= 0 && need_complete_io(req)) {
+ struct io_async_rw *rw;
+
+ /* This is a partial write. The file pos has already been
+ * updated, setup the async struct to complete the request
+ * in the worker. Also update bytes_done to account for
+ * the bytes already written.
+ */
+ iov_iter_save_state(&s->iter, &s->iter_state);
+ ret = io_setup_async_rw(req, iovec, s, true);
+
+ rw = req->async_data;
+ if (rw)
+ rw->bytes_done += ret2;
+
+ return ret ? ret : -EAGAIN;
+ }
done:
kiocb_done(req, ret2, issue_flags);
} else {
--
2.30.2
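From user space the API does not change; the difference is that a plain buffered write such as the sketch below (assumes liburing) can now complete inline from the submitting task when the data is already in the page cache, instead of always being punted to an io-worker:

#include <fcntl.h>
#include <unistd.h>
#include <liburing.h>

int buffered_write(const char *path, const void *buf, unsigned int len)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int fd, ret;

	fd = open(path, O_WRONLY | O_CREAT, 0644);
	if (fd < 0)
		return -1;

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret < 0) {
		close(fd);
		return ret;
	}

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, len, 0);	/* buffered: no O_DIRECT */
	io_uring_submit(&ring);

	ret = io_uring_wait_cqe(&ring, &cqe);
	if (!ret) {
		ret = cqe->res;			/* bytes written or -errno */
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	close(fd);
	return ret;
}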
* [RFC PATCH v1 13/18] io_uring: add tracepoint for short writes
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
This adds the io_uring_short_write tracepoint to io_uring. A short write
occurs when not all pages required for a write are in the page cache and
the async buffered write has to return -EAGAIN.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/io_uring.c | 3 +++
include/trace/events/io_uring.h | 25 +++++++++++++++++++++++++
2 files changed, 28 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2ba615374ba4..ace3a5cdda68 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4031,6 +4031,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
if (ret2 != req->result && ret2 >= 0 && need_complete_io(req)) {
struct io_async_rw *rw;
+ trace_io_uring_short_write(req->ctx, kiocb->ki_pos - ret2,
+ req->result, ret2);
+
/* This is a partial write. The file pos has already been
* updated, setup the async struct to complete the request
* in the worker. Also update bytes_done to account for
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index cddf5b6fbeb4..661834361d33 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -543,6 +543,31 @@ TRACE_EVENT(io_uring_req_failed,
(unsigned long long) __entry->pad2, __entry->error)
);
+TRACE_EVENT(io_uring_short_write,
+
+ TP_PROTO(void *ctx, u64 fpos, u64 wanted, u64 got),
+
+ TP_ARGS(ctx, fpos, wanted, got),
+
+ TP_STRUCT__entry(
+ __field(void *, ctx)
+ __field(u64, fpos)
+ __field(u64, wanted)
+ __field(u64, got)
+ ),
+
+ TP_fast_assign(
+ __entry->ctx = ctx;
+ __entry->fpos = fpos;
+ __entry->wanted = wanted;
+ __entry->got = got;
+ ),
+
+ TP_printk("ring %p, fpos %lld, wanted %lld, got %lld",
+ __entry->ctx, __entry->fpos,
+ __entry->wanted, __entry->got)
+);
+
#endif /* _TRACE_IO_URING_H */
/* This part must be outside protection */
--
2.30.2
* [RFC PATCH v1 14/18] sched: add new fields to task_struct
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
Add two new fields to the task_struct to support async
write throttling.
- One field to store how long writes are throttled: bdp_pause
- The other field to store the number of dirtied pages:
bdp_nr_dirtied_pause
Signed-off-by: Stefan Roesch <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/fork.c | 1 +
2 files changed, 4 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a8911b1f35aa..98d70f2945e3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1327,6 +1327,9 @@ struct task_struct {
/* Start of a write-and-pause period: */
unsigned long dirty_paused_when;
+ unsigned long bdp_pause;
+ int bdp_nr_dirtied_pause;
+
#ifdef CONFIG_LATENCYTOP
int latency_record_count;
struct latency_record latency_record[LT_SAVECOUNT];
diff --git a/kernel/fork.c b/kernel/fork.c
index 9796897560ab..5bc6298827fc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2316,6 +2316,7 @@ static __latent_entropy struct task_struct *copy_process(
p->nr_dirtied = 0;
p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
p->dirty_paused_when = 0;
+ p->bdp_nr_dirtied_pause = -1;
p->pdeath_signal = 0;
INIT_LIST_HEAD(&p->thread_group);
--
2.30.2
* [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
This change adds support for async write throttling in the function
balance_dirty_pages(). So far if throttling was required, the code was
waiting synchronously as long as the writes were throttled. This change
introduces asynchronous throttling. Instead of waiting in the function
balance_dirty_pages(), the timeout is set in the task_struct field
bdp_pause. Once the timeout has expired, the writes are no longer
throttled.
- Add a new parameter to the balance_dirty_pages() function
- This allows the caller to pass in the nowait flag
- When the nowait flag is specified, the code does not wait in
balance_dirty_pages(), but instead stores the wait expiration in the
new task_struct field bdp_pause.
- The function balance_dirty_pages_ratelimited() resets the new values
in the task_struct, once the timeout has expired
This change is required to support write throttling for the async
buffered writes. While the writes are throttled, io_uring still can make
progress with processing other requests.
Signed-off-by: Stefan Roesch <[email protected]>
---
include/linux/writeback.h | 1 +
mm/page-writeback.c | 54 ++++++++++++++++++++++++++++-----------
2 files changed, 40 insertions(+), 15 deletions(-)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index fec248ab1fec..48176a8047db 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -373,6 +373,7 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
void wb_update_bandwidth(struct bdi_writeback *wb);
void balance_dirty_pages_ratelimited(struct address_space *mapping);
+void balance_dirty_pages_ratelimited_flags(struct address_space *mapping, bool is_async);
bool wb_over_bg_thresh(struct bdi_writeback *wb);
typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7e2da284e427..a62aa8a4c2f2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1546,7 +1546,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
* perform some writeout.
*/
static void balance_dirty_pages(struct bdi_writeback *wb,
- unsigned long pages_dirtied)
+ unsigned long pages_dirtied, bool is_async)
{
struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) };
@@ -1780,6 +1780,14 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
period,
pause,
start_time);
+ if (is_async) {
+ if (current->bdp_nr_dirtied_pause == -1) {
+ current->bdp_pause = now + pause;
+ current->bdp_nr_dirtied_pause = nr_dirtied_pause;
+ }
+ break;
+ }
+
__set_current_state(TASK_KILLABLE);
wb->dirty_sleep = now;
io_schedule_timeout(pause);
@@ -1787,6 +1795,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
current->dirty_paused_when = now + pause;
current->nr_dirtied = 0;
current->nr_dirtied_pause = nr_dirtied_pause;
+ current->bdp_nr_dirtied_pause = -1;
+ current->bdp_pause = 0;
/*
* This is typically equal to (dirty < thresh) and can also
@@ -1851,19 +1861,7 @@ static DEFINE_PER_CPU(int, bdp_ratelimits);
*/
DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
-/**
- * balance_dirty_pages_ratelimited - balance dirty memory state
- * @mapping: address_space which was dirtied
- *
- * Processes which are dirtying memory should call in here once for each page
- * which was newly dirtied. The function will periodically check the system's
- * dirty state and will initiate writeback if needed.
- *
- * Once we're over the dirty memory limit we decrease the ratelimiting
- * by a lot, to prevent individual processes from overshooting the limit
- * by (ratelimit_pages) each.
- */
-void balance_dirty_pages_ratelimited(struct address_space *mapping)
+void balance_dirty_pages_ratelimited_flags(struct address_space *mapping, bool is_async)
{
struct inode *inode = mapping->host;
struct backing_dev_info *bdi = inode_to_bdi(inode);
@@ -1874,6 +1872,15 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
if (!(bdi->capabilities & BDI_CAP_WRITEBACK))
return;
+ if (current->bdp_nr_dirtied_pause != -1 && time_after(jiffies, current->bdp_pause)) {
+ current->dirty_paused_when = current->bdp_pause;
+ current->nr_dirtied = 0;
+ current->nr_dirtied_pause = current->bdp_nr_dirtied_pause;
+
+ current->bdp_nr_dirtied_pause = -1;
+ current->bdp_pause = 0;
+ }
+
if (inode_cgwb_enabled(inode))
wb = wb_get_create_current(bdi, GFP_KERNEL);
if (!wb)
@@ -1912,10 +1919,27 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
preempt_enable();
if (unlikely(current->nr_dirtied >= ratelimit))
- balance_dirty_pages(wb, current->nr_dirtied);
+ balance_dirty_pages(wb, current->nr_dirtied, is_async);
wb_put(wb);
}
+
+/**
+ * balance_dirty_pages_ratelimited - balance dirty memory state
+ * @mapping: address_space which was dirtied
+ *
+ * Processes which are dirtying memory should call in here once for each page
+ * which was newly dirtied. The function will periodically check the system's
+ * dirty state and will initiate writeback if needed.
+ *
+ * Once we're over the dirty memory limit we decrease the ratelimiting
+ * by a lot, to prevent individual processes from overshooting the limit
+ * by (ratelimit_pages) each.
+ */
+void balance_dirty_pages_ratelimited(struct address_space *mapping)
+{
+ balance_dirty_pages_ratelimited_flags(mapping, false);
+}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
/**
--
2.30.2
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes
2022-04-26 17:43 ` [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes Stefan Roesch
@ 2022-04-28 17:47 ` Jan Kara
2022-04-28 20:16 ` Stefan Roesch
0 siblings, 1 reply; 31+ messages in thread
From: Jan Kara @ 2022-04-28 17:47 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david
On Tue 26-04-22 10:43:32, Stefan Roesch wrote:
> This change adds support for async write throttling in the function
> balance_dirty_pages(). So far if throttling was required, the code was
> waiting synchronously as long as the writes were throttled. This change
> introduces asynchronous throttling. Instead of waiting in the function
> balance_dirty_pages(), the timeout is set in the task_struct field
> bdp_pause. Once the timeout has expired, the writes are no longer
> throttled.
>
> - Add a new parameter to the balance_dirty_pages() function
> - This allows the caller to pass in the nowait flag
> - When the nowait flag is specified, the code does not wait in
> balance_dirty_pages(), but instead stores the wait expiration in the
> new task_struct field bdp_pause.
>
> - The function balance_dirty_pages_ratelimited() resets the new values
> in the task_struct, once the timeout has expired
>
> This change is required to support write throttling for the async
> buffered writes. While the writes are throttled, io_uring still can make
> progress with processing other requests.
>
> Signed-off-by: Stefan Roesch <[email protected]>
Maybe I'm missing something, but I don't think this will throttle writers enough.
For three reasons:
1) The calculated throttling pauses should accumulate for the task so that
if we compute that say it takes 0.1s to write 100 pages and the task writes
300 pages, the delay adds up to 0.3s properly. Otherwise the task would not
be throttled as long as we expect the writeback to take.
2) We must not allow the amount of dirty pages to exceed the dirty limit.
That can easily lead to page reclaim getting into trouble reclaiming pages
and thus machine stalls, oom kills etc. So if we are coming close to dirty
limit and we cannot sleep, we must just fail the nowait write.
3) Even with above two problems fixed I suspect results will be suboptimal
because balance_dirty_pages() heuristics assume they get called reasonably
often and throttle writes so if amount of dirty pages is coming close to
dirty limit, they think we are overestimating writeback speed and update
throttling parameters accordingly. So if io_uring code does not throttle
writers often enough, I think dirty throttling parameters will be jumping
wildly resulting in poor behavior.
So what I'd probably suggest is that if balance_dirty_pages() is called in
"async" mode, we'd give tasks a pass until dirty_freerun_ceiling(). If
balance_dirty_pages() decides the task needs to wait, we store the pause
and bail all the way up into the place where we can sleep (io_uring code I
assume), sleep there, and then continue doing write.
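Roughly something like the following (completely untested, just to illustrate
the idea; it assumes balance_dirty_pages() gets changed to return the pause,
or an error, to the caller instead of sleeping when called in async mode):

	/* async / nowait case inside balance_dirty_pages() -- sketch only */
	if (is_async) {
		if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh))
			return 0;	/* below the freerun ceiling: no throttling */
		if (dirty >= thresh)
			return -EAGAIN;	/* too close to the dirty limit: fail the nowait write */
		return pause;		/* bail out; the caller sleeps 'pause' where it safely can */
	}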
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes
2022-04-28 17:47 ` Jan Kara
@ 2022-04-28 20:16 ` Stefan Roesch
[not found] ` <[email protected]>
0 siblings, 1 reply; 31+ messages in thread
From: Stefan Roesch @ 2022-04-28 20:16 UTC (permalink / raw)
To: Jan Kara; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david
On 4/28/22 10:47 AM, Jan Kara wrote:
> On Tue 26-04-22 10:43:32, Stefan Roesch wrote:
>> This change adds support for async write throttling in the function
>> balance_dirty_pages(). So far if throttling was required, the code was
>> waiting synchronously as long as the writes were throttled. This change
>> introduces asynchronous throttling. Instead of waiting in the function
>> balance_dirty_pages(), the timeout is set in the task_struct field
>> bdp_pause. Once the timeout has expired, the writes are no longer
>> throttled.
>>
>> - Add a new parameter to the balance_dirty_pages() function
>> - This allows the caller to pass in the nowait flag
>> - When the nowait flag is specified, the code does not wait in
>> balance_dirty_pages(), but instead stores the wait expiration in the
>> new task_struct field bdp_pause.
>>
>> - The function balance_dirty_pages_ratelimited() resets the new values
>> in the task_struct, once the timeout has expired
>>
>> This change is required to support write throttling for the async
>> buffered writes. While the writes are throttled, io_uring still can make
>> progress with processing other requests.
>>
>> Signed-off-by: Stefan Roesch <[email protected]>
>
> Maybe I miss something but I don't think this will throttle writers enough.
> For three reasons:
>
> 1) The calculated throttling pauses should accumulate for the task so that
> if we compute that say it takes 0.1s to write 100 pages and the task writes
> 300 pages, the delay adds up to 0.3s properly. Otherwise the task would not
> be throttled as long as we expect the writeback to take.
>
> 2) We must not allow the amount of dirty pages to exceed the dirty limit.
> That can easily lead to page reclaim getting into trouble reclaiming pages
> and thus machine stalls, oom kills etc. So if we are coming close to dirty
> limit and we cannot sleep, we must just fail the nowait write.
>
> 3) Even with above two problems fixed I suspect results will be suboptimal
> because balance_dirty_pages() heuristics assume they get called reasonably
> often and throttle writes so if amount of dirty pages is coming close to
> dirty limit, they think we are overestimating writeback speed and update
> throttling parameters accordingly. So if io_uring code does not throttle
> writers often enough, I think dirty throttling parameters will be jumping
> wildly resulting in poor behavior.
>
> So what I'd probably suggest is that if balance_dirty_pages() is called in
> "async" mode, we'd give tasks a pass until dirty_freerun_ceiling(). If
> balance_dirty_pages() decides the task needs to wait, we store the pause
> and bail all the way up into the place where we can sleep (io_uring code I
> assume), sleep there, and then continue doing write.
>
Jan, thanks for the feedback. Are you suggesting to change the following check
in the function balance_dirty_pages():
/*
* Throttle it only when the background writeback cannot
* catch-up. This avoids (excessively) small writeouts
* when the wb limits are ramping up in case of !strictlimit.
*
* In strictlimit case make decision based on the wb counters
* and limits. Small writeouts when the wb limits are ramping
* up are the price we consciously pay for strictlimit-ing.
*
* If memcg domain is in effect, @dirty should be under
* both global and memcg freerun ceilings.
*/
if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) &&
(!mdtc ||
m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) {
unsigned long intv;
unsigned long m_intv;
to also check whether we are in async mode?
There is no direct way to return to the caller that the process should sleep.
Instead, two new fields are introduced in the task_struct. These two fields are
then used in io_uring to determine whether the writes for a task need to be
throttled.
In case the writes need to be throttled, the writes are not issued, but instead
inserted on a wait queue. We cannot sleep in the general io_uring code path as
we still want to process other requests which are not affected by the throttling.
> Honza
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* [RFC PATCH v1 16/18] iomap: Use throttling for async buffered writes.
2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (14 preceding siblings ...)
2022-04-26 17:43 ` [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 17/18] io_uring: support write " Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 18/18] xfs: enable async buffered write support Stefan Roesch
17 siblings, 0 replies; 31+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
This enables throttling for async buffered writes.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/iomap/buffered-io.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 5c53a5715c85..0f0a6fe36699 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -843,7 +843,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
written += status;
length -= status;
- balance_dirty_pages_ratelimited(iter->inode->i_mapping);
+ balance_dirty_pages_ratelimited_flags(iter->inode->i_mapping,
+ (iter->flags & IOMAP_NOWAIT));
} while (iov_iter_count(i) && length);
return written ? written : status;
--
2.30.2
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [RFC PATCH v1 17/18] io_uring: support write throttling for async buffered writes
2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (15 preceding siblings ...)
2022-04-26 17:43 ` [RFC PATCH v1 16/18] iomap: User " Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 18/18] xfs: enable async buffered write support Stefan Roesch
17 siblings, 0 replies; 31+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
This hooks the process-level write throttling of the block layer into async
buffered writes in io_uring. In io_write() the code now checks whether the
write needs to be throttled. If so, it adds the request to the list of
pending io requests and starts a timer. After the timer expires, it submits
the list of pending writes.
- Add a new list called pending_ios for delayed (throttled) writes to
struct io_uring_task. The list is protected by the task_lock spinlock.
- Add a new timer to struct io_uring_task.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/io_uring.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 92 insertions(+), 7 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index ace3a5cdda68..4af654a82366 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -485,6 +485,11 @@ struct io_ring_ctx {
*/
#define IO_RINGFD_REG_MAX 16
+struct pending_list {
+ struct list_head list;
+ struct io_kiocb *req;
+};
+
struct io_uring_task {
/* submission side */
int cached_refs;
@@ -501,6 +506,9 @@ struct io_uring_task {
struct callback_head task_work;
struct file **registered_rings;
bool task_running;
+
+ struct pending_list pending_ios;
+ struct timer_list timer;
};
/*
@@ -1193,7 +1201,7 @@ static void io_rsrc_put_work(struct work_struct *work);
static void io_req_task_queue(struct io_kiocb *req);
static void __io_submit_flush_completions(struct io_ring_ctx *ctx);
-static int io_req_prep_async(struct io_kiocb *req);
+static int io_req_prep_async(struct io_kiocb *req, bool force);
static int io_install_fixed_file(struct io_kiocb *req, struct file *file,
unsigned int issue_flags, u32 slot_index);
@@ -1201,6 +1209,7 @@ static int io_close_fixed(struct io_kiocb *req, unsigned int issue_flags);
static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer);
static void io_eventfd_signal(struct io_ring_ctx *ctx);
+static void delayed_write_fn(struct timer_list *tmr);
static struct kmem_cache *req_cachep;
@@ -2599,6 +2608,32 @@ static void io_req_task_queue_reissue(struct io_kiocb *req)
io_req_task_work_add(req, false);
}
+static int io_req_task_queue_reissue_delayed(struct io_kiocb *req)
+{
+ struct io_uring_task *tctx = req->task->io_uring;
+ struct pending_list *pending;
+ bool empty;
+
+ pending = kmalloc(sizeof(struct pending_list), GFP_KERNEL);
+ if (!pending)
+ return -ENOMEM;
+ pending->req = req;
+
+ spin_lock_irq(&tctx->task_lock);
+ empty = list_empty(&tctx->pending_ios.list);
+ list_add_tail(&pending->list, &tctx->pending_ios.list);
+
+ if (empty) {
+ timer_setup(&tctx->timer, delayed_write_fn, 0);
+
+ tctx->timer.expires = current->bdp_pause;
+ add_timer(&tctx->timer);
+ }
+ spin_unlock_irq(&tctx->task_lock);
+
+ return 0;
+}
+
static inline void io_queue_next(struct io_kiocb *req)
{
struct io_kiocb *nxt = io_req_find_next(req);
@@ -2916,7 +2951,7 @@ static bool io_resubmit_prep(struct io_kiocb *req)
struct io_async_rw *rw = req->async_data;
if (!req_has_async_data(req))
- return !io_req_prep_async(req);
+ return !io_req_prep_async(req, false);
iov_iter_restore(&rw->s.iter, &rw->s.iter_state);
return true;
}
@@ -3938,6 +3973,38 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
return 0;
}
+static inline unsigned long write_delay(void)
+{
+ if (likely(current->bdp_nr_dirtied_pause == -1 ||
+ !time_before(jiffies, current->bdp_pause)))
+ return 0;
+
+ return current->bdp_pause;
+}
+
+static void delayed_write_fn(struct timer_list *tmr)
+{
+ struct io_uring_task *tctx = from_timer(tctx, tmr, timer);
+ struct list_head *curr;
+ struct list_head *next;
+ LIST_HEAD(pending_ios);
+
+ /* Move list to temporary list. */
+ spin_lock_irq(&tctx->task_lock);
+ list_splice_init(&tctx->pending_ios.list, &pending_ios);
+ spin_unlock_irq(&tctx->task_lock);
+
+ list_for_each_safe(curr, next, &pending_ios) {
+ struct pending_list *io;
+
+ io = list_entry(curr, struct pending_list, list);
+ io_req_task_queue_reissue(io->req);
+
+ list_del(curr);
+ kfree(io);
+ }
+}
+
static int io_write(struct io_kiocb *req, unsigned int issue_flags)
{
struct io_rw_state __s, *s = &__s;
@@ -3947,6 +4014,18 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
ssize_t ret, ret2;
loff_t *ppos;
+ /* Write throttling active? */
+ if (!(kiocb->ki_flags & IOCB_DIRECT) && unlikely(write_delay())) {
+ int ret = io_req_prep_async(req, true);
+
+ if (unlikely(ret))
+ io_req_complete_failed(req, ret);
+ else
+ ret = io_req_task_queue_reissue_delayed(req);
+
+ return ret;
+ }
+
if (!req_has_async_data(req)) {
ret = io_import_iovec(WRITE, req, &iovec, s, issue_flags);
if (unlikely(ret < 0))
@@ -6962,9 +7041,9 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return -EINVAL;
}
-static int io_req_prep_async(struct io_kiocb *req)
+static int io_req_prep_async(struct io_kiocb *req, bool force)
{
- if (!io_op_defs[req->opcode].needs_async_setup)
+ if (!force && !io_op_defs[req->opcode].needs_async_setup)
return 0;
if (WARN_ON_ONCE(req_has_async_data(req)))
return -EFAULT;
@@ -6974,6 +7053,10 @@ static int io_req_prep_async(struct io_kiocb *req)
switch (req->opcode) {
case IORING_OP_READV:
return io_rw_prep_async(req, READ);
+ case IORING_OP_WRITE:
+ if (!force)
+ break;
+ fallthrough;
case IORING_OP_WRITEV:
return io_rw_prep_async(req, WRITE);
case IORING_OP_SENDMSG:
@@ -6983,6 +7066,7 @@ static int io_req_prep_async(struct io_kiocb *req)
case IORING_OP_CONNECT:
return io_connect_prep_async(req);
}
+
printk_once(KERN_WARNING "io_uring: prep_async() bad opcode %d\n",
req->opcode);
return -EFAULT;
@@ -7016,7 +7100,7 @@ static __cold void io_drain_req(struct io_kiocb *req)
}
spin_unlock(&ctx->completion_lock);
- ret = io_req_prep_async(req);
+ ret = io_req_prep_async(req, false);
if (ret) {
fail:
io_req_complete_failed(req, ret);
@@ -7550,7 +7634,7 @@ static void io_queue_sqe_fallback(struct io_kiocb *req)
} else if (unlikely(req->ctx->drain_active)) {
io_drain_req(req);
} else {
- int ret = io_req_prep_async(req);
+ int ret = io_req_prep_async(req, false);
if (unlikely(ret))
io_req_complete_failed(req, ret);
@@ -7746,7 +7830,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
struct io_kiocb *head = link->head;
if (!(req->flags & REQ_F_FAIL)) {
- ret = io_req_prep_async(req);
+ ret = io_req_prep_async(req, false);
if (unlikely(ret)) {
req_fail_link_node(req, ret);
if (!(head->flags & REQ_F_FAIL))
@@ -9222,6 +9306,7 @@ static __cold int io_uring_alloc_task_context(struct task_struct *task,
INIT_WQ_LIST(&tctx->task_list);
INIT_WQ_LIST(&tctx->prior_task_list);
init_task_work(&tctx->task_work, tctx_task_work);
+ INIT_LIST_HEAD(&tctx->pending_ios.list);
return 0;
}
--
2.30.2
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [RFC PATCH v1 18/18] xfs: enable async buffered write support
2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
` (16 preceding siblings ...)
2022-04-26 17:43 ` [RFC PATCH v1 17/18] io_uring: support write " Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
17 siblings, 0 replies; 31+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david
This turns on the async buffered write support for XFS.
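With FMODE_BUF_WASYNC set, an async buffered write submitted through io_uring
on an XFS file can be completed from the nowait path instead of being punted
to the io-wq worker threads. A minimal user-space illustration (untested
sketch; liburing is assumed, and the file path and buffer size are arbitrary
examples):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	static char buf[4096];
	int fd;

	memset(buf, 'a', sizeof(buf));

	/* Buffered write path: the file is opened without O_DIRECT. */
	fd = open("/mnt/xfs/testfile", O_WRONLY | O_CREAT, 0644);
	if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
	io_uring_submit(&ring);

	if (!io_uring_wait_cqe(&ring, &cqe)) {
		printf("buffered write completed: res=%d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}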
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/xfs/xfs_file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 49d54b939502..7544dffaf032 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1210,7 +1210,7 @@ xfs_file_open(
return -EFBIG;
if (xfs_is_shutdown(XFS_M(inode->i_sb)))
return -EIO;
- file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
+ file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC;
return 0;
}
--
2.30.2
^ permalink raw reply related [flat|nested] 31+ messages in thread