* [PATCH v7 01/15] mm: Move starting of background writeback into the main balancing loop
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
@ 2022-06-01 21:01 ` Stefan Roesch
2022-06-01 21:01 ` [PATCH v7 02/15] mm: Move updates of dirty_exceeded into one place Stefan Roesch
` (14 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-06-01 21:01 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack, hch, axboe
From: Jan Kara <[email protected]>
We start background writeback if we are over the background threshold
after exiting the main loop in balance_dirty_pages(). This may base the
decision on stale values (we may have slept for a significant amount of
time), and it is also inconvenient for the refactoring needed for async
dirty throttling. Move the check into the main waiting loop.
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Stefan Roesch <[email protected]>
---
mm/page-writeback.c | 31 ++++++++++++++-----------------
1 file changed, 14 insertions(+), 17 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 55c2776ae699..e59c523aed1a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1627,6 +1627,19 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
}
}
+ /*
+ * In laptop mode, we wait until hitting the higher threshold
+ * before starting background writeout, and then write out all
+ * the way down to the lower threshold. So slow writers cause
+ * minimal disk activity.
+ *
+ * In normal mode, we start background writeout at the lower
+ * background_thresh, to keep the amount of dirty memory low.
+ */
+ if (!laptop_mode && nr_reclaimable > gdtc->bg_thresh &&
+ !writeback_in_progress(wb))
+ wb_start_background_writeback(wb);
+
/*
* Throttle it only when the background writeback cannot
* catch-up. This avoids (excessively) small writeouts
@@ -1657,6 +1670,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
break;
}
+ /* Start writeback even when in laptop mode */
if (unlikely(!writeback_in_progress(wb)))
wb_start_background_writeback(wb);
@@ -1823,23 +1837,6 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
if (!dirty_exceeded && wb->dirty_exceeded)
wb->dirty_exceeded = 0;
-
- if (writeback_in_progress(wb))
- return;
-
- /*
- * In laptop mode, we wait until hitting the higher threshold before
- * starting background writeout, and then write out all the way down
- * to the lower threshold. So slow writers cause minimal disk activity.
- *
- * In normal mode, we start background writeout at the lower
- * background_thresh, to keep the amount of dirty memory low.
- */
- if (laptop_mode)
- return;
-
- if (nr_reclaimable > gdtc->bg_thresh)
- wb_start_background_writeback(wb);
}
static DEFINE_PER_CPU(int, bdp_ratelimits);
--
2.30.2
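
The check that patch 01 moves into the loop can be modelled in userspace
roughly as follows. This is only a sketch: struct wb_snapshot and
should_start_background_writeback() are hypothetical names that stand in
for the kernel state (laptop_mode, writeback_in_progress(wb),
nr_reclaimable, gdtc->bg_thresh); they are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the values the loop looks at on each pass. */
struct wb_snapshot {
	bool laptop_mode;            /* stands in for the laptop_mode setting */
	bool writeback_in_progress;  /* stands in for writeback_in_progress(wb) */
	unsigned long nr_reclaimable;
	unsigned long bg_thresh;     /* stands in for gdtc->bg_thresh */
};

/* Returns true when the relocated check would start background writeback. */
static bool should_start_background_writeback(const struct wb_snapshot *s)
{
	assert(s != NULL);
	return !s->laptop_mode &&
	       s->nr_reclaimable > s->bg_thresh &&
	       !s->writeback_in_progress;
}
```

Because the predicate is evaluated on every pass of the loop, it sees the
values computed for that pass rather than values from before a sleep.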
* [PATCH v7 02/15] mm: Move updates of dirty_exceeded into one place
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
2022-06-01 21:01 ` [PATCH v7 01/15] mm: Move starting of background writeback into the main balancing loop Stefan Roesch
@ 2022-06-01 21:01 ` Stefan Roesch
2022-06-01 21:01 ` [PATCH v7 03/15] mm: Add balance_dirty_pages_ratelimited_flags() function Stefan Roesch
` (13 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-06-01 21:01 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack, hch, axboe
From: Jan Kara <[email protected]>
The transition of wb->dirty_exceeded from 0 to 1 happens before we go to
sleep in balance_dirty_pages(), while the transition from 1 to 0 happens
when exiting from balance_dirty_pages(), possibly based on old values.
This does not make much sense, since wb->dirty_exceeded should simply
reflect whether wb is over the dirty limit, in which case entry into
balance_dirty_pages() should be ratelimited less. Move the two updates
together.
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Stefan Roesch <[email protected]>
---
mm/page-writeback.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e59c523aed1a..90b1998c16a1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1729,8 +1729,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
sdtc = mdtc;
}
- if (dirty_exceeded && !wb->dirty_exceeded)
- wb->dirty_exceeded = 1;
+ if (dirty_exceeded != wb->dirty_exceeded)
+ wb->dirty_exceeded = dirty_exceeded;
if (time_is_before_jiffies(READ_ONCE(wb->bw_time_stamp) +
BANDWIDTH_INTERVAL))
@@ -1834,9 +1834,6 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
if (fatal_signal_pending(current))
break;
}
-
- if (!dirty_exceeded && wb->dirty_exceeded)
- wb->dirty_exceeded = 0;
}
static DEFINE_PER_CPU(int, bdp_ratelimits);
--
2.30.2
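
The single update point introduced by patch 02 can be sketched in
userspace as below. struct wb_sketch and update_dirty_exceeded() are
hypothetical names used only for illustration; the comparison before the
store mirrors the patch, and the reason given in the comment is an
assumption rather than something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one field of struct bdi_writeback used here. */
struct wb_sketch {
	bool dirty_exceeded;
};

/*
 * Make the stored flag follow the freshly computed value in both
 * directions. The store is skipped when nothing changed, presumably to
 * avoid writing to a shared field needlessly (an assumption).
 */
static void update_dirty_exceeded(struct wb_sketch *wb, bool dirty_exceeded)
{
	assert(wb != NULL);
	if (dirty_exceeded != wb->dirty_exceeded)
		wb->dirty_exceeded = dirty_exceeded;
}
```

With both transitions in one place, the flag is always derived from the
same set of values that the current loop pass just computed.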
* [PATCH v7 03/15] mm: Add balance_dirty_pages_ratelimited_flags() function
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
2022-06-01 21:01 ` [PATCH v7 01/15] mm: Move starting of background writeback into the main balancing loop Stefan Roesch
2022-06-01 21:01 ` [PATCH v7 02/15] mm: Move updates of dirty_exceeded into one place Stefan Roesch
@ 2022-06-01 21:01 ` Stefan Roesch
2022-06-01 21:01 ` [PATCH v7 04/15] iomap: Add flags parameter to iomap_page_create() Stefan Roesch
` (12 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-06-01 21:01 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack, hch, axboe, Christoph Hellwig
From: Jan Kara <[email protected]>
This adds the helper function balance_dirty_pages_ratelimited_flags().
It is a variant of balance_dirty_pages_ratelimited() that takes an
additional flags parameter, which is passed on to balance_dirty_pages().
For async buffered writes the flag value will be BDP_ASYNC.
If balance_dirty_pages() gets called for an async buffered write, we don't
want to wait. Instead we need to indicate to the caller that throttling
is needed so that it can stop writing and offload the rest of the write
to a context that can block.
The new helper function is also used by balance_dirty_pages_ratelimited().
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/writeback.h | 7 ++++++
mm/page-writeback.c | 48 +++++++++++++++++++++++++--------------
2 files changed, 38 insertions(+), 17 deletions(-)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index da21d63f70e2..b8c9610c2313 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -364,7 +364,14 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
void wb_update_bandwidth(struct bdi_writeback *wb);
+
+/* Invoke balance dirty pages in async mode. */
+#define BDP_ASYNC 0x0001
+
void balance_dirty_pages_ratelimited(struct address_space *mapping);
+int balance_dirty_pages_ratelimited_flags(struct address_space *mapping,
+ unsigned int flags);
+
bool wb_over_bg_thresh(struct bdi_writeback *wb);
typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 90b1998c16a1..684ab599438a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1554,8 +1554,8 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
* If we're over `background_thresh' then the writeback threads are woken to
* perform some writeout.
*/
-static void balance_dirty_pages(struct bdi_writeback *wb,
- unsigned long pages_dirtied)
+static int balance_dirty_pages(struct bdi_writeback *wb,
+ unsigned long pages_dirtied, unsigned int flags)
{
struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) };
@@ -1575,6 +1575,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
struct backing_dev_info *bdi = wb->bdi;
bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
unsigned long start_time = jiffies;
+ int ret = 0;
for (;;) {
unsigned long now = jiffies;
@@ -1803,6 +1804,10 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
period,
pause,
start_time);
+ if (flags & BDP_ASYNC) {
+ ret = -EAGAIN;
+ break;
+ }
__set_current_state(TASK_KILLABLE);
wb->dirty_sleep = now;
io_schedule_timeout(pause);
@@ -1834,6 +1839,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
if (fatal_signal_pending(current))
break;
}
+ return ret;
}
static DEFINE_PER_CPU(int, bdp_ratelimits);
@@ -1854,28 +1860,18 @@ static DEFINE_PER_CPU(int, bdp_ratelimits);
*/
DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
-/**
- * balance_dirty_pages_ratelimited - balance dirty memory state
- * @mapping: address_space which was dirtied
- *
- * Processes which are dirtying memory should call in here once for each page
- * which was newly dirtied. The function will periodically check the system's
- * dirty state and will initiate writeback if needed.
- *
- * Once we're over the dirty memory limit we decrease the ratelimiting
- * by a lot, to prevent individual processes from overshooting the limit
- * by (ratelimit_pages) each.
- */
-void balance_dirty_pages_ratelimited(struct address_space *mapping)
+int balance_dirty_pages_ratelimited_flags(struct address_space *mapping,
+ unsigned int flags)
{
struct inode *inode = mapping->host;
struct backing_dev_info *bdi = inode_to_bdi(inode);
struct bdi_writeback *wb = NULL;
int ratelimit;
+ int ret = 0;
int *p;
if (!(bdi->capabilities & BDI_CAP_WRITEBACK))
- return;
+ return ret;
if (inode_cgwb_enabled(inode))
wb = wb_get_create_current(bdi, GFP_KERNEL);
@@ -1915,9 +1911,27 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
preempt_enable();
if (unlikely(current->nr_dirtied >= ratelimit))
- balance_dirty_pages(wb, current->nr_dirtied);
+ ret = balance_dirty_pages(wb, current->nr_dirtied, flags);
wb_put(wb);
+ return ret;
+}
+
+/**
+ * balance_dirty_pages_ratelimited - balance dirty memory state
+ * @mapping: address_space which was dirtied
+ *
+ * Processes which are dirtying memory should call in here once for each page
+ * which was newly dirtied. The function will periodically check the system's
+ * dirty state and will initiate writeback if needed.
+ *
+ * Once we're over the dirty memory limit we decrease the ratelimiting
+ * by a lot, to prevent individual processes from overshooting the limit
+ * by (ratelimit_pages) each.
+ */
+void balance_dirty_pages_ratelimited(struct address_space *mapping)
+{
+ balance_dirty_pages_ratelimited_flags(mapping, 0);
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
--
2.30.2
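
The BDP_ASYNC behaviour described in patch 03 can be illustrated with a
small userspace model. BDP_ASYNC and its value are taken from the patch;
throttle_sketch(), pause_needed and slept are hypothetical names, and the
counter merely records where the blocking path would have slept.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define BDP_ASYNC 0x0001	/* same value as in the patch */

/*
 * pause_needed: nonzero when the writer should be throttled.
 * slept: incremented each time the blocking path would have slept.
 * Returns 0 normally, or -EAGAIN when an async caller must back off.
 */
static int throttle_sketch(unsigned int flags, int pause_needed, int *slept)
{
	assert(slept != NULL);
	if (!pause_needed)
		return 0;
	if (flags & BDP_ASYNC)
		return -EAGAIN;	/* do not sleep; let the caller defer the write */
	(*slept)++;		/* blocking callers wait here */
	return 0;
}
```

The return value only helps if every layer hands it back to its caller,
so that the writer can stop and offload the rest of the write to a
context that is allowed to block.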
* [PATCH v7 04/15] iomap: Add flags parameter to iomap_page_create()
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
` (2 preceding siblings ...)
2022-06-01 21:01 ` [PATCH v7 03/15] mm: Add balance_dirty_pages_ratelimited_flags() function Stefan Roesch
@ 2022-06-01 21:01 ` Stefan Roesch
2022-06-02 16:26 ` Darrick J. Wong
2022-06-01 21:01 ` [PATCH v7 05/15] iomap: Add async buffered write support Stefan Roesch
` (11 subsequent siblings)
15 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-06-01 21:01 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack, hch, axboe, Christoph Hellwig
Add a flags parameter to the function iomap_page_create(); callers pass
the iomap iter flags. If IOMAP_NOWAIT is set, the allocation uses
GFP_NOWAIT; otherwise it keeps using GFP_NOFS | __GFP_NOFAIL.
No intended functional changes in this patch.
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/iomap/buffered-io.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d2a9f699e17e..705f80cd2d4e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -44,16 +44,23 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
static struct bio_set iomap_ioend_bioset;
static struct iomap_page *
-iomap_page_create(struct inode *inode, struct folio *folio)
+iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
{
struct iomap_page *iop = to_iomap_page(folio);
unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+ gfp_t gfp;
if (iop || nr_blocks <= 1)
return iop;
+ if (flags & IOMAP_NOWAIT)
+ gfp = GFP_NOWAIT;
+ else
+ gfp = GFP_NOFS | __GFP_NOFAIL;
+
iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
- GFP_NOFS | __GFP_NOFAIL);
+ gfp);
+
spin_lock_init(&iop->uptodate_lock);
if (folio_test_uptodate(folio))
bitmap_fill(iop->uptodate, nr_blocks);
@@ -226,7 +233,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
if (WARN_ON_ONCE(size > iomap->length))
return -EIO;
if (offset > 0)
- iop = iomap_page_create(iter->inode, folio);
+ iop = iomap_page_create(iter->inode, folio, iter->flags);
else
iop = to_iomap_page(folio);
@@ -264,7 +271,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
return iomap_read_inline_data(iter, folio);
/* zero post-eof blocks as the page may be mapped */
- iop = iomap_page_create(iter->inode, folio);
+ iop = iomap_page_create(iter->inode, folio, iter->flags);
iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
if (plen == 0)
goto done;
@@ -547,7 +554,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
size_t len, struct folio *folio)
{
const struct iomap *srcmap = iomap_iter_srcmap(iter);
- struct iomap_page *iop = iomap_page_create(iter->inode, folio);
+ struct iomap_page *iop;
loff_t block_size = i_blocksize(iter->inode);
loff_t block_start = round_down(pos, block_size);
loff_t block_end = round_up(pos + len, block_size);
@@ -558,6 +565,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
return 0;
folio_clear_error(folio);
+ iop = iomap_page_create(iter->inode, folio, iter->flags);
+
do {
iomap_adjust_read_range(iter->inode, folio, &block_start,
block_end - block_start, &poff, &plen);
@@ -1329,7 +1338,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
struct writeback_control *wbc, struct inode *inode,
struct folio *folio, u64 end_pos)
{
- struct iomap_page *iop = iomap_page_create(inode, folio);
+ struct iomap_page *iop = iomap_page_create(inode, folio, 0);
struct iomap_ioend *ioend, *next;
unsigned len = i_blocksize(inode);
unsigned nblocks = i_blocks_per_folio(inode, folio);
--
2.30.2
* Re: [PATCH v7 04/15] iomap: Add flags parameter to iomap_page_create()
2022-06-01 21:01 ` [PATCH v7 04/15] iomap: Add flags parameter to iomap_page_create() Stefan Roesch
@ 2022-06-02 16:26 ` Darrick J. Wong
0 siblings, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2022-06-02 16:26 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack, hch, axboe, Christoph Hellwig
On Wed, Jun 01, 2022 at 02:01:30PM -0700, Stefan Roesch wrote:
> Add the kiocb flags parameter to the function iomap_page_create().
> Depending on the value of the flags parameter it enables different gfp
> flags.
>
> No intended functional changes in this patch.
>
> Signed-off-by: Stefan Roesch <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> fs/iomap/buffered-io.c | 21 +++++++++++++++------
> 1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d2a9f699e17e..705f80cd2d4e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -44,16 +44,23 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
> static struct bio_set iomap_ioend_bioset;
>
> static struct iomap_page *
> -iomap_page_create(struct inode *inode, struct folio *folio)
> +iomap_page_create(struct inode *inode, struct folio *folio, unsigned int flags)
> {
> struct iomap_page *iop = to_iomap_page(folio);
> unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
> + gfp_t gfp;
>
> if (iop || nr_blocks <= 1)
> return iop;
>
> + if (flags & IOMAP_NOWAIT)
> + gfp = GFP_NOWAIT;
> + else
> + gfp = GFP_NOFS | __GFP_NOFAIL;
Thanks for changing this!
Reviewed-by: Darrick J. Wong <[email protected]>
* [PATCH v7 05/15] iomap: Add async buffered write support
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
` (3 preceding siblings ...)
2022-06-01 21:01 ` [PATCH v7 04/15] iomap: Add flags parameter to iomap_page_create() Stefan Roesch
@ 2022-06-01 21:01 ` Stefan Roesch
2022-06-01 21:01 ` [PATCH v7 06/15] iomap: Return error code from iomap_write_iter() Stefan Roesch
` (10 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-06-01 21:01 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack, hch, axboe, Christoph Hellwig
This adds async buffered write support to iomap.
This replaces the call to balance_dirty_pages_ratelimited() with a call
to balance_dirty_pages_ratelimited_flags(). This allows the caller to
specify whether the write request is async or not.
In addition this also moves the above function call to the beginning of
the write loop in iomap_write_iter(). If the call is at the end of the
loop and the decision is made to throttle writes, then there is no
request that io-uring can wait on. By moving it to the beginning of the
loop, the write request is not issued, but returns -EAGAIN instead.
io-uring will punt the request and process it in the io-worker.
By moving the function call to the beginning of the loop, the write
throttling will happen one page later.
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/iomap/buffered-io.c | 33 ++++++++++++++++++++++++++++-----
1 file changed, 28 insertions(+), 5 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 705f80cd2d4e..b06a5c24a4db 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -558,6 +558,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
loff_t block_size = i_blocksize(iter->inode);
loff_t block_start = round_down(pos, block_size);
loff_t block_end = round_up(pos + len, block_size);
+ unsigned int nr_blocks = i_blocks_per_folio(iter->inode, folio);
size_t from = offset_in_folio(folio, pos), to = from + len;
size_t poff, plen;
@@ -566,6 +567,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
folio_clear_error(folio);
iop = iomap_page_create(iter->inode, folio, iter->flags);
+ if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
+ return -EAGAIN;
do {
iomap_adjust_read_range(iter->inode, folio, &block_start,
@@ -583,7 +586,12 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
return -EIO;
folio_zero_segments(folio, poff, from, to, poff + plen);
} else {
- int status = iomap_read_folio_sync(block_start, folio,
+ int status;
+
+ if (iter->flags & IOMAP_NOWAIT)
+ return -EAGAIN;
+
+ status = iomap_read_folio_sync(block_start, folio,
poff, plen, srcmap);
if (status)
return status;
@@ -612,6 +620,9 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
int status = 0;
+ if (iter->flags & IOMAP_NOWAIT)
+ fgp |= FGP_NOWAIT;
+
BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
if (srcmap != &iter->iomap)
BUG_ON(pos + len > srcmap->offset + srcmap->length);
@@ -631,7 +642,7 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
fgp, mapping_gfp_mask(iter->inode->i_mapping));
if (!folio) {
- status = -ENOMEM;
+ status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
goto out_no_page;
}
if (pos + len > folio_pos(folio) + folio_size(folio))
@@ -749,6 +760,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
loff_t pos = iter->pos;
ssize_t written = 0;
long status = 0;
+ struct address_space *mapping = iter->inode->i_mapping;
+ unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
do {
struct folio *folio;
@@ -761,6 +774,11 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
bytes = min_t(unsigned long, PAGE_SIZE - offset,
iov_iter_count(i));
again:
+ status = balance_dirty_pages_ratelimited_flags(mapping,
+ bdp_flags);
+ if (unlikely(status))
+ break;
+
if (bytes > length)
bytes = length;
@@ -769,6 +787,10 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
* Otherwise there's a nasty deadlock on copying from the
* same page as we're writing to, without it being marked
* up-to-date.
+ *
+ * For async buffered writes the assumption is that the user
+ * page has already been faulted in. This can be optimized by
+ * faulting the user page.
*/
if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
status = -EFAULT;
@@ -780,7 +802,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
break;
page = folio_file_page(folio, pos >> PAGE_SHIFT);
- if (mapping_writably_mapped(iter->inode->i_mapping))
+ if (mapping_writably_mapped(mapping))
flush_dcache_page(page);
copied = copy_page_from_iter_atomic(page, offset, bytes, i);
@@ -805,8 +827,6 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
pos += status;
written += status;
length -= status;
-
- balance_dirty_pages_ratelimited(iter->inode->i_mapping);
} while (iov_iter_count(i) && length);
return written ? written : status;
@@ -824,6 +844,9 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
};
int ret;
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ iter.flags |= IOMAP_NOWAIT;
+
while ((ret = iomap_iter(&iter, ops)) > 0)
iter.processed = iomap_write_iter(&iter, i);
if (iter.pos == iocb->ki_pos)
--
2.30.2
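
The effect of checking for throttling at the top of the write loop can be
modelled in userspace as below. Everything here is a hypothetical sketch:
struct dirty_model, balance_sketch() and write_iter_sketch() are
illustration names, the "limit" is a made-up stand-in for the dirty
threshold, and the blocking path is modelled as simply continuing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model: throttling applies once `limit` pages are dirty. */
struct dirty_model {
	int dirtied;
	int limit;
};

/* Async callers get -EAGAIN when throttled; blocking callers carry on. */
static int balance_sketch(const struct dirty_model *m, int async)
{
	if (m->dirtied < m->limit)
		return 0;
	return async ? -EAGAIN : 0;
}

/* Mirrors the loop shape of the patch: check first, then write one page. */
static long write_iter_sketch(struct dirty_model *m, int npages, int async)
{
	long written = 0;
	long status = 0;

	assert(m != NULL);
	while (npages > 0) {
		status = balance_sketch(m, async);
		if (status)
			break;
		m->dirtied++;	/* one more page dirtied by this write */
		written++;
		npages--;
	}
	return written ? written : status;
}
```

In this model an async writer that is already over the limit returns
-EAGAIN before touching any page, while one that crosses the limit
part-way through reports the pages it managed to write.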
* [PATCH v7 06/15] iomap: Return error code from iomap_write_iter()
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
` (4 preceding siblings ...)
2022-06-01 21:01 ` [PATCH v7 05/15] iomap: Add async buffered write support Stefan Roesch
@ 2022-06-01 21:01 ` Stefan Roesch
2022-06-02 12:38 ` Matthew Wilcox
2022-06-01 21:01 ` [PATCH v7 07/15] fs: Add check for async buffered writes to generic_write_checks Stefan Roesch
` (9 subsequent siblings)
15 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-06-01 21:01 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack, hch, axboe
Change the signature of iomap_write_iter() to return an error code. In
case we cannot allocate a page in iomap_write_begin(), we will not retry
the memory allocation in iomap_write_begin().
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/iomap/buffered-io.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index b06a5c24a4db..e96ab9a3072c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -754,12 +754,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
return ret;
}
-static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
+static int iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i, loff_t *processed)
{
loff_t length = iomap_length(iter);
loff_t pos = iter->pos;
ssize_t written = 0;
long status = 0;
+ int error = 0;
struct address_space *mapping = iter->inode->i_mapping;
unsigned int bdp_flags = (iter->flags & IOMAP_NOWAIT) ? BDP_ASYNC : 0;
@@ -774,9 +775,9 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
bytes = min_t(unsigned long, PAGE_SIZE - offset,
iov_iter_count(i));
again:
- status = balance_dirty_pages_ratelimited_flags(mapping,
+ error = balance_dirty_pages_ratelimited_flags(mapping,
bdp_flags);
- if (unlikely(status))
+ if (unlikely(error))
break;
if (bytes > length)
@@ -793,12 +794,12 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
* faulting the user page.
*/
if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
- status = -EFAULT;
+ error = -EFAULT;
break;
}
- status = iomap_write_begin(iter, pos, bytes, &folio);
- if (unlikely(status))
+ error = iomap_write_begin(iter, pos, bytes, &folio);
+ if (unlikely(error))
break;
page = folio_file_page(folio, pos >> PAGE_SHIFT);
@@ -829,7 +830,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
length -= status;
} while (iov_iter_count(i) && length);
- return written ? written : status;
+ *processed = written ? written : error;
+ return error;
}
ssize_t
@@ -843,12 +845,15 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
.flags = IOMAP_WRITE,
};
int ret;
+ int error = 0;
if (iocb->ki_flags & IOCB_NOWAIT)
iter.flags |= IOMAP_NOWAIT;
- while ((ret = iomap_iter(&iter, ops)) > 0)
- iter.processed = iomap_write_iter(&iter, i);
+ while ((ret = iomap_iter(&iter, ops)) > 0) {
+ if (error != -EAGAIN)
+ error = iomap_write_iter(&iter, i, &iter.processed);
+ }
if (iter.pos == iocb->ki_pos)
return ret;
return iter.pos - iocb->ki_pos;
--
2.30.2
* Re: [PATCH v7 06/15] iomap: Return error code from iomap_write_iter()
2022-06-01 21:01 ` [PATCH v7 06/15] iomap: Return error code from iomap_write_iter() Stefan Roesch
@ 2022-06-02 12:38 ` Matthew Wilcox
2022-06-02 17:08 ` Stefan Roesch
0 siblings, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2022-06-02 12:38 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack, hch, axboe
On Wed, Jun 01, 2022 at 02:01:32PM -0700, Stefan Roesch wrote:
> Change the signature of iomap_write_iter() to return an error code. In
> case we cannot allocate a page in iomap_write_begin(), we will not retry
> the memory allocation in iomap_write_begin().
loff_t can already represent an error code. And it's already used like
that.
> @@ -829,7 +830,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
> length -= status;
> } while (iov_iter_count(i) && length);
>
> - return written ? written : status;
> + *processed = written ? written : error;
> + return error;
I think the change you really want is:
if (status == -EAGAIN)
return -EAGAIN;
if (written)
return written;
return status;
> @@ -843,12 +845,15 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
> .flags = IOMAP_WRITE,
> };
> int ret;
> + int error = 0;
>
> if (iocb->ki_flags & IOCB_NOWAIT)
> iter.flags |= IOMAP_NOWAIT;
>
> - while ((ret = iomap_iter(&iter, ops)) > 0)
> - iter.processed = iomap_write_iter(&iter, i);
> + while ((ret = iomap_iter(&iter, ops)) > 0) {
> + if (error != -EAGAIN)
> + error = iomap_write_iter(&iter, i, &iter.processed);
> + }
You don't need to change any of this. Look at how iomap_iter_advance()
works.
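
The return ordering suggested in this reply can be checked in isolation
with a small userspace function. write_iter_result() is a hypothetical
name; the three-step ordering itself is taken from the snippet above.

```c
#include <assert.h>
#include <errno.h>

/*
 * written: bytes copied so far (never negative).
 * status:  0 or a negative errno from the last step of the loop.
 * -EAGAIN takes priority over a partial count; any other error is
 * reported only when nothing was written.
 */
static long long write_iter_result(long long written, long status)
{
	assert(written >= 0);
	if (status == -EAGAIN)
		return -EAGAIN;
	if (written)
		return written;
	return status;
}
```

The difference from "written ? written : status" shows up only when some
bytes were written and the loop then hit -EAGAIN: the old form returns
the byte count, the suggested form returns -EAGAIN.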
* Re: [PATCH v7 06/15] iomap: Return error code from iomap_write_iter()
2022-06-02 12:38 ` Matthew Wilcox
@ 2022-06-02 17:08 ` Stefan Roesch
0 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-06-02 17:08 UTC (permalink / raw)
To: Matthew Wilcox
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack, hch, axboe
On 6/2/22 5:38 AM, Matthew Wilcox wrote:
> On Wed, Jun 01, 2022 at 02:01:32PM -0700, Stefan Roesch wrote:
>> Change the signature of iomap_write_iter() to return an error code. In
>> case we cannot allocate a page in iomap_write_begin(), we will not retry
>> the memory allocation in iomap_write_begin().
>
> loff_t can already represent an error code. And it's already used like
> that.
>
>> @@ -829,7 +830,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>> length -= status;
>> } while (iov_iter_count(i) && length);
>>
>> - return written ? written : status;
>> + *processed = written ? written : error;
>> + return error;
>
> I think the change you really want is:
>
> if (status == -EAGAIN)
> return -EAGAIN;
> if (written)
> return written;
> return status;
>
Correct, I made the above change.
>> @@ -843,12 +845,15 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
>> .flags = IOMAP_WRITE,
>> };
>> int ret;
>> + int error = 0;
>>
>> if (iocb->ki_flags & IOCB_NOWAIT)
>> iter.flags |= IOMAP_NOWAIT;
>>
>> - while ((ret = iomap_iter(&iter, ops)) > 0)
>> - iter.processed = iomap_write_iter(&iter, i);
>> + while ((ret = iomap_iter(&iter, ops)) > 0) {
>> + if (error != -EAGAIN)
>> + error = iomap_write_iter(&iter, i, &iter.processed);
>> + }
>
> You don't need to change any of this. Look at how iomap_iter_advance()
> works.
>
* [PATCH v7 07/15] fs: Add check for async buffered writes to generic_write_checks
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
` (5 preceding siblings ...)
2022-06-01 21:01 ` [PATCH v7 06/15] iomap: Return error code from iomap_write_iter() Stefan Roesch
@ 2022-06-01 21:01 ` Stefan Roesch
2022-06-01 21:01 ` [PATCH v7 08/15] fs: add __remove_file_privs() with flags parameter Stefan Roesch
` (8 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-06-01 21:01 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack, hch, axboe, Christoph Hellwig
This introduces the flag FMODE_BUF_WASYNC. A filesystem that supports
async buffered writes can set this flag on the file. It also modifies the
check in generic_write_checks_count() to take async buffered writes into
consideration.
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/read_write.c | 4 +++-
include/linux/fs.h | 3 +++
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index e643aec2b0ef..175d98713b9a 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1633,7 +1633,9 @@ int generic_write_checks_count(struct kiocb *iocb, loff_t *count)
if (iocb->ki_flags & IOCB_APPEND)
iocb->ki_pos = i_size_read(inode);
- if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+ if ((iocb->ki_flags & IOCB_NOWAIT) &&
+ !((iocb->ki_flags & IOCB_DIRECT) ||
+ (file->f_mode & FMODE_BUF_WASYNC)))
return -EINVAL;
return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 01403e637271..bdf1ce48a458 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -180,6 +180,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File supports async buffered reads */
#define FMODE_BUF_RASYNC ((__force fmode_t)0x40000000)
+/* File supports async nowait buffered writes */
+#define FMODE_BUF_WASYNC ((__force fmode_t)0x80000000)
+
/*
* Attribute flags. These should be or-ed together to figure out what
* has been changed!
--
2.30.2
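
The relaxed IOCB_NOWAIT check from patch 07 can be exercised in userspace
with the sketch below. The SK_* bit values are hypothetical and do not
match the kernel's IOCB_* or FMODE_* constants; nowait_write_check() is
an illustration name, and only the shape of the condition follows the
patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SK_IOCB_NOWAIT		0x1u
#define SK_IOCB_DIRECT		0x2u
#define SK_FMODE_BUF_WASYNC	0x1u

/*
 * A nowait write is accepted if it is direct I/O or if the file
 * advertises async buffered write support; otherwise it is rejected.
 */
static int nowait_write_check(unsigned int ki_flags, unsigned int f_mode)
{
	/* The two request flags must be distinct bits for the test to work. */
	assert((SK_IOCB_NOWAIT & SK_IOCB_DIRECT) == 0);

	if ((ki_flags & SK_IOCB_NOWAIT) &&
	    !((ki_flags & SK_IOCB_DIRECT) || (f_mode & SK_FMODE_BUF_WASYNC)))
		return -EINVAL;
	return 0;
}
```

Before the patch, the second half of the condition considered only the
direct I/O flag, so every buffered nowait write was rejected.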
* [PATCH v7 08/15] fs: add __remove_file_privs() with flags parameter
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
` (6 preceding siblings ...)
2022-06-01 21:01 ` [PATCH v7 07/15] fs: Add check for async buffered writes to generic_write_checks Stefan Roesch
@ 2022-06-01 21:01 ` Stefan Roesch
2022-06-02 9:04 ` Jan Kara
2022-06-01 21:01 ` [PATCH v7 09/15] fs: Split off inode_needs_update_time and __file_update_time Stefan Roesch
` (7 subsequent siblings)
15 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-06-01 21:01 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack, hch, axboe, Christoph Hellwig
This adds the function __file_remove_privs(), which allows the caller to
pass the kiocb flags parameter.
No intended functional changes in this patch.
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/inode.c | 57 +++++++++++++++++++++++++++++++++++-------------------
1 file changed, 37 insertions(+), 20 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 9d9b422504d1..ac1cf5aa78c8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2010,36 +2010,43 @@ static int __remove_privs(struct user_namespace *mnt_userns,
return notify_change(mnt_userns, dentry, &newattrs, NULL);
}
-/*
- * Remove special file priviledges (suid, capabilities) when file is written
- * to or truncated.
- */
-int file_remove_privs(struct file *file)
+static int __file_remove_privs(struct file *file, unsigned int flags)
{
struct dentry *dentry = file_dentry(file);
struct inode *inode = file_inode(file);
+ int error;
int kill;
- int error = 0;
- /*
- * Fast path for nothing security related.
- * As well for non-regular files, e.g. blkdev inodes.
- * For example, blkdev_write_iter() might get here
- * trying to remove privs which it is not allowed to.
- */
if (IS_NOSEC(inode) || !S_ISREG(inode->i_mode))
return 0;
kill = dentry_needs_remove_privs(dentry);
- if (kill < 0)
+ if (kill <= 0)
return kill;
- if (kill)
- error = __remove_privs(file_mnt_user_ns(file), dentry, kill);
+
+ if (flags & IOCB_NOWAIT)
+ return -EAGAIN;
+
+ error = __remove_privs(file_mnt_user_ns(file), dentry, kill);
if (!error)
inode_has_no_xattr(inode);
return error;
}
+
+/**
+ * file_remove_privs - remove special file privileges (suid, capabilities)
+ * @file: file to remove privileges from
+ *
+ * When file is modified by a write or truncation ensure that special
+ * file privileges are removed.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int file_remove_privs(struct file *file)
+{
+ return __file_remove_privs(file, 0);
+}
EXPORT_SYMBOL(file_remove_privs);
/**
@@ -2090,18 +2097,28 @@ int file_update_time(struct file *file)
}
EXPORT_SYMBOL(file_update_time);
-/* Caller must hold the file's inode lock */
+/**
+ * file_modified - handle mandated vfs changes when modifying a file
+ * @file: file that was modified
+ *
+ * When file has been modified ensure that special
+ * file privileges are removed and time settings are updated.
+ *
+ * Context: Caller must hold the file's inode lock.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
int file_modified(struct file *file)
{
- int err;
+ int ret;
/*
* Clear the security bits if the process is not being run by root.
* This keeps people from modifying setuid and setgid binaries.
*/
- err = file_remove_privs(file);
- if (err)
- return err;
+ ret = __file_remove_privs(file, 0);
+ if (ret)
+ return ret;
if (unlikely(file->f_mode & FMODE_NOCMTIME))
return 0;
--
2.30.2
* Re: [PATCH v7 08/15] fs: add __remove_file_privs() with flags parameter
2022-06-01 21:01 ` [PATCH v7 08/15] fs: add __remove_file_privs() with flags parameter Stefan Roesch
@ 2022-06-02 9:04 ` Jan Kara
0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2022-06-02 9:04 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack, hch, axboe, Christoph Hellwig
On Wed 01-06-22 14:01:34, Stefan Roesch wrote:
> This adds the function __remove_file_privs, which allows the caller to
> pass the kiocb flags parameter.
>
> No intended functional changes in this patch.
>
> Signed-off-by: Stefan Roesch <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
* [PATCH v7 09/15] fs: Split off inode_needs_update_time and __file_update_time
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
` (7 preceding siblings ...)
2022-06-01 21:01 ` [PATCH v7 08/15] fs: add __remove_file_privs() with flags parameter Stefan Roesch
@ 2022-06-01 21:01 ` Stefan Roesch
2022-06-02 8:44 ` Jan Kara
2022-06-02 12:57 ` Matthew Wilcox
2022-06-01 21:01 ` [PATCH v7 10/15] fs: Add async write file modification handling Stefan Roesch
` (6 subsequent siblings)
15 siblings, 2 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-06-01 21:01 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack, hch, axboe
This splits off the functions inode_needs_update_time() and
__file_update_time() from the function file_update_time().
This is required to support async buffered writes.
No intended functional changes in this patch.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/inode.c | 76 +++++++++++++++++++++++++++++++++++-------------------
1 file changed, 50 insertions(+), 26 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index ac1cf5aa78c8..c44573a32c6a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2049,35 +2049,18 @@ int file_remove_privs(struct file *file)
}
EXPORT_SYMBOL(file_remove_privs);
-/**
- * file_update_time - update mtime and ctime time
- * @file: file accessed
- *
- * Update the mtime and ctime members of an inode and mark the inode
- * for writeback. Note that this function is meant exclusively for
- * usage in the file write path of filesystems, and filesystems may
- * choose to explicitly ignore update via this function with the
- * S_NOCMTIME inode flag, e.g. for network filesystem where these
- * timestamps are handled by the server. This can return an error for
- * file systems who need to allocate space in order to update an inode.
- */
-
-int file_update_time(struct file *file)
+static int inode_needs_update_time(struct inode *inode, struct timespec64 *now)
{
- struct inode *inode = file_inode(file);
- struct timespec64 now;
int sync_it = 0;
- int ret;
/* First try to exhaust all avenues to not sync */
if (IS_NOCMTIME(inode))
return 0;
- now = current_time(inode);
- if (!timespec64_equal(&inode->i_mtime, &now))
+ if (!timespec64_equal(&inode->i_mtime, now))
sync_it = S_MTIME;
- if (!timespec64_equal(&inode->i_ctime, &now))
+ if (!timespec64_equal(&inode->i_ctime, now))
sync_it |= S_CTIME;
if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
@@ -2086,15 +2069,50 @@ int file_update_time(struct file *file)
if (!sync_it)
return 0;
- /* Finally allowed to write? Takes lock. */
- if (__mnt_want_write_file(file))
- return 0;
+ return sync_it;
+}
+
+static int __file_update_time(struct file *file, struct timespec64 *now,
+ int sync_mode)
+{
+ int ret = 0;
+ struct inode *inode = file_inode(file);
- ret = inode_update_time(inode, &now, sync_it);
- __mnt_drop_write_file(file);
+ /* try to update time settings */
+ if (!__mnt_want_write_file(file)) {
+ ret = inode_update_time(inode, now, sync_mode);
+ __mnt_drop_write_file(file);
+ }
return ret;
}
+
+ /**
+ * file_update_time - update mtime and ctime time
+ * @file: file accessed
+ *
+ * Update the mtime and ctime members of an inode and mark the inode for
+ * writeback. Note that this function is meant exclusively for usage in
+ * the file write path of filesystems, and filesystems may choose to
+ * explicitly ignore updates via this function with the S_NOCMTIME inode
+ * flag, e.g. for network filesystem where these timestamps are handled
+ * by the server. This can return an error for file systems who need to
+ * allocate space in order to update an inode.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int file_update_time(struct file *file)
+{
+ int ret;
+ struct inode *inode = file_inode(file);
+ struct timespec64 now = current_time(inode);
+
+ ret = inode_needs_update_time(inode, &now);
+ if (ret <= 0)
+ return ret;
+
+ return __file_update_time(file, &now, ret);
+}
EXPORT_SYMBOL(file_update_time);
/**
@@ -2111,6 +2129,8 @@ EXPORT_SYMBOL(file_update_time);
int file_modified(struct file *file)
{
int ret;
+ struct inode *inode = file_inode(file);
+ struct timespec64 now = current_time(inode);
/*
* Clear the security bits if the process is not being run by root.
@@ -2123,7 +2143,11 @@ int file_modified(struct file *file)
if (unlikely(file->f_mode & FMODE_NOCMTIME))
return 0;
- return file_update_time(file);
+ ret = inode_needs_update_time(inode, &now);
+ if (ret <= 0)
+ return ret;
+
+ return __file_update_time(file, &now, ret);
}
EXPORT_SYMBOL(file_modified);
--
2.30.2
* Re: [PATCH v7 09/15] fs: Split off inode_needs_update_time and __file_update_time
2022-06-01 21:01 ` [PATCH v7 09/15] fs: Split off inode_needs_update_time and __file_update_time Stefan Roesch
@ 2022-06-02 8:44 ` Jan Kara
2022-06-02 12:57 ` Matthew Wilcox
1 sibling, 0 replies; 32+ messages in thread
From: Jan Kara @ 2022-06-02 8:44 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack, hch, axboe
On Wed 01-06-22 14:01:35, Stefan Roesch wrote:
> This splits off the functions inode_needs_update_time() and
> __file_update_time() from the function file_update_time().
>
> This is required to support async buffered writes.
> No intended functional changes in this patch.
>
> Signed-off-by: Stefan Roesch <[email protected]>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
* Re: [PATCH v7 09/15] fs: Split off inode_needs_update_time and __file_update_time
2022-06-01 21:01 ` [PATCH v7 09/15] fs: Split off inode_needs_update_time and __file_update_time Stefan Roesch
2022-06-02 8:44 ` Jan Kara
@ 2022-06-02 12:57 ` Matthew Wilcox
1 sibling, 0 replies; 32+ messages in thread
From: Matthew Wilcox @ 2022-06-02 12:57 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack, hch, axboe
On Wed, Jun 01, 2022 at 02:01:35PM -0700, Stefan Roesch wrote:
> + /**
> + * file_update_time - update mtime and ctime time
> + * @file: file accessed
> + *
> + * Update the mtime and ctime members of an inode and mark the inode for
> + * writeback. Note that this function is meant exclusively for usage in
> + * the file write path of filesystems, and filesystems may choose to
> + * explicitly ignore updates via this function with the _NOCMTIME inode
> + * flag, e.g. for network filesystem where these imestamps are handled
> + * by the server. This can return an error for file systems who need to
> + * allocate space in order to update an inode.
> + *
> + * Return: 0 on success, negative errno on failure.
> + */
Can you remove the extra leading space from each of these lines?
* [PATCH v7 10/15] fs: Add async write file modification handling.
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
` (8 preceding siblings ...)
2022-06-01 21:01 ` [PATCH v7 09/15] fs: Split off inode_needs_update_time and __file_update_time Stefan Roesch
@ 2022-06-01 21:01 ` Stefan Roesch
2022-06-02 8:44 ` Jan Kara
2022-06-02 9:06 ` Jan Kara
2022-06-01 21:01 ` [PATCH v7 11/15] fs: Optimization for concurrent file time updates Stefan Roesch
` (5 subsequent siblings)
15 siblings, 2 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-06-01 21:01 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack, hch, axboe, Christoph Hellwig
This adds a kiocb_modified() function that returns -EAGAIN if the
request either needs to remove privileges or needs to update the file
modification time. This is required for async buffered writes, so the
request gets handled in the io worker of io-uring.
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/inode.c | 43 +++++++++++++++++++++++++++++++++++++++++--
include/linux/fs.h | 1 +
2 files changed, 42 insertions(+), 2 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index c44573a32c6a..4503bed063e7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2116,17 +2116,21 @@ int file_update_time(struct file *file)
EXPORT_SYMBOL(file_update_time);
/**
- * file_modified - handle mandated vfs changes when modifying a file
+ * file_modified_flags - handle mandated vfs changes when modifying a file
* @file: file that was modified
+ * @flags: kiocb flags
*
* When file has been modified ensure that special
* file privileges are removed and time settings are updated.
*
+ * If IOCB_NOWAIT is set, special file privileges will not be removed and
+ * time settings will not be updated. It will return -EAGAIN.
+ *
* Context: Caller must hold the file's inode lock.
*
* Return: 0 on success, negative errno on failure.
*/
-int file_modified(struct file *file)
+static int file_modified_flags(struct file *file, int flags)
{
int ret;
struct inode *inode = file_inode(file);
@@ -2146,11 +2150,46 @@ int file_modified(struct file *file)
ret = inode_needs_update_time(inode, &now);
if (ret <= 0)
return ret;
+ if (flags & IOCB_NOWAIT)
+ return -EAGAIN;
return __file_update_time(file, &now, ret);
}
+
+/**
+ * file_modified - handle mandated vfs changes when modifying a file
+ * @file: file that was modified
+ *
+ * When file has been modified ensure that special
+ * file privileges are removed and time settings are updated.
+ *
+ * Context: Caller must hold the file's inode lock.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int file_modified(struct file *file)
+{
+ return file_modified_flags(file, 0);
+}
EXPORT_SYMBOL(file_modified);
+/**
+ * kiocb_modified - handle mandated vfs changes when modifying a file
+ * @iocb: iocb that was modified
+ *
+ * When file has been modified ensure that special
+ * file privileges are removed and time settings are updated.
+ *
+ * Context: Caller must hold the file's inode lock.
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int kiocb_modified(struct kiocb *iocb)
+{
+ return file_modified_flags(iocb->ki_filp, iocb->ki_flags);
+}
+EXPORT_SYMBOL_GPL(kiocb_modified);
+
int inode_needs_sync(struct inode *inode)
{
if (IS_SYNC(inode))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bdf1ce48a458..553e57ec3efa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2392,6 +2392,7 @@ static inline void file_accessed(struct file *file)
}
extern int file_modified(struct file *file);
+int kiocb_modified(struct kiocb *iocb);
int sync_inode_metadata(struct inode *inode, int wait);
--
2.30.2
* Re: [PATCH v7 10/15] fs: Add async write file modification handling.
2022-06-01 21:01 ` [PATCH v7 10/15] fs: Add async write file modification handling Stefan Roesch
@ 2022-06-02 8:44 ` Jan Kara
2022-06-02 9:06 ` Jan Kara
1 sibling, 0 replies; 32+ messages in thread
From: Jan Kara @ 2022-06-02 8:44 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack, hch, axboe, Christoph Hellwig
On Wed 01-06-22 14:01:36, Stefan Roesch wrote:
> This adds a file_modified_async() function to return -EAGAIN if the
> request either requires to remove privileges or needs to update the file
> modification time. This is required for async buffered writes, so the
> request gets handled in the io worker of io-uring.
>
> Signed-off-by: Stefan Roesch <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/inode.c | 43 +++++++++++++++++++++++++++++++++++++++++--
> include/linux/fs.h | 1 +
> 2 files changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index c44573a32c6a..4503bed063e7 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2116,17 +2116,21 @@ int file_update_time(struct file *file)
> EXPORT_SYMBOL(file_update_time);
>
> /**
> - * file_modified - handle mandated vfs changes when modifying a file
> + * file_modified_flags - handle mandated vfs changes when modifying a file
> * @file: file that was modified
> + * @flags: kiocb flags
> *
> * When file has been modified ensure that special
> * file privileges are removed and time settings are updated.
> *
> + * If IOCB_NOWAIT is set, special file privileges will not be removed and
> + * time settings will not be updated. It will return -EAGAIN.
> + *
> * Context: Caller must hold the file's inode lock.
> *
> * Return: 0 on success, negative errno on failure.
> */
> -int file_modified(struct file *file)
> +static int file_modified_flags(struct file *file, int flags)
> {
> int ret;
> struct inode *inode = file_inode(file);
> @@ -2146,11 +2150,46 @@ int file_modified(struct file *file)
> ret = inode_needs_update_time(inode, &now);
> if (ret <= 0)
> return ret;
> + if (flags & IOCB_NOWAIT)
> + return -EAGAIN;
>
> return __file_update_time(file, &now, ret);
> }
> +
> +/**
> + * file_modified - handle mandated vfs changes when modifying a file
> + * @file: file that was modified
> + *
> + * When file has been modified ensure that special
> + * file privileges are removed and time settings are updated.
> + *
> + * Context: Caller must hold the file's inode lock.
> + *
> + * Return: 0 on success, negative errno on failure.
> + */
> +int file_modified(struct file *file)
> +{
> + return file_modified_flags(file, 0);
> +}
> EXPORT_SYMBOL(file_modified);
>
> +/**
> + * kiocb_modified - handle mandated vfs changes when modifying a file
> + * @iocb: iocb that was modified
> + *
> + * When file has been modified ensure that special
> + * file privileges are removed and time settings are updated.
> + *
> + * Context: Caller must hold the file's inode lock.
> + *
> + * Return: 0 on success, negative errno on failure.
> + */
> +int kiocb_modified(struct kiocb *iocb)
> +{
> + return file_modified_flags(iocb->ki_filp, iocb->ki_flags);
> +}
> +EXPORT_SYMBOL_GPL(kiocb_modified);
> +
> int inode_needs_sync(struct inode *inode)
> {
> if (IS_SYNC(inode))
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index bdf1ce48a458..553e57ec3efa 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2392,6 +2392,7 @@ static inline void file_accessed(struct file *file)
> }
>
> extern int file_modified(struct file *file);
> +int kiocb_modified(struct kiocb *iocb);
>
> int sync_inode_metadata(struct inode *inode, int wait);
>
> --
> 2.30.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
* Re: [PATCH v7 10/15] fs: Add async write file modification handling.
2022-06-01 21:01 ` [PATCH v7 10/15] fs: Add async write file modification handling Stefan Roesch
2022-06-02 8:44 ` Jan Kara
@ 2022-06-02 9:06 ` Jan Kara
2022-06-02 21:00 ` Stefan Roesch
1 sibling, 1 reply; 32+ messages in thread
From: Jan Kara @ 2022-06-02 9:06 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack, hch, axboe, Christoph Hellwig
On Wed 01-06-22 14:01:36, Stefan Roesch wrote:
> This adds a file_modified_async() function to return -EAGAIN if the
> request either requires to remove privileges or needs to update the file
> modification time. This is required for async buffered writes, so the
> request gets handled in the io worker of io-uring.
>
> Signed-off-by: Stefan Roesch <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
I've found one small bug here:
> diff --git a/fs/inode.c b/fs/inode.c
> index c44573a32c6a..4503bed063e7 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
...
> -int file_modified(struct file *file)
> +static int file_modified_flags(struct file *file, int flags)
> {
> int ret;
> struct inode *inode = file_inode(file);
We need to use 'flags' for __file_remove_privs_flags() call in this patch.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
* Re: [PATCH v7 10/15] fs: Add async write file modification handling.
2022-06-02 9:06 ` Jan Kara
@ 2022-06-02 21:00 ` Stefan Roesch
2022-06-03 10:12 ` Jan Kara
0 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-06-02 21:00 UTC (permalink / raw)
To: Jan Kara
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
hch, axboe, Christoph Hellwig
On 6/2/22 2:06 AM, Jan Kara wrote:
> On Wed 01-06-22 14:01:36, Stefan Roesch wrote:
>> This adds a file_modified_async() function to return -EAGAIN if the
>> request either requires to remove privileges or needs to update the file
>> modification time. This is required for async buffered writes, so the
>> request gets handled in the io worker of io-uring.
>>
>> Signed-off-by: Stefan Roesch <[email protected]>
>> Reviewed-by: Christoph Hellwig <[email protected]>
>
> I've found one small bug here:
>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index c44573a32c6a..4503bed063e7 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
> ...
>> -int file_modified(struct file *file)
>> +static int file_modified_flags(struct file *file, int flags)
>> {
>> int ret;
>> struct inode *inode = file_inode(file);
>
> We need to use 'flags' for __file_remove_privs_flags() call in this patch.
>
I assume that you meant that the function should not be called __file_remove_privs(),
but instead file_remove_privs_flags(). Is that correct?
This would need to be changed in patch 8: "fs: add __remove_file_privs() with flags parameter"
> Honza
>
* Re: [PATCH v7 10/15] fs: Add async write file modification handling.
2022-06-02 21:00 ` Stefan Roesch
@ 2022-06-03 10:12 ` Jan Kara
0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2022-06-03 10:12 UTC (permalink / raw)
To: Stefan Roesch
Cc: Jan Kara, io-uring, kernel-team, linux-mm, linux-xfs,
linux-fsdevel, david, hch, axboe, Christoph Hellwig
On Thu 02-06-22 14:00:38, Stefan Roesch wrote:
>
>
> On 6/2/22 2:06 AM, Jan Kara wrote:
> > On Wed 01-06-22 14:01:36, Stefan Roesch wrote:
> >> This adds a file_modified_async() function to return -EAGAIN if the
> >> request either requires to remove privileges or needs to update the file
> >> modification time. This is required for async buffered writes, so the
> >> request gets handled in the io worker of io-uring.
> >>
> >> Signed-off-by: Stefan Roesch <[email protected]>
> >> Reviewed-by: Christoph Hellwig <[email protected]>
> >
> > I've found one small bug here:
> >
> >> diff --git a/fs/inode.c b/fs/inode.c
> >> index c44573a32c6a..4503bed063e7 100644
> >> --- a/fs/inode.c
> >> +++ b/fs/inode.c
> > ...
> >> -int file_modified(struct file *file)
> >> +static int file_modified_flags(struct file *file, int flags)
> >> {
> >> int ret;
> >> struct inode *inode = file_inode(file);
> >
> > We need to use 'flags' for __file_remove_privs_flags() call in this patch.
> >
>
> I assume that you meant that the function should not be called _file_remove_privs(),
> but instead file_remove_privs_flags(). Is that correct?
No, I meant that patch 8 adds a call to __file_remove_privs(..., 0) in
file_modified(), and this patch then forgets to update that call to
__file_remove_privs(..., flags) so that the information propagates properly.
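
The propagation problem can be illustrated with a small userspace model.
This is only a sketch, not kernel code: every model_* name and the
MODEL_IOCB_NOWAIT value are hypothetical stand-ins, and the blocking work
is reduced to flipping a boolean.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for IOCB_NOWAIT; the value is arbitrary. */
#define MODEL_IOCB_NOWAIT 0x1u

struct model_file {
	bool needs_priv_removal;	/* e.g. a setuid bit is still set */
	bool privs_removed;		/* set once the blocking work has run */
};

/* Models the helper from patch 8: refuse to block when NOWAIT is set. */
static int model_remove_privs(struct model_file *f, unsigned int flags)
{
	assert(f != NULL);
	if (!f->needs_priv_removal)
		return 0;
	if (flags & MODEL_IOCB_NOWAIT)
		return -EAGAIN;
	/* Stands in for the (potentially blocking) privilege removal. */
	f->needs_priv_removal = false;
	f->privs_removed = true;
	return 0;
}

/* As posted: the caller's flags are dropped and 0 is passed down. */
static int model_file_modified_v7(struct model_file *f, unsigned int flags)
{
	(void)flags;
	return model_remove_privs(f, 0);
}

/* With the requested change: the flags reach the helper. */
static int model_file_modified_fixed(struct model_file *f, unsigned int flags)
{
	return model_remove_privs(f, flags);
}
```

In this model the posted variant performs the blocking work even for a
NOWAIT caller, while the fixed variant returns -EAGAIN and leaves the
work to a later blocking call.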
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
* [PATCH v7 11/15] fs: Optimization for concurrent file time updates.
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
` (9 preceding siblings ...)
2022-06-01 21:01 ` [PATCH v7 10/15] fs: Add async write file modification handling Stefan Roesch
@ 2022-06-01 21:01 ` Stefan Roesch
2022-06-02 8:59 ` Jan Kara
2022-06-01 21:01 ` [PATCH v7 12/15] io_uring: Add support for async buffered writes Stefan Roesch
` (4 subsequent siblings)
15 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-06-01 21:01 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack, hch, axboe
This introduces the S_PENDING_TIME flag. If an async buffered write
needs to update the time, it cannot be processed in the fast path of
io-uring. When a time update is pending this flag is set for async
buffered writes. Other concurrent async buffered writes for the same
file do not need to wait while this time update is pending.
This reduces the number of async buffered writes that need to get punted
to the io-workers in io-uring.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/inode.c | 11 +++++++++--
include/linux/fs.h | 3 +++
2 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 4503bed063e7..7185d860d423 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2150,10 +2150,17 @@ static int file_modified_flags(struct file *file, int flags)
ret = inode_needs_update_time(inode, &now);
if (ret <= 0)
return ret;
- if (flags & IOCB_NOWAIT)
+ if (flags & IOCB_NOWAIT) {
+ if (IS_PENDING_TIME(inode))
+ return 0;
+
+ inode_set_flags(inode, S_PENDING_TIME, S_PENDING_TIME);
return -EAGAIN;
+ }
- return __file_update_time(file, &now, ret);
+ ret = __file_update_time(file, &now, ret);
+ inode_set_flags(inode, 0, S_PENDING_TIME);
+ return ret;
}
/**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 553e57ec3efa..15f9a7beba55 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2151,6 +2151,8 @@ struct super_operations {
#define S_CASEFOLD (1 << 15) /* Casefolded file */
#define S_VERITY (1 << 16) /* Verity file (using fs/verity/) */
#define S_KERNEL_FILE (1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */
+#define S_PENDING_TIME (1 << 18) /* File update time is pending */
+
/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -2193,6 +2195,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
#define IS_ENCRYPTED(inode) ((inode)->i_flags & S_ENCRYPTED)
#define IS_CASEFOLDED(inode) ((inode)->i_flags & S_CASEFOLD)
#define IS_VERITY(inode) ((inode)->i_flags & S_VERITY)
+#define IS_PENDING_TIME(inode) ((inode)->i_flags & S_PENDING_TIME)
#define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \
(inode)->i_rdev == WHITEOUT_DEV)
--
2.30.2
* Re: [PATCH v7 11/15] fs: Optimization for concurrent file time updates.
2022-06-01 21:01 ` [PATCH v7 11/15] fs: Optimization for concurrent file time updates Stefan Roesch
@ 2022-06-02 8:59 ` Jan Kara
0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2022-06-02 8:59 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack, hch, axboe
On Wed 01-06-22 14:01:37, Stefan Roesch wrote:
> This introduces the S_PENDING_TIME flag. If an async buffered write
> needs to update the time, it cannot be processed in the fast path of
> io-uring. When a time update is pending this flag is set for async
> buffered writes. Other concurrent async buffered writes for the same
> file do not need to wait while this time update is pending.
>
> This reduces the number of async buffered writes that need to get punted
> to the io-workers in io-uring.
>
> Signed-off-by: Stefan Roesch <[email protected]>
Thinking about this, there is a snag with this S_PENDING_TIME scheme. It
can happen that we report a write as completed to userspace before the
timestamps are actually updated. So a following stat(2) can still return the
old timestamp, which might confuse some userspace applications. It might be
even nastier with i_version, which is used by NFS and can thus cause data
consistency issues for NFS.
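
The window can be reproduced in a small userspace model of the time-update
part of file_modified_flags(). This is a sketch under assumptions, not the
kernel code: model_inode, model_file_modified() and MODEL_IOCB_NOWAIT are
hypothetical names, and timestamps are plain integers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for IOCB_NOWAIT; the value is arbitrary. */
#define MODEL_IOCB_NOWAIT 0x1u

struct model_inode {
	long mtime;		/* last recorded modification time */
	bool pending_time;	/* models the S_PENDING_TIME flag */
};

/* Models the decision logic quoted below, time update only. */
static int model_file_modified(struct model_inode *inode, long now,
			       unsigned int flags)
{
	assert(inode != NULL);
	if (inode->mtime == now)
		return 0;		/* nothing to update */

	if (flags & MODEL_IOCB_NOWAIT) {
		if (inode->pending_time)
			return 0;	/* another request will do the update */
		inode->pending_time = true;
		return -EAGAIN;		/* punt this request to a worker */
	}

	inode->mtime = now;		/* blocking path performs the update */
	inode->pending_time = false;
	return 0;
}
```

In the model, a first NOWAIT call returns -EAGAIN and sets the pending
flag; a second NOWAIT call then returns 0, so that write can complete
while mtime still holds the old value until the blocking call runs.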
Honza
> ---
> fs/inode.c | 11 +++++++++--
> include/linux/fs.h | 3 +++
> 2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 4503bed063e7..7185d860d423 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2150,10 +2150,17 @@ static int file_modified_flags(struct file *file, int flags)
> ret = inode_needs_update_time(inode, &now);
> if (ret <= 0)
> return ret;
> - if (flags & IOCB_NOWAIT)
> + if (flags & IOCB_NOWAIT) {
> + if (IS_PENDING_TIME(inode))
> + return 0;
> +
> + inode_set_flags(inode, S_PENDING_TIME, S_PENDING_TIME);
> return -EAGAIN;
> + }
>
> - return __file_update_time(file, &now, ret);
> + ret = __file_update_time(file, &now, ret);
> + inode_set_flags(inode, 0, S_PENDING_TIME);
> + return ret;
> }
>
> /**
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 553e57ec3efa..15f9a7beba55 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2151,6 +2151,8 @@ struct super_operations {
> #define S_CASEFOLD (1 << 15) /* Casefolded file */
> #define S_VERITY (1 << 16) /* Verity file (using fs/verity/) */
> #define S_KERNEL_FILE (1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */
> +#define S_PENDING_TIME (1 << 18) /* File update time is pending */
> +
>
> /*
> * Note that nosuid etc flags are inode-specific: setting some file-system
> @@ -2193,6 +2195,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
> #define IS_ENCRYPTED(inode) ((inode)->i_flags & S_ENCRYPTED)
> #define IS_CASEFOLDED(inode) ((inode)->i_flags & S_CASEFOLD)
> #define IS_VERITY(inode) ((inode)->i_flags & S_VERITY)
> +#define IS_PENDING_TIME(inode) ((inode)->i_flags & S_PENDING_TIME)
>
> #define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \
> (inode)->i_rdev == WHITEOUT_DEV)
> --
> 2.30.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
* [PATCH v7 12/15] io_uring: Add support for async buffered writes
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
` (10 preceding siblings ...)
2022-06-01 21:01 ` [PATCH v7 11/15] fs: Optimization for concurrent file time updates Stefan Roesch
@ 2022-06-01 21:01 ` Stefan Roesch
2022-06-01 21:01 ` [PATCH v7 13/15] io_uring: Add tracepoint for short writes Stefan Roesch
` (3 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-06-01 21:01 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack, hch, axboe
This enables the async buffered writes for the filesystems that support
async buffered writes in io-uring. Buffered writes are enabled for
blocks that are already in the page cache or can be acquired with noio.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/io_uring.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9f1c682d7caf..c0771e215669 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4257,7 +4257,7 @@ static inline int io_iter_do_read(struct io_kiocb *req, struct iov_iter *iter)
return -EINVAL;
}
-static bool need_read_all(struct io_kiocb *req)
+static bool need_complete_io(struct io_kiocb *req)
{
return req->flags & REQ_F_ISREG ||
S_ISBLK(file_inode(req->file)->i_mode);
@@ -4386,7 +4386,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
} else if (ret == -EIOCBQUEUED) {
goto out_free;
} else if (ret == req->cqe.res || ret <= 0 || !force_nonblock ||
- (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) {
+ (req->flags & REQ_F_NOWAIT) || !need_complete_io(req)) {
/* read all, failed, already did sync or don't want to retry */
goto done;
}
@@ -4482,9 +4482,10 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
if (unlikely(!io_file_supports_nowait(req)))
goto copy_iov;
- /* file path doesn't support NOWAIT for non-direct_IO */
- if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT) &&
- (req->flags & REQ_F_ISREG))
+ /* File path supports NOWAIT for non-direct_IO only for block devices. */
+ if (!(kiocb->ki_flags & IOCB_DIRECT) &&
+ !(kiocb->ki_filp->f_mode & FMODE_BUF_WASYNC) &&
+ (req->flags & REQ_F_ISREG))
goto copy_iov;
kiocb->ki_flags |= IOCB_NOWAIT;
@@ -4538,6 +4539,24 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
/* IOPOLL retry should happen for io-wq threads */
if (ret2 == -EAGAIN && (req->ctx->flags & IORING_SETUP_IOPOLL))
goto copy_iov;
+
+ if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)) {
+ struct io_async_rw *rw;
+
+ /* This is a partial write. The file pos has already been
+ * updated, setup the async struct to complete the request
+ * in the worker. Also update bytes_done to account for
+ * the bytes already written.
+ */
+ iov_iter_save_state(&s->iter, &s->iter_state);
+ ret = io_setup_async_rw(req, iovec, s, true);
+
+ rw = req->async_data;
+ if (rw)
+ rw->bytes_done += ret2;
+
+ return ret ? ret : -EAGAIN;
+ }
done:
kiocb_done(req, ret2, issue_flags);
} else {
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
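The two decisions this patch adds to io_write() can be modelled in user space. The sketch below is a simplified illustration, not the kernel code: the model_* names are hypothetical, and the three booleans stand in for IOCB_DIRECT, FMODE_BUF_WASYNC and REQ_F_ISREG, which live in different flag words in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the gate in io_write(): a buffered write to a
 * regular file is punted to a worker (the copy_iov path) unless the file
 * advertises async buffered write support.
 */
bool model_must_punt(bool direct_io, bool buf_wasync, bool is_regular)
{
	return !direct_io && !buf_wasync && is_regular;
}

/* Stand-in for the bytes_done counter kept in struct io_async_rw. */
struct model_async_rw {
	size_t bytes_done;	/* bytes written by earlier attempts */
};

/*
 * Hypothetical model of the short-write branch. 'wanted' is the request
 * length and 'ret2' the result of the non-blocking attempt. A partial,
 * non-negative result on a file that needs complete I/O is recorded in
 * bytes_done and the request is retried (-EAGAIN); anything else is final.
 */
long model_finish_write(struct model_async_rw *rw, long wanted, long ret2,
			bool need_complete_io)
{
	if (ret2 != wanted && ret2 >= 0 && need_complete_io) {
		rw->bytes_done += (size_t)ret2;
		return -EAGAIN;
	}
	return ret2;
}
```

In this model a request for 8192 bytes that writes only 4096 returns -EAGAIN with bytes_done set to 4096, so the worker only has to finish the remaining half.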
* [PATCH v7 13/15] io_uring: Add tracepoint for short writes
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
` (11 preceding siblings ...)
2022-06-01 21:01 ` [PATCH v7 12/15] io_uring: Add support for async buffered writes Stefan Roesch
@ 2022-06-01 21:01 ` Stefan Roesch
2022-06-01 21:01 ` [PATCH v7 14/15] xfs: Specify lockmode when calling xfs_ilock_for_iomap() Stefan Roesch
` (2 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-06-01 21:01 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack, hch, axboe
This adds the io_uring_short_write tracepoint to io_uring. A short write
is issued if not all pages that are required for a write are in the page
cache and the async buffered writes have to return EAGAIN.
Signed-off-by: Stefan Roesch <[email protected]>
---
fs/io_uring.c | 3 +++
include/trace/events/io_uring.h | 25 +++++++++++++++++++++++++
2 files changed, 28 insertions(+)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index c0771e215669..9ab68138f442 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4543,6 +4543,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
if (ret2 != req->cqe.res && ret2 >= 0 && need_complete_io(req)) {
struct io_async_rw *rw;
+ trace_io_uring_short_write(req->ctx, kiocb->ki_pos - ret2,
+ req->cqe.res, ret2);
+
/* This is a partial write. The file pos has already been
* updated, setup the async struct to complete the request
* in the worker. Also update bytes_done to account for
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 66fcc5a1a5b1..25df513660cc 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -600,6 +600,31 @@ TRACE_EVENT(io_uring_cqe_overflow,
__entry->cflags, __entry->ocqe)
);
+TRACE_EVENT(io_uring_short_write,
+
+ TP_PROTO(void *ctx, u64 fpos, u64 wanted, u64 got),
+
+ TP_ARGS(ctx, fpos, wanted, got),
+
+ TP_STRUCT__entry(
+ __field(void *, ctx)
+ __field(u64, fpos)
+ __field(u64, wanted)
+ __field(u64, got)
+ ),
+
+ TP_fast_assign(
+ __entry->ctx = ctx;
+ __entry->fpos = fpos;
+ __entry->wanted = wanted;
+ __entry->got = got;
+ ),
+
+ TP_printk("ring %p, fpos %lld, wanted %lld, got %lld",
+ __entry->ctx, __entry->fpos,
+ __entry->wanted, __entry->got)
+);
+
#endif /* _TRACE_IO_URING_H */
/* This part must be outside protection */
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
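The tracepoint is called after the file position has already advanced, which is why the patch passes kiocb->ki_pos - ret2 as fpos. A minimal user-space sketch of that arithmetic follows; the struct and function names are hypothetical, the ctx pointer is left out, and the fields are printed as unsigned values here since they are declared u64.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space mirror of the tracepoint fields (ctx omitted). */
struct short_write_event {
	uint64_t fpos;		/* file offset where the short write started */
	uint64_t wanted;	/* bytes requested */
	uint64_t got;		/* bytes actually written */
};

/*
 * pos_after is the file position after the partial write, so the start
 * offset is recovered by subtracting the bytes that were written.
 */
struct short_write_event make_short_write_event(uint64_t pos_after,
						uint64_t wanted, uint64_t got)
{
	struct short_write_event ev;

	assert(got <= pos_after);
	ev.fpos = pos_after - got;
	ev.wanted = wanted;
	ev.got = got;
	return ev;
}

/* Formats the event in the same field order as the TP_printk() string. */
int format_short_write(char *buf, size_t len,
		       const struct short_write_event *ev)
{
	return snprintf(buf, len,
			"fpos %" PRIu64 ", wanted %" PRIu64 ", got %" PRIu64,
			ev->fpos, ev->wanted, ev->got);
}
```

For a request of 8192 bytes at offset 4096 that writes only 4096 bytes, the position afterwards is 8192 and the event reports fpos 4096, wanted 8192, got 4096.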
* [PATCH v7 14/15] xfs: Specify lockmode when calling xfs_ilock_for_iomap()
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
` (12 preceding siblings ...)
2022-06-01 21:01 ` [PATCH v7 13/15] io_uring: Add tracepoint for short writes Stefan Roesch
@ 2022-06-01 21:01 ` Stefan Roesch
2022-06-02 16:25 ` Darrick J. Wong
2022-06-01 21:01 ` [PATCH v7 15/15] xfs: Add async buffered write support Stefan Roesch
2022-06-02 8:09 ` [PATCH v7 00/15] io-uring/xfs: support async buffered writes Jens Axboe
15 siblings, 1 reply; 32+ messages in thread
From: Stefan Roesch @ 2022-06-01 21:01 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack, hch, axboe, Christoph Hellwig
This patch changes the helper function xfs_ilock_for_iomap such that the
lock mode must be passed in.
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/xfs/xfs_iomap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 5a393259a3a3..bcf7c3694290 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -664,7 +664,7 @@ xfs_ilock_for_iomap(
unsigned flags,
unsigned *lockmode)
{
- unsigned mode = XFS_ILOCK_SHARED;
+ unsigned int mode = *lockmode;
bool is_write = flags & (IOMAP_WRITE | IOMAP_ZERO);
/*
@@ -742,7 +742,7 @@ xfs_direct_write_iomap_begin(
int nimaps = 1, error = 0;
bool shared = false;
u16 iomap_flags = 0;
- unsigned lockmode;
+ unsigned int lockmode = XFS_ILOCK_SHARED;
ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));
@@ -1172,7 +1172,7 @@ xfs_read_iomap_begin(
xfs_fileoff_t end_fsb = xfs_iomap_end_fsb(mp, offset, length);
int nimaps = 1, error = 0;
bool shared = false;
- unsigned lockmode;
+ unsigned int lockmode = XFS_ILOCK_SHARED;
ASSERT(!(flags & (IOMAP_WRITE | IOMAP_ZERO)));
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v7 14/15] xfs: Specify lockmode when calling xfs_ilock_for_iomap()
2022-06-01 21:01 ` [PATCH v7 14/15] xfs: Specify lockmode when calling xfs_ilock_for_iomap() Stefan Roesch
@ 2022-06-02 16:25 ` Darrick J. Wong
0 siblings, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2022-06-02 16:25 UTC (permalink / raw)
To: Stefan Roesch
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david,
jack, hch, axboe, Christoph Hellwig
On Wed, Jun 01, 2022 at 02:01:40PM -0700, Stefan Roesch wrote:
> This patch changes the helper function xfs_ilock_for_iomap such that the
> lock mode must be passed in.
>
> Signed-off-by: Stefan Roesch <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
LGTM
Reviewed-by: Darrick J. Wong <[email protected]>
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v7 15/15] xfs: Add async buffered write support
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
` (13 preceding siblings ...)
2022-06-01 21:01 ` [PATCH v7 14/15] xfs: Specify lockmode when calling xfs_ilock_for_iomap() Stefan Roesch
@ 2022-06-01 21:01 ` Stefan Roesch
2022-06-02 8:09 ` [PATCH v7 00/15] io-uring/xfs: support async buffered writes Jens Axboe
15 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-06-01 21:01 UTC (permalink / raw)
To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel
Cc: shr, david, jack, hch, axboe, Christoph Hellwig
This adds the async buffered write support to XFS. For async buffered
write requests, the request will return -EAGAIN if the ilock cannot be
obtained immediately.
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/xfs/xfs_file.c | 11 +++++------
fs/xfs/xfs_iomap.c | 5 ++++-
2 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a60632ecc3f0..4d65ff007c7d 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -410,7 +410,7 @@ xfs_file_write_checks(
spin_unlock(&ip->i_flags_lock);
out:
- return file_modified(file);
+ return kiocb_modified(iocb);
}
static int
@@ -700,12 +700,11 @@ xfs_file_buffered_write(
bool cleared_space = false;
unsigned int iolock;
- if (iocb->ki_flags & IOCB_NOWAIT)
- return -EOPNOTSUPP;
-
write_retry:
iolock = XFS_IOLOCK_EXCL;
- xfs_ilock(ip, iolock);
+ ret = xfs_ilock_iocb(iocb, iolock);
+ if (ret)
+ return ret;
ret = xfs_file_write_checks(iocb, from, &iolock);
if (ret)
@@ -1165,7 +1164,7 @@ xfs_file_open(
{
if (xfs_is_shutdown(XFS_M(inode->i_sb)))
return -EIO;
- file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
+ file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC;
return generic_file_open(inode, file);
}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index bcf7c3694290..5d50fed291b4 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -886,6 +886,7 @@ xfs_buffered_write_iomap_begin(
bool eof = false, cow_eof = false, shared = false;
int allocfork = XFS_DATA_FORK;
int error = 0;
+ unsigned int lockmode = XFS_ILOCK_EXCL;
if (xfs_is_shutdown(mp))
return -EIO;
@@ -897,7 +898,9 @@ xfs_buffered_write_iomap_begin(
ASSERT(!XFS_IS_REALTIME_INODE(ip));
- xfs_ilock(ip, XFS_ILOCK_EXCL);
+ error = xfs_ilock_for_iomap(ip, flags, &lockmode);
+ if (error)
+ return error;
if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(&ip->i_df)) ||
XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
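The locking change follows a try-or-bail pattern: a NOWAIT request attempts the lock once and returns -EAGAIN on contention, while other requests block. The sketch below models that pattern in user space with a C11 atomic flag; the model_* names are hypothetical and the spin loop only stands in for a sleeping lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy lock standing in for the XFS inode locks. */
struct model_lock {
	atomic_flag held;
};

/* Single attempt; returns true if the lock was acquired. */
bool model_trylock(struct model_lock *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

void model_unlock(struct model_lock *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Models the NOWAIT-aware acquisition: with nowait set, contention is
 * reported as -EAGAIN instead of waiting; otherwise the caller waits
 * (a busy loop here) until the lock is free.
 */
int model_lock_iocb(struct model_lock *l, bool nowait)
{
	if (nowait)
		return model_trylock(l) ? 0 : -EAGAIN;

	while (!model_trylock(l))
		;	/* blocking caller: retry until the lock is released */
	return 0;
}
```

A caller that receives -EAGAIN from the nowait path has not taken the lock and can hand the request to a context that is allowed to block.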
* Re: [PATCH v7 00/15] io-uring/xfs: support async buffered writes
2022-06-01 21:01 [PATCH v7 00/15] io-uring/xfs: support async buffered writes Stefan Roesch
` (14 preceding siblings ...)
2022-06-01 21:01 ` [PATCH v7 15/15] xfs: Add async buffered write support Stefan Roesch
@ 2022-06-02 8:09 ` Jens Axboe
2022-06-03 2:43 ` Dave Chinner
15 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-06-02 8:09 UTC (permalink / raw)
To: Stefan Roesch, io-uring, kernel-team, linux-mm, linux-xfs,
linux-fsdevel
Cc: david, jack, hch
On 6/1/22 3:01 PM, Stefan Roesch wrote:
> This patch series adds support for async buffered writes when using both
> xfs and io-uring. Currently io-uring only supports buffered writes in the
> slow path, by processing them in the io workers. With this patch series it is
> now possible to support buffered writes in the fast path. To be able to use
> the fast path the required pages must be in the page cache, the required locks
> in xfs can be granted immediately and no additional blocks need to be read
> from disk.
This series looks good to me now, but will need some slight rebasing
since the 5.20 io_uring branch has split up the code a bit. Trivial to
do though, I suspect it'll apply directly if we just change
fs/io_uring.c to io_uring/rw.c instead.
The bigger question is how to stage this, as it's touching a bit of fs,
mm, and io_uring...
--
Jens Axboe
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v7 00/15] io-uring/xfs: support async buffered writes
2022-06-02 8:09 ` [PATCH v7 00/15] io-uring/xfs: support async buffered writes Jens Axboe
@ 2022-06-03 2:43 ` Dave Chinner
2022-06-03 13:04 ` Jens Axboe
0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2022-06-03 2:43 UTC (permalink / raw)
To: Jens Axboe
Cc: Stefan Roesch, io-uring, kernel-team, linux-mm, linux-xfs,
linux-fsdevel, jack, hch
On Thu, Jun 02, 2022 at 02:09:00AM -0600, Jens Axboe wrote:
> On 6/1/22 3:01 PM, Stefan Roesch wrote:
> > This patch series adds support for async buffered writes when using both
> > xfs and io-uring. Currently io-uring only supports buffered writes in the
> > slow path, by processing them in the io workers. With this patch series it is
> > now possible to support buffered writes in the fast path. To be able to use
> > the fast path the required pages must be in the page cache, the required locks
> > in xfs can be granted immediately and no additional blocks need to be read
> > from disk.
>
> This series looks good to me now, but will need some slight rebasing
> since the 5.20 io_uring branch has split up the code a bit. Trivial to
> do though, I suspect it'll apply directly if we just change
> fs/io_uring.c to io_uring/rw.c instead.
>
> The bigger question is how to stage this, as it's touching a bit of fs,
> mm, and io_uring...
What data integrity testing has this had? Has it been run through a
few billion fsx operations with io_uring read/write enabled?
Cheers,
Dave.
--
Dave Chinner
[email protected]
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v7 00/15] io-uring/xfs: support async buffered writes
2022-06-03 2:43 ` Dave Chinner
@ 2022-06-03 13:04 ` Jens Axboe
2022-06-07 16:41 ` Stefan Roesch
0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2022-06-03 13:04 UTC (permalink / raw)
To: Dave Chinner
Cc: Stefan Roesch, io-uring, kernel-team, linux-mm, linux-xfs,
linux-fsdevel, jack, hch
On 6/2/22 8:43 PM, Dave Chinner wrote:
> On Thu, Jun 02, 2022 at 02:09:00AM -0600, Jens Axboe wrote:
>> On 6/1/22 3:01 PM, Stefan Roesch wrote:
>>> This patch series adds support for async buffered writes when using both
>>> xfs and io-uring. Currently io-uring only supports buffered writes in the
>>> slow path, by processing them in the io workers. With this patch series it is
>>> now possible to support buffered writes in the fast path. To be able to use
>>> the fast path the required pages must be in the page cache, the required locks
>>> in xfs can be granted immediately and no additional blocks need to be read
>>> from disk.
>>
>> This series looks good to me now, but will need some slight rebasing
>> since the 5.20 io_uring branch has split up the code a bit. Trivial to
>> do though, I suspect it'll apply directly if we just change
>> fs/io_uring.c to io_uring/rw.c instead.
>>
>> The bigger question is how to stage this, as it's touching a bit of fs,
>> mm, and io_uring...
>
> What data integrity testing has this had? Has it been run through a
> few billion fsx operations with io_uring read/write enabled?
I'll let Stefan expand on this, but I'll just mention what I know - it has
been run via fio at least. Each of the performance tests ran for an hour,
and specific test cases were also written to test the boundary
conditions of which pages of a range were in the page cache, etc. Also with
data verification.
Don't know if fsx specifically has been used.
--
Jens Axboe
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v7 00/15] io-uring/xfs: support async buffered writes
2022-06-03 13:04 ` Jens Axboe
@ 2022-06-07 16:41 ` Stefan Roesch
0 siblings, 0 replies; 32+ messages in thread
From: Stefan Roesch @ 2022-06-07 16:41 UTC (permalink / raw)
To: Jens Axboe, Dave Chinner
Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, jack,
hch
On 6/3/22 6:04 AM, Jens Axboe wrote:
> On 6/2/22 8:43 PM, Dave Chinner wrote:
>> On Thu, Jun 02, 2022 at 02:09:00AM -0600, Jens Axboe wrote:
>>> On 6/1/22 3:01 PM, Stefan Roesch wrote:
>>>> This patch series adds support for async buffered writes when using both
>>>> xfs and io-uring. Currently io-uring only supports buffered writes in the
>>>> slow path, by processing them in the io workers. With this patch series it is
>>>> now possible to support buffered writes in the fast path. To be able to use
>>>> the fast path the required pages must be in the page cache, the required locks
>>>> in xfs can be granted immediately and no additional blocks need to be read
>>>> from disk.
>>>
>>> This series looks good to me now, but will need some slight rebasing
>>> since the 5.20 io_uring branch has split up the code a bit. Trivial to
>>> do though, I suspect it'll apply directly if we just change
>>> fs/io_uring.c to io_uring/rw.c instead.
>>>
>>> The bigger question is how to stage this, as it's touching a bit of fs,
>>> mm, and io_uring...
>>
>> What data integrity testing has this had? Has it been run through a
>> few billion fsx operations with io_uring read/write enabled?
>
> I'll let Stefan expand on this, but I'll just mention what I know - it has
> been run via fio at least. Each of the performance tests ran for an hour,
> and specific test cases were also written to test the boundary
> conditions of which pages of a range were in the page cache, etc. Also with
> data verification.
>
I performed the following tests:
- fio tests with various block sizes and different modes (psync, io_uring, libaio)
- fsx tests with one billion ops
- individual test program
- to test with different block sizes
- test short writes
- test holes
- test without readahead
> Don't know if fsx specifically has been used.
>
^ permalink raw reply [flat|nested] 32+ messages in thread
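As an illustration of the "test holes" and data verification items in the list above, the sketch below writes a known pattern around a hole and reads everything back. It is not the test program used for this series: it uses plain pwrite()/pread() rather than io_uring, and the function name, the /tmp path and the 4096-byte block size are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

enum { BLK = 4096 };	/* assumed block size for the sketch */

/*
 * Writes a pattern to blocks 0 and 2 of a temporary file, leaving block 1
 * as a hole, then checks that the written blocks read back intact and the
 * hole reads back as zeros. Returns 0 on success, -1 on any mismatch or
 * I/O error.
 */
int verify_sparse_roundtrip(void)
{
	char path[] = "/tmp/holetestXXXXXX";
	unsigned char pattern[BLK], buf[BLK];
	int rc = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);	/* file disappears once fd is closed */
	memset(pattern, 0xA5, sizeof(pattern));

	if (pwrite(fd, pattern, BLK, 0) != BLK)
		goto out;
	if (pwrite(fd, pattern, BLK, (off_t)2 * BLK) != BLK)
		goto out;

	for (int blk = 0; blk < 3; blk++) {
		unsigned char expect = (blk == 1) ? 0x00 : 0xA5;

		if (pread(fd, buf, BLK, (off_t)blk * BLK) != BLK)
			goto out;
		for (int i = 0; i < BLK; i++)
			if (buf[i] != expect)
				goto out;
	}
	rc = 0;
out:
	close(fd);
	return rc;
}
```

A test aimed at the async path would issue the same writes through io_uring on an XFS file and could keep the same verification step.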