* [GIT PULL] io_uring changes for 5.9-rc1
@ 2020-08-02 21:41 Jens Axboe
2020-08-03 20:48 ` Linus Torvalds
0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2020-08-02 21:41 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring, [email protected]
[-- Attachment #1: Type: text/plain, Size: 9735 bytes --]
Hi Linus,
Lots of cleanups in here, hardening the code and/or making it easier to
read and fixing bugs, but a core feature/change too: adding support for
real async buffered reads. With the latter in place, we just need
buffered write async support and we're done relying on kthreads for the
fast path. In detail:
- Cleanup how memory accounting is done on ring setup/free (Bijan)
- sq array offset calculation fixup (Dmitry)
- Consistently handle blocking off O_DIRECT submission path (me)
- Support proper async buffered reads, instead of relying on kthread
offload for that. This uses the page waitqueue to drive retries from
task_work, like we handle poll based retry. (me)
- IO completion optimizations (me)
- Fix race with accounting and ring fd install (me)
- Support EPOLLEXCLUSIVE (Jiufei)
- Get rid of the io_kiocb unionizing, made possible by shrinking other
bits (Pavel)
- Completion side cleanups (Pavel)
- Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)
- Request environment grabbing cleanups (Pavel)
- File and socket read/write cleanups (Pavel)
- Improve kiocb_set_rw_flags() (Pavel)
- Tons of fixes and cleanups (Pavel)
- IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)
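The async buffered read item above can be sketched in miniature. The following is a hedged userspace model (every name in it is invented for illustration, none of it is the kernel API): instead of blocking on a locked page, the reader arms a wait entry on the page's waitqueue, and the unlock-time wakeup queues a retry, standing in for task_work-driven resubmission.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct wait_entry {
    void (*func)(struct wait_entry *);
    struct wait_entry *next;
};

struct page_model {
    bool locked;
    struct wait_entry *waiters;   /* simplified page waitqueue */
};

static int retries_queued;        /* stands in for task_work_add() */

static void queue_retry(struct wait_entry *w)
{
    (void)w;
    retries_queued++;             /* the real code re-issues the read here */
}

/* Try a "buffered read": fail fast and register for a wakeup
 * instead of sleeping, mimicking the async retry scheme. */
static bool try_read(struct page_model *p, struct wait_entry *w)
{
    if (!p->locked)
        return true;              /* page ready, read proceeds */
    w->func = queue_retry;
    w->next = p->waiters;
    p->waiters = w;               /* arm the page-waitqueue callback */
    return false;
}

/* Unlock wakes waiters; each callback queues its retry. */
static void unlock_page_model(struct page_model *p)
{
    p->locked = false;
    while (p->waiters) {
        struct wait_entry *w = p->waiters;
        p->waiters = w->next;
        w->func(w);
    }
}
```

The point the pull request makes is visible in the model: the submitting task never sleeps on the page, it only gets a "retry now" signal once the page is unlocked.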
This will throw a few merge conflicts. One is due to the IOCB_NOIO
addition that happened late in 5.8-rc, the other is due to a change in
for-5.9/block. Both are trivial to fixup, I'm attaching my merge
resolution when I pulled it in locally.
Please pull!
The following changes since commit 4ae6dbd683860b9edc254ea8acf5e04b5ae242e5:
io_uring: fix lockup in io_fail_links() (2020-07-24 12:51:33 -0600)
are available in the Git repository at:
git://git.kernel.dk/linux-block.git tags/for-5.9/io_uring-20200802
for you to fetch changes up to fa15bafb71fd7a4d6018dae87cfaf890fd4ab47f:
io_uring: flip if handling after io_setup_async_rw (2020-08-01 11:02:57 -0600)
----------------------------------------------------------------
for-5.9/io_uring-20200802
----------------------------------------------------------------
Bijan Mottahedeh (4):
io_uring: add wrappers for memory accounting
io_uring: rename ctx->account_mem field
io_uring: report pinned memory usage
io_uring: separate reporting of ring pages from registered pages
Dan Carpenter (1):
io_uring: fix a use after free in io_async_task_func()
Dmitry Vyukov (1):
io_uring: fix sq array offset calculation
Jens Axboe (31):
block: provide plug based way of signaling forced no-wait semantics
io_uring: always plug for any number of IOs
io_uring: catch -EIO from buffered issue request failure
io_uring: re-issue block requests that failed because of resources
mm: allow read-ahead with IOCB_NOWAIT set
mm: abstract out wake_page_match() from wake_page_function()
mm: add support for async page locking
mm: support async buffered reads in generic_file_buffered_read()
fs: add FMODE_BUF_RASYNC
block: flag block devices as supporting IOCB_WAITQ
xfs: flag files as supporting buffered async reads
btrfs: flag files as supporting buffered async reads
mm: add kiocb_wait_page_queue_init() helper
io_uring: support true async buffered reads, if file provides it
Merge branch 'async-buffered.8' into for-5.9/io_uring
io_uring: provide generic io_req_complete() helper
io_uring: add 'io_comp_state' to struct io_submit_state
io_uring: pass down completion state on the issue side
io_uring: pass in completion state to appropriate issue side handlers
io_uring: enable READ/WRITE to use deferred completions
io_uring: use task_work for links if possible
Merge branch 'io_uring-5.8' into for-5.9/io_uring
io_uring: clean up io_kill_linked_timeout() locking
Merge branch 'io_uring-5.8' into for-5.9/io_uring
io_uring: abstract out task work running
io_uring: use new io_req_task_work_add() helper throughout
io_uring: only call kfree() for a non-zero pointer
io_uring: get rid of __req_need_defer()
io_uring: remove dead 'ctx' argument and move forward declaration
Merge branch 'io_uring-5.8' into for-5.9/io_uring
io_uring: don't touch 'ctx' after installing file descriptor
Jiufei Xue (2):
io_uring: change the poll type to be 32-bits
io_uring: use EPOLLEXCLUSIVE flag to aoid thundering herd type behavior
Pavel Begunkov (90):
io_uring: remove setting REQ_F_MUST_PUNT in rw
io_uring: remove REQ_F_MUST_PUNT
io_uring: set @poll->file after @poll init
io_uring: kill NULL checks for submit state
io_uring: fix NULL-mm for linked reqs
io-wq: compact io-wq flags numbers
io-wq: return next work from ->do_work() directly
io_uring: fix req->work corruption
io_uring: fix punting req w/o grabbed env
io_uring: fix feeding io-wq with uninit reqs
io_uring: don't mark link's head for_async
io_uring: fix missing io_grab_files()
io_uring: fix refs underflow in io_iopoll_queue()
io_uring: remove inflight batching in free_many()
io_uring: dismantle req early and remove need_iter
io_uring: batch-free linked requests as well
io_uring: cosmetic changes for batch free
io_uring: kill REQ_F_LINK_NEXT
io_uring: clean up req->result setting by rw
io_uring: do task_work_run() during iopoll
io_uring: fix iopoll -EAGAIN handling
io_uring: fix missing wake_up io_rw_reissue()
io_uring: deduplicate freeing linked timeouts
io_uring: replace find_next() out param with ret
io_uring: kill REQ_F_TIMEOUT
io_uring: kill REQ_F_TIMEOUT_NOSEQ
io_uring: fix potential use after free on fallback request free
io_uring: don't pass def into io_req_work_grab_env
io_uring: do init work in grab_env()
io_uring: factor out grab_env() from defer_prep()
io_uring: do grab_env() just before punting
io_uring: don't fail iopoll requeue without ->mm
io_uring: fix NULL mm in io_poll_task_func()
io_uring: simplify io_async_task_func()
io_uring: optimise io_req_find_next() fast check
io_uring: fix missing ->mm on exit
io_uring: fix mis-refcounting linked timeouts
io_uring: keep queue_sqe()'s fail path separately
io_uring: fix lost cqe->flags
io_uring: don't delay iopoll'ed req completion
io_uring: fix stopping iopoll'ing too early
io_uring: briefly loose locks while reaping events
io_uring: partially inline io_iopoll_getevents()
io_uring: remove nr_events arg from iopoll_check()
io_uring: don't burn CPU for iopoll on exit
io_uring: rename sr->msg into umsg
io_uring: use more specific type in rcv/snd msg cp
io_uring: extract io_sendmsg_copy_hdr()
io_uring: replace rw->task_work with rq->task_work
io_uring: simplify io_req_map_rw()
io_uring: add a helper for async rw iovec prep
io_uring: follow **iovec idiom in io_import_iovec
io_uring: share completion list w/ per-op space
io_uring: rename ctx->poll into ctx->iopoll
io_uring: use inflight_entry list for iopoll'ing
io_uring: use completion list for CQ overflow
io_uring: add req->timeout.list
io_uring: remove init for unused list
io_uring: use non-intrusive list for defer
io_uring: remove sequence from io_kiocb
io_uring: place cflags into completion data
io_uring: inline io_req_work_grab_env()
io_uring: remove empty cleanup of OP_OPEN* reqs
io_uring: alloc ->io in io_req_defer_prep()
io_uring/io-wq: move RLIMIT_FSIZE to io-wq
io_uring: simplify file ref tracking in submission state
io_uring: indent left {send,recv}[msg]()
io_uring: remove extra checks in send/recv
io_uring: don't forget cflags in io_recv()
io_uring: free selected-bufs if error'ed
io_uring: move BUFFER_SELECT check into *recv[msg]
io_uring: extract io_put_kbuf() helper
io_uring: don't open-code recv kbuf managment
io_uring: don't miscount pinned memory
io_uring: return locked and pinned page accounting
tasks: add put_task_struct_many()
io_uring: batch put_task_struct()
io_uring: don't do opcode prep twice
io_uring: deduplicate io_grab_files() calls
io_uring: mark ->work uninitialised after cleanup
io_uring: fix missing io_queue_linked_timeout()
io-wq: update hash bits
io_uring: de-unionise io_kiocb
io_uring: deduplicate __io_complete_rw()
io_uring: fix racy overflow count reporting
io_uring: fix stalled deferred requests
io_uring: consolidate *_check_overflow accounting
io_uring: get rid of atomic FAA for cq_timeouts
fs: optimise kiocb_set_rw_flags()
io_uring: flip if handling after io_setup_async_rw
Randy Dunlap (1):
io_uring: fix function args for !CONFIG_NET
Xiaoguang Wang (1):
io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
block/blk-core.c | 6 +
fs/block_dev.c | 2 +-
fs/btrfs/file.c | 2 +-
fs/io-wq.c | 14 +-
fs/io-wq.h | 11 +-
fs/io_uring.c | 2588 +++++++++++++++++++++++------------------
fs/xfs/xfs_file.c | 2 +-
include/linux/blkdev.h | 1 +
include/linux/fs.h | 26 +-
include/linux/pagemap.h | 75 ++
include/linux/sched/task.h | 6 +
include/uapi/linux/io_uring.h | 4 +-
mm/filemap.c | 110 +-
tools/io_uring/liburing.h | 6 +-
14 files changed, 1658 insertions(+), 1195 deletions(-)
--
Jens Axboe
[-- Attachment #2: merge.txt --]
[-- Type: text/plain, Size: 3460 bytes --]
commit 32a5169a5562db6a09a2d85164e0079913ecc227
Merge: 5fb023fb414a fa15bafb71fd
Author: Jens Axboe <[email protected]>
Date: Sun Aug 2 10:43:35 2020 -0600
Merge branch 'for-5.9/io_uring' into test
* for-5.9/io_uring: (127 commits)
io_uring: flip if handling after io_setup_async_rw
fs: optimise kiocb_set_rw_flags()
io_uring: don't touch 'ctx' after installing file descriptor
io_uring: get rid of atomic FAA for cq_timeouts
io_uring: consolidate *_check_overflow accounting
io_uring: fix stalled deferred requests
io_uring: fix racy overflow count reporting
io_uring: deduplicate __io_complete_rw()
io_uring: de-unionise io_kiocb
io-wq: update hash bits
io_uring: fix missing io_queue_linked_timeout()
io_uring: mark ->work uninitialised after cleanup
io_uring: deduplicate io_grab_files() calls
io_uring: don't do opcode prep twice
io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
io_uring: batch put_task_struct()
tasks: add put_task_struct_many()
io_uring: return locked and pinned page accounting
io_uring: don't miscount pinned memory
io_uring: don't open-code recv kbuf managment
...
Signed-off-by: Jens Axboe <[email protected]>
diff --cc block/blk-core.c
index 93104c7470e8,62a4904db921..d9d632639bd1
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@@ -956,13 -952,30 +956,18 @@@ static inline blk_status_t blk_check_zo
return BLK_STS_OK;
}
-static noinline_for_stack bool
-generic_make_request_checks(struct bio *bio)
+static noinline_for_stack bool submit_bio_checks(struct bio *bio)
{
- struct request_queue *q;
- int nr_sectors = bio_sectors(bio);
+ struct request_queue *q = bio->bi_disk->queue;
blk_status_t status = BLK_STS_IOERR;
+ struct blk_plug *plug;
- char b[BDEVNAME_SIZE];
might_sleep();
- q = bio->bi_disk->queue;
- if (unlikely(!q)) {
- printk(KERN_ERR
- "generic_make_request: Trying to access "
- "nonexistent block-device %s (%Lu)\n",
- bio_devname(bio, b), (long long)bio->bi_iter.bi_sector);
- goto end_io;
- }
-
+ plug = blk_mq_plug(q, bio);
+ if (plug && plug->nowait)
+ bio->bi_opf |= REQ_NOWAIT;
+
/*
* For a REQ_NOWAIT based request, return -EOPNOTSUPP
* if queue is not a request based queue.
diff --cc include/linux/fs.h
index 41cd993ec0f6,e535543d31d9..b7f1f1b7d691
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@@ -315,7 -318,8 +318,9 @@@ enum rw_hint
#define IOCB_SYNC (1 << 5)
#define IOCB_WRITE (1 << 6)
#define IOCB_NOWAIT (1 << 7)
+ /* iocb->ki_waitq is valid */
+ #define IOCB_WAITQ (1 << 8)
+#define IOCB_NOIO (1 << 9)
struct kiocb {
struct file *ki_filp;
diff --cc mm/filemap.c
index 385759c4ce4b,a5b1fa8f7ce4..4e39c1f4c7d9
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@@ -2028,8 -2044,6 +2044,8 @@@ find_page
page = find_get_page(mapping, index);
if (!page) {
- if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
++ if (iocb->ki_flags & IOCB_NOIO)
+ goto would_block;
page_cache_sync_readahead(mapping,
ra, filp,
index, last_index - index);
@@@ -2164,7 -2185,7 +2191,7 @@@ page_not_up_to_date_locked
}
readpage:
- if (iocb->ki_flags & IOCB_NOIO) {
- if (iocb->ki_flags & IOCB_NOWAIT) {
++ if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) {
unlock_page(page);
put_page(page);
goto would_block;
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-02 21:41 [GIT PULL] io_uring changes for 5.9-rc1 Jens Axboe
@ 2020-08-03 20:48 ` Linus Torvalds
  2020-08-03 20:53   ` Linus Torvalds
  2020-08-03 22:30   ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2020-08-03 20:48 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, [email protected]

On Sun, Aug 2, 2020 at 2:41 PM Jens Axboe <[email protected]> wrote:
>
> Lots of cleanups in here, hardening the code and/or making it easier to
> read and fixing bugs, but a core feature/change too adding support for
> real async buffered reads. With the latter in place, we just need
> buffered write async support and we're done relying on kthreads for the
> fast path. In detail:

That async buffered reads handling of the page locking flag is a
mess, and I'm really happy I committed my page locking scalability
change early, so that the conflicts there caught it.

Re-using the page bit waitqueue types and exporting them?

That part is fine, I guess, particularly since it came from the
wait_bit_key thing and has a smell of being generic.

Taking a random part of wake_page_function(), and calling it
"wake_page_match()" even though that's not at all what it does?

Not ok.

Adding random kiocb helper functions to a core header file, when they
are only used in one place, and when they only make sense in that one
place?

Not ok.

When the function is called "wake_page_match()", you'd expect it
matches the wake page information, wouldn't it?

Yeah, it did that. And then it also checked whether the bit we're
waiting on had been set again, because everybody ostensibly wanted
that. Except they don't any more, and that's not what the name really
implied anyway.

And kiocb_wait_page_queue_init() has absolutely zero business being in
<linux/filemap.h>. There are absolutely no valid uses of that thing
outside of the one place that calls it.

I tried to fix up the things I could.
That said, like a lot of io_uring code, this is some seriously opaque
code. You say you've done a lot of cleanups, but I'm not convinced
those cleanups are in any way offsetting adding yet another union (how
many bugs did the last one add?) and a magic flag of "use this part of
the union" now.

And I don't know what loads you use for testing that thing, or what
happens when the "lock_page_async()" case actually fails to lock, and
just calls back the io_async_buf_func() wakeup function when the page
has unlocked... That function doesn't actually lock the page either,
but does the task work. I hope that work then knows to do the right
thing, but it's really opaque and hard to follow.

Anyway, I'm not entirely happy with doing these kinds of changes in
the merge resolution, but the alternative was to not do the pull at
all, and require you to do a lot of cleanups before I would pull it.
Maybe I should have done that.

So this is a slightly grumpy email about how io_uring is (a) still
making me very nervous about a very lackadaisical approach to things,
and having the codepaths so obscure that I'm not convinced it's not
horribly buggy. And (b) I fixed things up without really being able to
test them. I tested that the _normal_ paths still seem to work fine,
but..

I really think that whole thing needs a lot of comments, particularly
around the whole io_rw_should_retry() area.

A big and legible comment about how it will be caught by the
generic_file_buffered_read() page locking code, how the two cases
differ (it might get caught by the "I'm just waiting for it to be
unlocked", but it could *also* get caught by the "lock page now"
case), and how it continues the request.

As it is, it bounces between the generic code and very io_uring
specific code in strange and not easy to follow ways.

I've pushed out my merge of this thing, but you might also want to
take a look at commit 2a9127fcf229 ("mm: rewrite
wait_on_page_bit_common() logic").
In particular, the comment about how there's no point in even testing
the page bit any more when you get woken up.

I left that

	if (test_bit(key->bit_nr, &key->page->flags))
		return -1;

logic in io_async_buf_func() (but it's not in "wake_page_match()" any
more), but I suspect it's bogus and pointless, for the same reason
that it isn't done for normal page waits now. Maybe it's better to
just queue the actual work regardless, it will then be caught in the
_real_ lock_page() or whatever it ends up doing - and if it only
really wants to see the "uptodate" bit being set, and was just waiting
for IO to finish, then it never really cared about the page lock bit
at all, it just wanted to be notified about IO being done.

So this was a really long email to tell you - again - that I'm not
happy with how fragile io_uring is, and how the code seems to be
almost intentionally written to *be* fragile. Complex and hard to
understand, and as a result it has had a fairly high rate of fairly
nasty bugs.

I'm hoping this isn't going to be yet another case of "nasty bugs
because of complexity and a total disregard for explaining what is
going on".

            Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread
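The "wake_page_match()" naming complaint above rests on how keyed page wakeups work: many pages hash to one waitqueue, so a wakeup carries a (page, bit) key and every wakeup function must first check that the key is its own before acting, which is the only part the "match" name actually describes. A hedged userspace model of just that matching step (all names here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

struct wait_page_key_model {
    const void *page;
    int bit_nr;
};

struct page_waiter {
    struct wait_page_key_model key;   /* what this waiter sleeps on */
    int woken;
};

/* Returns 0 ("not for me, keep walking") on a key mismatch, the way
 * the real wakeup functions do; acts only on a matching key. */
static int waiter_func(struct page_waiter *w,
                       const struct wait_page_key_model *key)
{
    if (w->key.page != key->page || w->key.bit_nr != key->bit_nr)
        return 0;
    w->woken = 1;
    return 1;
}

/* Wake everyone hashed to this queue for a given (page, bit). */
static int wake_queue(struct page_waiter *waiters, size_t n,
                      const struct wait_page_key_model *key)
{
    int nr = 0;
    for (size_t i = 0; i < n; i++)
        nr += waiter_func(&waiters[i], key) > 0;
    return nr;
}
```

The extra page-bit re-test Linus objects to would live after the key match, which is why he argues it is a separate policy and not part of "matching" at all.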
* Re: [GIT PULL] io_uring changes for 5.9-rc1 2020-08-03 20:48 ` Linus Torvalds @ 2020-08-03 20:53 ` Linus Torvalds 2020-08-03 21:10 ` Konstantin Ryabitsev 2020-08-03 22:30 ` Jens Axboe 1 sibling, 1 reply; 13+ messages in thread From: Linus Torvalds @ 2020-08-03 20:53 UTC (permalink / raw) To: Jens Axboe, Konstantin Ryabitsev; +Cc: io-uring, [email protected] On Mon, Aug 3, 2020 at 1:48 PM Linus Torvalds <[email protected]> wrote: > > I've pushed out my merge of this thing [..] It seems I'm not the only one unhappy with the pull request. For some reason I also don't see pr-tracker-bot being all happy and excited about it. I wonder why. Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 20:53 ` Linus Torvalds
@ 2020-08-03 21:10   ` Konstantin Ryabitsev
  2020-08-03 22:31     ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Konstantin Ryabitsev @ 2020-08-03 21:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jens Axboe, io-uring, [email protected]

On Mon, Aug 03, 2020 at 01:53:12PM -0700, Linus Torvalds wrote:
> On Mon, Aug 3, 2020 at 1:48 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > I've pushed out my merge of this thing [..]
>
> It seems I'm not the only one unhappy with the pull request.
>
> For some reason I also don't see pr-tracker-bot being all happy and
> excited about it. I wonder why.

My guess is it's because the body consists of two text/plain MIME-parts
and Python returned the merge.txt part first, where we didn't find what
we were looking for.

I'll see if I can teach it to walk all text/plain parts looking for
magic git pull strings instead of giving up after the first one.

-K

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 21:10 ` Konstantin Ryabitsev
@ 2020-08-03 22:31   ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2020-08-03 22:31 UTC (permalink / raw)
To: Konstantin Ryabitsev, Linus Torvalds
Cc: io-uring, [email protected]

On 8/3/20 3:10 PM, Konstantin Ryabitsev wrote:
> On Mon, Aug 03, 2020 at 01:53:12PM -0700, Linus Torvalds wrote:
>> On Mon, Aug 3, 2020 at 1:48 PM Linus Torvalds
>> <[email protected]> wrote:
>>>
>>> I've pushed out my merge of this thing [..]
>>
>> It seems I'm not the only one unhappy with the pull request.
>>
>> For some reason I also don't see pr-tracker-bot being all happy and
>> excited about it. I wonder why.
>
> My guess it's because the body consists of two text/plain MIME-parts and
> Python returned the merge.txt part first, where we didn't find what we
> were looking for.
>
> I'll see if I can teach it to walk all text/plain parts looking for
> magic git pull strings instead of giving up after the first one.

Thanks, I was a bit puzzled on that one too, and this time it definitely
wasn't because the tag wasn't there.

In terms of attachments, I'm usually a fan of inlining, but it seemed
cleaner to me to attach the merge resolution as there's already a ton of
other stuff in that email.

--
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 20:48 ` Linus Torvalds
  2020-08-03 20:53   ` Linus Torvalds
@ 2020-08-03 22:30   ` Jens Axboe
  2020-08-03 23:18     ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2020-08-03 22:30 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring, [email protected]

On 8/3/20 2:48 PM, Linus Torvalds wrote:
> On Sun, Aug 2, 2020 at 2:41 PM Jens Axboe <[email protected]> wrote:
>>
>> Lots of cleanups in here, hardening the code and/or making it easier to
>> read and fixing bugs, but a core feature/change too adding support for
>> real async buffered reads. With the latter in place, we just need
>> buffered write async support and we're done relying on kthreads for the
>> fast path. In detail:
>
> That async buffered reads handling of the page locking flag is a
> mess, and I'm really happy I committed my page locking scalability
> change early, so that the conflicts there caught it.
>
> Re-using the page bit waitqueue types and exporting them?
>
> That part is fine, I guess, particularly since it came from the
> wait_bit_key thing and has a smell of being generic.
>
> Taking a random part of wake_page_function(), and calling it
> "wake_page_match()" even though that's not at all what it does?
>
> Not ok.

OK, I actually thought it was kind of nice and better than having that
code duplicated in two spots.

> Adding random kiocb helper functions to a core header file, when they
> are only used in one place, and when they only make sense in that one
> place?
>
> Not ok.

I'll move that into io_uring instead.

> When the function is called "wake_page_match()", you'd expect it
> matches the wake page information, wouldn't it?
>
> Yeah, it did that. And then it also checked whether the bit we're
> waiting on had been set again, because everybody ostensibly wanted
> that. Except they don't any more, and that's not what the name really
> implied anyway.
> And kiocb_wait_page_queue_init() has absolutely zero business being in
> <linux/filemap.h>. There are absolutely no valid uses of that thing
> outside of the one place that calls it.
>
> I tried to fix up the things I could.

Thanks! As mentioned, I'll prep a cleanup patch that moves the
kiocb_wait_page_queue_init() out of there.

> That said, like a lot of io_uring code, this is some seriously opaque
> code. You say you've done a lot of cleanups, but I'm not convinced
> those cleanups are in any way offsetting adding yet another union (how
> many bugs did the last one add?) and a magic flag of "use this part of
> the union" now.

I had to think a bit about what you are referring to here, but I guess
it's the iocb part. And yes, it's not ideal, but until we support polled
IO with buffered IO, then there's no overlap in the use case. And I
don't see us ever doing that.

> And I don't know what loads you use for testing that thing, or what
> happens when the "lock_page_async()" case actually fails to lock, and
> just calls back the io_async_buf_func() wakeup function when the page
> has unlocked...
>
> That function doesn't actually lock the page either, but does the task
> work. I hope that work then knows to do the right thing, but it's
> really opaque and hard to follow.

The task work retries the whole thing, so it'll go through the normal
page cache read path again with all that that entails. We only really
ever use the callback to tell us when it's a good idea to retry again,
there's no other retained state there at all. I didn't realize that part
wasn't straightforward, so I'll add some comments as well explaining how
that code flow works.

It's seen a good amount of testing, both from myself and also from
others. The postgres IO rewrite has been putting it through its paces,
and outside of a few initial issues months ago, it's been rock solid.
> Anyway, I'm not entirely happy with doing these kinds of changes in
> the merge resolution, but the alternative was to not do the pull at
> all, and require you to do a lot of cleanups before I would pull it.
> Maybe I should have done that.
>
> So this is a slightly grumpy email about how io_uring is (a) still
> making me very nervous about a very lackadaisical approach to things,
> and having the codepaths so obscure that I'm not convinced it's not
> horribly buggy. And (b) I fixed things up without really being able to
> test them. I tested that the _normal_ paths still seem to work fine,
> but..

I need to do a better job at commenting these parts, obviously. And
while nothing is perfect, and we're definitely not perfect yet, the
general trend is definitely strongly towards getting rid of odd states
through flags and unifying more of the code, and tons of fixes/cleanups
that make things easier to read and verify...

> I really think that whole thing needs a lot of comments, particularly
> around the whole io_rw_should_retry() area.
>
> A big and legible comment about how it will be caught by the
> generic_file_buffered_read() page locking code, how the two cases
> differ (it might get caught by the "I'm just waiting for it to be
> unlocked", but it could *also* get caught by the "lock page now"
> case), and how it continues the request.

Noted, I'll write that up.

> As it is, it bounces between the generic code and very io_uring
> specific code in strange and not easy to follow ways.
>
> I've pushed out my merge of this thing, but you might also want to
> take a look at commit 2a9127fcf229 ("mm: rewrite
> wait_on_page_bit_common() logic"). In particular, the comment about how
> there's no point in even testing the page bit any more when you get
> woken up.
> I left that
>
>	if (test_bit(key->bit_nr, &key->page->flags))
>		return -1;
>
> logic in io_async_buf_func() (but it's not in "wake_page_match()" any
> more), but I suspect it's bogus and pointless, for the same reason
> that it isn't done for normal page waits now. Maybe it's better to
> just queue the actual work regardless, it will then be caught in the
> _real_ lock_page() or whatever it ends up doing - and if it only
> really wants to see the "uptodate" bit being set, and was just waiting
> for IO to finish, then it never really cared about the page lock bit
> at all, it just wanted to be notified about IO being done.

I just did notice your rewrite commit, and I'll adjust accordingly and
test it with that too.

> So this was a really long email to tell you - again - that I'm not
> happy with how fragile io_uring is, and how the code seems to be
> almost intentionally written to *be* fragile. Complex and hard to
> understand, and as a result it has had a fairly high rate of fairly
> nasty bugs.
>
> I'm hoping this isn't going to be yet another case of "nasty bugs
> because of complexity and a total disregard for explaining what is
> going on".

Outside of the review from Johannes, lots of other people did look over
the async buffered bits, and Andrew as well said it looked good to him.
So while the task_work retry apparently isn't as obvious as I had hoped,
it's definitely not fragile or intentionally trying to be obtuse.

I'll make a few adjustments based on your feedback, and add a patch with
some comments as well. Hopefully that'll make the end result easier to
follow.

--
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 22:30 ` Jens Axboe
@ 2020-08-03 23:18   ` Jens Axboe
  2020-08-03 23:31     ` Jens Axboe
  2020-08-03 23:34     ` Linus Torvalds
  0 siblings, 2 replies; 13+ messages in thread
From: Jens Axboe @ 2020-08-03 23:18 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring, [email protected]

On 8/3/20 4:30 PM, Jens Axboe wrote:
>> Adding random kiocb helper functions to a core header file, when they
>> are only used in one place, and when they only make sense in that one
>> place?
>>
>> Not ok.
>
> I'll move that into io_uring instead.

I see that you handled most of the complaints already, so thanks for
that. I've run some basic testing with master and it works for me,
running some more testing on production too.

I took a look at the rewrite you queued up, and made a matching change
on the io_uring side:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=419ebeb6f2d0d56f6b2844c0f77034d1048e37e9

and also queued a documentation patch for the retry logic and the
callback handler:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=9541a9d4791c2d31ba74b92666edd3f1efd936a8

For the latter, let me know if you're happy with the explanation, or if
you want other parts documented more thoroughly too. I'll make a pass
through the file in any case once I've flushed out the other branches
for this merge window in.

--
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 23:18 ` Jens Axboe
@ 2020-08-03 23:31   ` Jens Axboe
  2020-08-03 23:49     ` Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2020-08-03 23:31 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring, [email protected]

On 8/3/20 5:18 PM, Jens Axboe wrote:
> On 8/3/20 4:30 PM, Jens Axboe wrote:
>>> Adding random kiocb helper functions to a core header file, when they
>>> are only used in one place, and when they only make sense in that one
>>> place?
>>>
>>> Not ok.
>>
>> I'll move that into io_uring instead.
>
> I see that you handled most of the complaints already, so thanks for
> that. I've run some basic testing with master and it works for me,
> running some more testing on production too.
>
> I took a look at the rewrite you queued up, and made a matching change
> on the io_uring side:
>
> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=419ebeb6f2d0d56f6b2844c0f77034d1048e37e9

Updated to honor exclusive return value as well:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=f6fd3784c9f7d3309507fdb6dcc818f54467bf3e

--
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 23:31 ` Jens Axboe
@ 2020-08-03 23:49   ` Linus Torvalds
  2020-08-03 23:56     ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2020-08-03 23:49 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, [email protected]

On Mon, Aug 3, 2020 at 4:31 PM Jens Axboe <[email protected]> wrote:
>
> Updated to honor exclusive return value as well:

See my previous email. You're just adding code that makes no sense,
because your wait entry fundamentally isn't an exclusive one.

So all that code is a no-op and only makes it more confusing to read.

Your wakeup handler has _nothing_ to do with the generic
wake_page_function(). There is _zero_ overlap. Your wakeup handler
gets called only for the wait entries _you_ created.

Trying to use the wakeup logic from wake_page_function() makes no
sense, because the rules for wake_page_function() are entirely
different. Yes, they are called for the same thing (somebody unlocked
a page and is waking up waiters), but it's using a completely
different sleeping logic.

See? When wake_page_function() does that

	wait->flags |= WQ_FLAG_WOKEN;

and does something different (and returns different values) depending
on whether WQ_FLAG_EXCLUSIVE was set, that is all because the
wait_on_page_bit_common() entry set up that wait entry (on its stack)
with those exact rules in mind.

So the wakeup function is 1:1 tied to the code that registers the wait
entry. wait_on_page_bit_common() has one set of rules, that are then
honored by the wakeup function it uses. But those rules have _zero_
impact on your use. You can have - and you *do* have - different sets
of rules.

For example, none of your wakeups are ever exclusive. All you do is
make a work runnable - that doesn't mean that other people shouldn't
do other things when they get a "page was unlocked" wakeup
notification.

Also, for you "list_del_init()" is fine, because you never do the
unlocked "list_empty_careful()" on that wait entry.
All the waitqueue operations run under the queue head lock. So what I think you _should_ do is just something like this: diff --git a/fs/io_uring.c b/fs/io_uring.c index 2a3af95be4ca..1e243f99643b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2965,10 +2965,10 @@ static int io_async_buf_func(struct wait_queue_entry *wait, unsigned mode, if (!wake_page_match(wpq, key)) return 0; - /* Stop waking things up if the page is locked again */ - if (test_bit(key->bit_nr, &key->page->flags)) - return -1; - + /* + * Somebody unlocked the page. Unqueue the wait entry + * and run the task_work + */ list_del_init(&wait->entry); init_task_work(&req->task_work, io_req_task_submit); because that matches what you're actually doing. There's no reason to stop waking up others because the page is locked, because you don't know what others want. And there's never any reason for the exclusive thing, b3ecause none of what you do guarantees that you take exclusive ownership of the page lock. Running the work *may* end up doing a "lock_page()", but you don't actually guarantee that. Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 23:49 ` Linus Torvalds
@ 2020-08-03 23:56 ` Jens Axboe
  2020-08-04  0:11   ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2020-08-03 23:56 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring, [email protected]

On 8/3/20 5:49 PM, Linus Torvalds wrote:
> On Mon, Aug 3, 2020 at 4:31 PM Jens Axboe <[email protected]> wrote:
>>
>> Updated to honor exclusive return value as well:
>
> See my previous email: you're just adding code that makes no sense,
> because your wait entry fundamentally isn't an exclusive one.

Right, I get that now, it's just dead code for my use case. It was sent
out before your previous email.

> So all that code is a no-op and only makes it more confusing to read.
>
> Your wakeup handler has _nothing_ to do with the generic
> wake_page_function(). There is _zero_ overlap. Your wakeup handler
> gets called only for the wait entries _you_ created.
>
> Trying to use the wakeup logic from wake_page_function() makes no
> sense, because the rules for wake_page_function() are entirely
> different. Yes, they are called for the same thing (somebody unlocked
> a page and is waking up waiters), but it's using a completely
> different sleeping logic.
>
> See? When wake_page_function() does that
>
>	wait->flags |= WQ_FLAG_WOKEN;
>
> and does something different (and returns different values) depending
> on whether WQ_FLAG_EXCLUSIVE was set, that is all because
> wait_on_page_bit_common() set up that wait entry (on its stack)
> with those exact rules in mind.
>
> So the wakeup function is 1:1 tied to the code that registers the wait
> entry. wait_on_page_bit_common() has one set of rules, that are then
> honored by the wakeup function it uses. But those rules have _zero_
> impact on your use. You can have - and you *do* have - different sets
> of rules.
>
> For example, none of your wakeups are ever exclusive. All you do is
> make a work runnable - that doesn't mean that other people shouldn't
> do other things when they get a "page was unlocked" wakeup
> notification.
>
> Also, for you "list_del_init()" is fine, because you never do the
> unlocked "list_empty_careful()" on that wait entry. All the waitqueue
> operations run under the queue head lock.
>
> So what I think you _should_ do is just something like this:
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 2a3af95be4ca..1e243f99643b 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2965,10 +2965,10 @@ static int io_async_buf_func(struct
> wait_queue_entry *wait, unsigned mode,
>         if (!wake_page_match(wpq, key))
>                 return 0;
>
> -       /* Stop waking things up if the page is locked again */
> -       if (test_bit(key->bit_nr, &key->page->flags))
> -               return -1;
> -
> +       /*
> +        * Somebody unlocked the page. Unqueue the wait entry
> +        * and run the task_work
> +        */
>         list_del_init(&wait->entry);
>
>         init_task_work(&req->task_work, io_req_task_submit);
>
> because that matches what you're actually doing.
>
> There's no reason to stop waking up others because the page is locked,
> because you don't know what others want.
>
> And there's never any reason for the exclusive thing, because none of
> what you do guarantees that you take exclusive ownership of the page
> lock. Running the work *may* end up doing a "lock_page()", but you
> don't actually guarantee that.

What I ended up with after the last email was just removing the test
bit:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=cbd287c09351f1d3a4b3cb9167a2616a11390d32

and I clarified the comments on the io_async_buf_func() to add more
hints on how everything is triggered instead of just a vague "handler"
reference:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=c1dd91d16246b168b80af9b64c5cc35a66410455

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 23:56 ` Jens Axboe
@ 2020-08-04  0:11 ` Linus Torvalds
  0 siblings, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2020-08-04 0:11 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, [email protected]

On Mon, Aug 3, 2020 at 4:56 PM Jens Axboe <[email protected]> wrote:
>
> What I ended up with after the last email was just removing the test
> bit:
>
> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=cbd287c09351f1d3a4b3cb9167a2616a11390d32
>
> and I clarified the comments on the io_async_buf_func() to add more
> hints on how everything is triggered instead of just a vague "handler"
> reference:
>
> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=c1dd91d16246b168b80af9b64c5cc35a66410455

These both look sensible to me now.

               Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 23:18 ` Jens Axboe
  2020-08-03 23:31   ` Jens Axboe
@ 2020-08-03 23:34 ` Linus Torvalds
  2020-08-03 23:43   ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2020-08-03 23:34 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, [email protected]

On Mon, Aug 3, 2020 at 4:18 PM Jens Axboe <[email protected]> wrote:
>
> I took a look at the rewrite you queued up, and made a matching change
> on the io_uring side:

Oh, no, you made it worse.

Now you're tying your odd wakeup routine to entirely irrelevant things
that can't even happen to you.

That io_async_buf_func() will never be called for any entry that isn't
your own, so testing

	wait->flags & WQ_FLAG_EXCLUSIVE

is completely pointless, because you never set that flag. And
similarly, for you to then do

	wait->flags |= WQ_FLAG_WOKEN;

is equally pointless, because the only thing that cares and looks at
that wait entry is you, and you don't care about the WOKEN flag.

So that patch shows a fundamental misunderstanding of how the
waitqueues actually work.

Which is kind of my _point_. The io_uring code that hooked into the
page wait queues really looks like complete cut-and-paste voodoo
programming.

It needs comments. It's hard to follow. Even somebody like me, who
actually knows how the page wait queues really work, has a really
hard time following how io_uring initializing a wait-queue entry and
pointing to it in the io ctx then interacts with the (later) generic
file reading path, and how it then calls back at unlock time to the
io_uring callback _if_ the page was locked.

And that patch you point to makes me 100% sure you don't quite
understand the code either.

So when you do

	/*
	 * Only test the bit if it's an exclusive wait, as we know the
	 * bit is cleared for non-exclusive waits. Also see mm/filemap.c
	 */
	if ((wait->flags & WQ_FLAG_EXCLUSIVE) &&
	    test_and_set_bit(key->bit_nr, &key->page->flags))
		return -1;

the first test guarantees that the second test is never done, which is
good, because if it *had* been done, you'd have taken the lock and
nothing you have actually expects that.

So the fix is to just remove those lines entirely. If somebody
unlocked the page you care about, and did a wakeup on that page and
bit, then you know you should start the async worker. No amount of
testing bits matters at all.

And similarly, the

	wait->flags |= WQ_FLAG_WOKEN;

is a no-op because nothing tests that WQ_FLAG_WOKEN bit. That wait
entry is _your_ wait entry. It's not the wait entry of some normal
page locker - those use wake_page_function().

Now *if* you had workers that actually expected to be woken up with
the page lock already held, and owning it, then that kind of
WQ_FLAG_EXCLUSIVE and WQ_FLAG_WOKEN logic would be a good idea. But
that's not what you have.

> and also queued a documentation patch for the retry logic and the
> callback handler:
>
> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=9541a9d4791c2d31ba74b92666edd3f1efd936a8

Better. Although I find the first comment a bit misleading.

You say

	/* Invoked from our "page is now unlocked" handler when someone ..

but that's not really the case. The function gets called by whoever
unlocks the page after you've registered that page wait entry through
lock_page_async().

So there's no "our handler" anywhere, which I find misleading and
confusing in the comment.

               Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 23:34 ` Linus Torvalds
@ 2020-08-03 23:43 ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2020-08-03 23:43 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring, [email protected]

On 8/3/20 5:34 PM, Linus Torvalds wrote:
> On Mon, Aug 3, 2020 at 4:18 PM Jens Axboe <[email protected]> wrote:
>>
>> I took a look at the rewrite you queued up, and made a matching change
>> on the io_uring side:
>
> Oh, no, you made it worse.
>
> Now you're tying your odd wakeup routine to entirely irrelevant things
> that can't even happen to you.
>
> That io_async_buf_func() will never be called for any entry that isn't
> your own, so testing
>
>	wait->flags & WQ_FLAG_EXCLUSIVE
>
> is completely pointless, because you never set that flag. And
> similarly, for you to then do
>
>	wait->flags |= WQ_FLAG_WOKEN;
>
> is equally pointless, because the only thing that cares and looks at
> that wait entry is you, and you don't care about the WOKEN flag.
>
> So that patch shows a fundamental misunderstanding of how the
> waitqueues actually work.
>
> Which is kind of my _point_. The io_uring code that hooked into the
> page wait queues really looks like complete cut-and-paste voodoo
> programming.
>
> It needs comments. It's hard to follow. Even somebody like me, who
> actually knows how the page wait queues really work, has a really
> hard time following how io_uring initializing a wait-queue entry and
> pointing to it in the io ctx then interacts with the (later) generic
> file reading path, and how it then calls back at unlock time to the
> io_uring callback _if_ the page was locked.
>
> And that patch you point to makes me 100% sure you don't quite
> understand the code either.
>
> So when you do
>
>	/*
>	 * Only test the bit if it's an exclusive wait, as we know the
>	 * bit is cleared for non-exclusive waits. Also see mm/filemap.c
>	 */
>	if ((wait->flags & WQ_FLAG_EXCLUSIVE) &&
>	    test_and_set_bit(key->bit_nr, &key->page->flags))
>		return -1;
>
> the first test guarantees that the second test is never done, which is
> good, because if it *had* been done, you'd have taken the lock and
> nothing you have actually expects that.
>
> So the fix is to just remove those lines entirely. If somebody
> unlocked the page you care about, and did a wakeup on that page and
> bit, then you know you should start the async worker. No amount of
> testing bits matters at all.
>
> And similarly, the
>
>	wait->flags |= WQ_FLAG_WOKEN;
>
> is a no-op because nothing tests that WQ_FLAG_WOKEN bit. That wait
> entry is _your_ wait entry. It's not the wait entry of some normal
> page locker - those use wake_page_function().
>
> Now *if* you had workers that actually expected to be woken up with
> the page lock already held, and owning it, then that kind of
> WQ_FLAG_EXCLUSIVE and WQ_FLAG_WOKEN logic would be a good idea. But
> that's not what you have.

Yes, looks like I was a bit too trigger happy without grokking the
whole thing, and got it mixed up with the broader, more generic
waitqueue cases. Thanks for clueing me in, I've updated the patch so
the use case is in line with only what io_uring is doing here.

>> and also queued a documentation patch for the retry logic and the
>> callback handler:
>>
>> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=9541a9d4791c2d31ba74b92666edd3f1efd936a8
>
> Better. Although I find the first comment a bit misleading.
>
> You say
>
>	/* Invoked from our "page is now unlocked" handler when someone ..
>
> but that's not really the case. The function gets called by whoever
> unlocks the page after you've registered that page wait entry through
> lock_page_async().
>
> So there's no "our handler" anywhere, which I find misleading and
> confusing in the comment.
The 'handler' refers to the io_uring waitqueue callback, but I should
probably spell that out. I'll adjust it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread
end of thread, other threads: [~2020-08-04  0:12 UTC | newest]

Thread overview: 13+ messages -- links below jump to the message on this page:
2020-08-02 21:41 [GIT PULL] io_uring changes for 5.9-rc1 Jens Axboe
2020-08-03 20:48 ` Linus Torvalds
2020-08-03 20:53   ` Linus Torvalds
2020-08-03 21:10     ` Konstantin Ryabitsev
2020-08-03 22:31       ` Jens Axboe
2020-08-03 22:30   ` Jens Axboe
2020-08-03 23:18     ` Jens Axboe
2020-08-03 23:31       ` Jens Axboe
2020-08-03 23:49         ` Linus Torvalds
2020-08-03 23:56           ` Jens Axboe
2020-08-04  0:11             ` Linus Torvalds
2020-08-03 23:34       ` Linus Torvalds
2020-08-03 23:43         ` Jens Axboe