From: "David Hildenbrand (arm)" <david@kernel.org>
To: Jens Axboe <axboe@kernel.dk>, Gabriel Krisman Bertazi <krisman@suse.de>
Cc: io-uring@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Vlastimil Babka <vbabka@suse.cz>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
linux-mm@kvack.org
Subject: Re: [PATCH 0/2] Introduce IORING_OP_MMAP
Date: Wed, 4 Feb 2026 20:47:52 +0100
Message-ID: <7faa5721-cd73-4140-9d63-fa5a279dbce3@kernel.org>
In-Reply-To: <01839e70-5a71-4969-ad5f-2495754250e1@kernel.dk>
On 2/2/26 15:34, Jens Axboe wrote:
> On 2/2/26 2:02 AM, David Hildenbrand (arm) wrote:
>> On 2/1/26 19:16, Jens Axboe wrote:
>>>
>>> The hard part isn't enabling all syscalls at once, that could be
>>> trivially done with an IORING_OP_SYSCALL and the SQE carries arg0..argN.
>>> And for any nonblocking/simple syscall, that would Just Work.
>>
>> Right, that's what I had in mind.
>>
>>> The
>>> challenge is for syscalls that block - the whole point of io_uring is
>>> that you should be able to do nonblock issues with sane retries. The
>>> futex series I did some time back is a good example of that - you modify
>>> the existing syscall to expose the waitqueue mechanism, which you can
>>> then use to wait in an async way, and get a callback when some action
>>> needs to be taken.
>>>
>>> If you just allow blocking, then you're blocking the entire io_uring
>>> issue pipeline. Which was exactly my main complaint on this patchset,
>>> see the review reply to patch 2.
>>
>> Makes sense. I was wondering whether that could be optimized
>> internally in the stream of IORING_OP_SYSCALL.
>>
>> But likely that would make it more tricky to optimize.
>
> Are we talking generically, or mmap/munmap/mremap?

Well, a bit of both :)

munmap() could be a bit challenging as it downgrades the mmap_lock for
removal of the page tables. So quite a bit of rework would be required
to batch that over multiple operations I suppose.

> You could trivially
> make IORING_OP_SYSCALL available and use it for everything, it'd just
> require basically all of those to be offloaded to io-wq internally in
> io_uring. And that's not a great approach. The fast path for io_uring is
> running the opcode inline, which means that by the time the syscall
> returns, you have also posted the completion. If the operation can't
> complete inline, then the next best thing is to have it be triggered
> when it can complete, and then retry and post the completion. Think of
> reading from a pipe - if the data is there, the read is done inside
> io_uring_enter() when the read is attempted, and we're done. If no data
> is available, the operation is queued. When data becomes available, a
> retry is triggered, data is read, and a completion is posted.
Thanks for the explanation.
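
To make the pipe example concrete, here is a rough userspace analogy. It
is only a sketch and not io_uring's actual code: try_read_inline() and
read_with_retry() are made-up names, the fd is assumed to be O_NONBLOCK,
and poll() merely stands in for the waitqueue callback that triggers the
retry.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical outcome of a single attempt; not an io_uring type. */
enum attempt_result { ATTEMPT_DONE, ATTEMPT_WOULD_BLOCK, ATTEMPT_ERROR };

/*
 * "Inline" attempt: try the read once without blocking.
 * On ATTEMPT_DONE, *out holds the byte count; on ATTEMPT_ERROR, -errno.
 */
static enum attempt_result try_read_inline(int fd, void *buf, size_t len,
                                           ssize_t *out)
{
    ssize_t n = read(fd, buf, len);

    if (n >= 0) {
        *out = n;
        return ATTEMPT_DONE;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return ATTEMPT_WOULD_BLOCK;
    *out = -errno;
    return ATTEMPT_ERROR;
}

/*
 * If the inline attempt cannot complete, wait until data is available
 * and retry. *retries counts how often we had to wait for readiness.
 */
static ssize_t read_with_retry(int fd, void *buf, size_t len, int *retries)
{
    ssize_t res = 0;

    *retries = 0;
    for (;;) {
        enum attempt_result r = try_read_inline(fd, buf, len, &res);
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        if (r != ATTEMPT_WOULD_BLOCK)
            return res; /* completed (or failed) on this attempt */

        /* Stand-in for "a retry is triggered when data becomes available". */
        if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
            return -errno;
        (*retries)++;
    }
}
```

If data is already in the pipe, read_with_retry() returns with *retries
left at 0, which corresponds to the inline fast path; otherwise it waits
for readiness instead of blocking inside read().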
>
> For an old school kind of syscall "do this thing, and just block the
> task until it's done" doesn't work that way at all. Running those in
> io_uring would necessitate punting the operation to io-wq, which are
> helper userspace threads for io_uring. As there's no way of knowing
> whether syscallN will complete fast inline or block for 2 seconds,
> io_uring has no other option than to offload it to io-wq. If it's a 2
> second operation, that's fine, you won't see any difference in the
> application, other than it can now do syscallN async in an efficient
> way. If syscallN would've completed inline in 1 usec, then offloading to
> io-wq is suddenly a big performance problem.
>
>> The patch set says "serving as base for batching
>> multiple mappings in a single operation", and I was wondering why one
>> wouldn't just also batch with mremap/munmap etc. in the future.
>>
>> (BUT I am also skeptical whether holding the mmap lock in write mode
>> longer instead of repeatedly grabbing it, allowing other operations
>> that need it in read mode etc to make progress, is actually
>> preferable)
>
> That's always a trade off - if the frequency is high, then a certain
> level of batching makes sense. The good news is that you get to control
> that, you can just batch more or less.
>
> Outside of mmap locking frequencies, I suspect potentially nicer wins
> might be around TLB flush reductions for this family of operations.

For mremap() and munmap(), yes, just like for MADV_DONTNEED.

For mmap(), maybe, if we use MAP_FIXED, which implies a munmap() IIRC.

But then we are again in "hairy to reasonably batch" territory I think.
These are all extremely involved operations.

Is there any use case for the patch set at hand, in particular, in an
un-optimized form?

--
Cheers,
David