public inbox for io-uring@vger.kernel.org
From: Jens Axboe <axboe@kernel.dk>
To: "David Hildenbrand (arm)" <david@kernel.org>,
	Gabriel Krisman Bertazi <krisman@suse.de>
Cc: io-uring@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	linux-mm@kvack.org
Subject: Re: [PATCH 0/2] Introduce IORING_OP_MMAP
Date: Mon, 2 Feb 2026 07:34:20 -0700
Message-ID: <01839e70-5a71-4969-ad5f-2495754250e1@kernel.dk>
In-Reply-To: <6a351a3a-861a-4b93-8d8a-c0f5b87c258f@kernel.org>

On 2/2/26 2:02 AM, David Hildenbrand (arm) wrote:
> On 2/1/26 19:16, Jens Axboe wrote:
>> On 2/1/26 10:46 AM, David Hildenbrand (arm) wrote:
>>> On 1/29/26 23:11, Gabriel Krisman Bertazi wrote:
>>>> Hi,
>>>>
>>>> There's been a few requests over time for supporting mmap(2) over
>>>> io_uring. The reasons are twofold: 1) serving as base for batching
>>>> multiple mappings in a single operation 2) supporting mmap of fixed
>>>> files.
>>>>
>>>> Since mmap can operate on either anonymous memory or file descriptors,
>>>> patch 1 adds support for optional fds in io_uring commands.  Patch 2
>>>> implements the mmap operation itself.
>>>>
>>>> Note this patchset doesn't do any kind of smarter batching in MM.  While
>>>> we can potentially do some interesting optimizations already, like
>>>> holding the MM write lock instead of reacquiring it for each mapping, I
>>>> wanted to focus on the API discussion first.  This is left as future
>>>> work.
>>>>
>>>> liburing support, including testcases, will be sent shortly to the list,
>>>> but can also be found at:
>>>
>>> Just a general question: why do we unlock each syscall individually,
>>> and not in some intelligent way, all syscalls at once? :)
>>
>> The hard part isn't enabling all syscalls at once, that could be
>> trivially done with an IORING_OP_SYSCALL and the SQE carries arg0..argN.
>> And for any nonblocking/simple syscall, that would Just Work. 
> 
> Right, that's what I had in mind.
> 
>> The
>> challenge is for syscalls that block - the whole point of io_uring is
>> that you should be able to do nonblock issues with sane retries. The
>> futex series I did some time back is a good example of that - you modify
>> the existing syscall to expose the waitqueue mechanism, which you can
>> then use to wait in an async way, and get a callback when some action
>> needs to be taken.
>>
>> If you just allow blocking, then you're blocking the entire io_uring
>> issue pipeline. Which was exactly my main complaint on this patchset,
>> see the review reply to patch 2.
> 
> Makes sense. I was wondering whether that could be optimized
> internally in the stream of IORING_OP_SYSCALL.
> 
> But likely that would make it more tricky to optimize.

Are we talking generically, or mmap/munmap/mremap? You could trivially
make IORING_OP_SYSCALL available and use it for everything, it'd just
require basically all of those to be offloaded to io-wq internally in
io_uring. And that's not a great approach. The fast path for io_uring is
running the opcode inline, which means that by the time the syscall
returns, you have also posted the completion. If the operation can't
complete inline, then the next best thing is to have it be triggered
when it can complete, and then retry and post the completion. Think of
reading from a pipe - if the data is there, the read is done inside
io_uring_enter() when the read is attempted, and we're done. If no data
is available, the operation is queued. When data becomes available, a
retry is triggered, data is read, and a completion is posted.
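
To make the pipe example concrete, here is a minimal userspace sketch of
the "attempt inline, otherwise wait for readiness and retry" pattern. It
is only an analogy: it uses plain read(2) and poll(2) on a nonblocking
pipe rather than io_uring itself, and poll() stands in for the kernel-side
wakeup that triggers the retry. The names try_read(), wait_and_retry() and
make_nonblocking_pipe() are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Outcome of a single nonblocking attempt. */
enum attempt { DONE_INLINE, MUST_WAIT, FAILED };

/* Fast path: try the read once and never block. */
static enum attempt try_read(int fd, char *buf, size_t len, ssize_t *res)
{
	assert(buf != NULL && len > 0);
	*res = read(fd, buf, len);
	if (*res >= 0)
		return DONE_INLINE;	/* data (or EOF) was available right away */
	if (errno == EAGAIN || errno == EWOULDBLOCK)
		return MUST_WAIT;	/* nothing there yet, caller must wait */
	return FAILED;
}

/* Slow path: wait until the fd is readable, then retry the read. */
static ssize_t wait_and_retry(int fd, char *buf, size_t len)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	ssize_t res;

	for (;;) {
		if (poll(&pfd, 1, -1) < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		enum attempt a = try_read(fd, buf, len, &res);
		if (a == MUST_WAIT)
			continue;	/* spurious wakeup, wait again */
		return a == DONE_INLINE ? res : -1;
	}
}

/* Helper: create a pipe with O_NONBLOCK set on both ends. */
static int make_nonblocking_pipe(int fds[2])
{
	if (pipe(fds) < 0)
		return -1;
	for (int i = 0; i < 2; i++) {
		int flags = fcntl(fds[i], F_GETFL);
		if (flags < 0 || fcntl(fds[i], F_SETFL, flags | O_NONBLOCK) < 0)
			return -1;
	}
	return 0;
}
```

The difference from this sketch is that io_uring does not park the
submitting task in poll(): the wait is armed asynchronously and the
completion is posted once the retry succeeds.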

An old school kind of syscall, "do this thing, and just block the task
until it's done", doesn't work that way at all. Running those in
io_uring would necessitate punting the operation to io-wq, which are
helper userspace threads for io_uring. As there's no way of knowing
whether syscallN will complete fast inline or block for 2 seconds,
io_uring has no other option than to offload it to io-wq. If it's a 2
second operation, that's fine, you won't see any difference in the
application, other than it can now do syscallN async in an efficient
way. If syscallN would've completed inline in 1 usec, then offloading to
io-wq is suddenly a big performance problem.
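
As a rough illustration of that trade-off, the sketch below runs the same
operation either inline or by handing it to a helper thread. It is a
userspace analogy built on pthreads, not how io-wq is implemented: io-wq
keeps a pool of workers and the submitter does not wait for them, whereas
this sketch creates one thread per call and joins it immediately so the
result can be compared. op_fn, run_inline(), run_offloaded() and
cheap_op() are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* An opaque operation, standing in for "syscallN". */
typedef long (*op_fn)(void *arg);

struct offload {
	op_fn fn;
	void *arg;
	long result;
};

/* Runs on the helper thread: execute the operation and store its result. */
static void *offload_trampoline(void *p)
{
	struct offload *o = p;

	o->result = o->fn(o->arg);
	return NULL;
}

/* Inline: the operation runs on the caller's own stack, no handoff. */
static long run_inline(op_fn fn, void *arg)
{
	return fn(arg);
}

/*
 * Offloaded: the operation is handed to another thread. Joining right
 * away keeps the sketch simple; it also means the caller pays the full
 * handoff cost even when the operation itself is trivial.
 */
static int run_offloaded(op_fn fn, void *arg, long *result)
{
	struct offload o = { .fn = fn, .arg = arg, .result = 0 };
	pthread_t worker;
	int err;

	assert(fn != NULL && result != NULL);
	err = pthread_create(&worker, NULL, offload_trampoline, &o);
	if (err)
		return err;
	err = pthread_join(worker, NULL);
	if (err)
		return err;
	*result = o.result;
	return 0;
}

/* A stand-in for an operation that would have completed inline quickly. */
static long cheap_op(void *arg)
{
	return *(long *)arg + 1;
}
```

Both paths return the same value for cheap_op(); the offloaded path just
adds a thread handoff on top, and that overhead is what dominates when
the operation itself would have finished in about a microsecond.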

> The patch set says "serving as base for batching multiple mappings in a
> single operation", and I was wondering why one wouldn't just also batch
> with mremap/munmap etc. in the future.
> 
> (BUT I am also skeptical whether holding the mmap lock in write mode
> longer instead of repeatedly grabbing it, allowing other operations
> that need it in read mode etc to make progress, is actually
> preferable)

That's always a trade off - if the frequency is high, then a certain
level of batching makes sense. The good news is that you get to control
that, you can just batch more or less.

Outside of mmap locking frequencies, I suspect potentially nicer wins
might be around TLB flush reductions for this family of operations.

-- 
Jens Axboe

Thread overview: 10+ messages
2026-01-29 22:11 [PATCH 0/2] Introduce IORING_OP_MMAP Gabriel Krisman Bertazi
2026-01-29 22:11 ` [PATCH 1/2] io_uring: Support commands with optional file descriptors Gabriel Krisman Bertazi
2026-01-29 22:11 ` [PATCH 2/2] io_uring: introduce IORING_OP_MMAP Gabriel Krisman Bertazi
2026-01-30  6:03   ` kernel test robot
2026-01-30 15:47     ` Gabriel Krisman Bertazi
2026-01-30 15:55   ` Jens Axboe
2026-02-01 17:46 ` [PATCH 0/2] Introduce IORING_OP_MMAP David Hildenbrand (arm)
2026-02-01 18:16   ` Jens Axboe
2026-02-02  9:02     ` David Hildenbrand (arm)
2026-02-02 14:34       ` Jens Axboe [this message]
