bcachefs: suspicious mm pointer in struct dio

public inbox for [email protected]
 help / color / mirror / Atom feed

* bcachefs: suspicious mm pointer in struct dio_write
@ 2024-11-27 16:57 Jann Horn
  2024-11-27 18:09 ` Jens Axboe
  2024-11-27 20:23 ` Kent Overstreet
  0 siblings, 2 replies; 17+ messages in thread
From: Jann Horn @ 2024-11-27 16:57 UTC (permalink / raw)
  To: Kent Overstreet, linux-bcachefs
  Cc: kernel list, Jens Axboe, Pavel Begunkov, io-uring

Hi!

In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer
to an mm_struct. This pointer is grabbed in bch2_direct_write()
(without any kind of refcount increment), and used in
bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm()
which are used to enable userspace memory access from kthread context.
I believe kthread_use_mm()/kthread_unuse_mm() require that the caller
guarantees that the MM hasn't gone through exit_mmap() yet (normally
by holding an mmget() reference).

If we reach this codepath via io_uring, do we have a guarantee that
the mm_struct that called bch2_direct_write() is still alive and
hasn't yet gone through exit_mmap() when it is accessed from
bch2_dio_write_continue()?

I don't know the async direct I/O codepath particularly well, so I
cc'ed the uring maintainers, who probably know this better than me.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 16:57 bcachefs: suspicious mm pointer in struct dio_write Jann Horn
@ 2024-11-27 18:09 ` Jens Axboe
  2024-11-27 19:43   ` Jann Horn
  2024-11-27 20:25   ` Kent Overstreet
  2024-11-27 20:23 ` Kent Overstreet
  1 sibling, 2 replies; 17+ messages in thread
From: Jens Axboe @ 2024-11-27 18:09 UTC (permalink / raw)
  To: Jann Horn, Kent Overstreet, linux-bcachefs
  Cc: kernel list, Pavel Begunkov, io-uring

On 11/27/24 9:57 AM, Jann Horn wrote:
> Hi!
> 
> In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer
> to an mm_struct. This pointer is grabbed in bch2_direct_write()
> (without any kind of refcount increment), and used in
> bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm()
> which are used to enable userspace memory access from kthread context.
> I believe kthread_use_mm()/kthread_unuse_mm() require that the caller
> guarantees that the MM hasn't gone through exit_mmap() yet (normally
> by holding an mmget() reference).
> 
> If we reach this codepath via io_uring, do we have a guarantee that
> the mm_struct that called bch2_direct_write() is still alive and
> hasn't yet gone through exit_mmap() when it is accessed from
> bch2_dio_write_continue()?
> 
> I don't know the async direct I/O codepath particularly well, so I
> cc'ed the uring maintainers, who probably know this better than me.

I _think_ this is fine as-is, even if it does look dubious and bcachefs
arguably should grab an mm ref for this just for safety to avoid future
problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which
means that on the io_uring side it cannot do non-blocking issue of
requests. This is slower as it always punts to an io-wq thread, which
shares the same mm. Hence if the request is alive, there's always a
thread with the same mm alive as well.

Now if FMODE_NOWAIT was set, then the original task could exit. I'd need
to dig a bit deeper to verify that would always be safe and there's not
a of time today with a few days off in the US looming, so I'll defer
that to next week. It certainly would be fine with an mm ref grabbed.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 18:09 ` Jens Axboe
@ 2024-11-27 19:43   ` Jann Horn
  2024-11-27 20:01     ` Jens Axboe
  2024-11-27 20:25   ` Kent Overstreet
  1 sibling, 1 reply; 17+ messages in thread
From: Jann Horn @ 2024-11-27 19:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Kent Overstreet, linux-bcachefs, kernel list, Pavel Begunkov,
	io-uring

On Wed, Nov 27, 2024 at 7:09 PM Jens Axboe <[email protected]> wrote:
> On 11/27/24 9:57 AM, Jann Horn wrote:
> > Hi!
> >
> > In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer
> > to an mm_struct. This pointer is grabbed in bch2_direct_write()
> > (without any kind of refcount increment), and used in
> > bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm()
> > which are used to enable userspace memory access from kthread context.
> > I believe kthread_use_mm()/kthread_unuse_mm() require that the caller
> > guarantees that the MM hasn't gone through exit_mmap() yet (normally
> > by holding an mmget() reference).
> >
> > If we reach this codepath via io_uring, do we have a guarantee that
> > the mm_struct that called bch2_direct_write() is still alive and
> > hasn't yet gone through exit_mmap() when it is accessed from
> > bch2_dio_write_continue()?
> >
> > I don't know the async direct I/O codepath particularly well, so I
> > cc'ed the uring maintainers, who probably know this better than me.
>
> I _think_ this is fine as-is, even if it does look dubious and bcachefs
> arguably should grab an mm ref for this just for safety to avoid future
> problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which
> means that on the io_uring side it cannot do non-blocking issue of
> requests. This is slower as it always punts to an io-wq thread, which
> shares the same mm. Hence if the request is alive, there's always a
> thread with the same mm alive as well.
>
> Now if FMODE_NOWAIT was set, then the original task could exit. I'd need
> to dig a bit deeper to verify that would always be safe and there's not
> a of time today with a few days off in the US looming, so I'll defer
> that to next week. It certainly would be fine with an mm ref grabbed.

Ah, thanks for looking into it! I missed this implication of not
setting FMODE_NOWAIT.

Anyway, what you said sounds like it would be cleaner for bcachefs to
grab its own extra reference, maybe by initially grabbing an mm
reference with mmgrab() in bch2_direct_write(), and then use
mmget_not_zero() in bch2_dio_write_continue() to ensure the MM is
stable.

What do other file systems do for this? I think they normally grab
page references so that they don't need the MM anymore when
asynchronously fulfilling the request, right? Like in
iomap_dio_bio_iter(), which uses bio_iov_iter_get_pages() to grab
references to the pages corresponding to the userspace regions in
dio->submit.iter?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 19:43   ` Jann Horn
@ 2024-11-27 20:01     ` Jens Axboe
  2024-11-27 20:31       ` Kent Overstreet
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2024-11-27 20:01 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kent Overstreet, linux-bcachefs, kernel list, Pavel Begunkov,
	io-uring

On 11/27/24 12:43 PM, Jann Horn wrote:
> On Wed, Nov 27, 2024 at 7:09?PM Jens Axboe <[email protected]> wrote:
>> On 11/27/24 9:57 AM, Jann Horn wrote:
>>> Hi!
>>>
>>> In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer
>>> to an mm_struct. This pointer is grabbed in bch2_direct_write()
>>> (without any kind of refcount increment), and used in
>>> bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm()
>>> which are used to enable userspace memory access from kthread context.
>>> I believe kthread_use_mm()/kthread_unuse_mm() require that the caller
>>> guarantees that the MM hasn't gone through exit_mmap() yet (normally
>>> by holding an mmget() reference).
>>>
>>> If we reach this codepath via io_uring, do we have a guarantee that
>>> the mm_struct that called bch2_direct_write() is still alive and
>>> hasn't yet gone through exit_mmap() when it is accessed from
>>> bch2_dio_write_continue()?
>>>
>>> I don't know the async direct I/O codepath particularly well, so I
>>> cc'ed the uring maintainers, who probably know this better than me.
>>
>> I _think_ this is fine as-is, even if it does look dubious and bcachefs
>> arguably should grab an mm ref for this just for safety to avoid future
>> problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which
>> means that on the io_uring side it cannot do non-blocking issue of
>> requests. This is slower as it always punts to an io-wq thread, which
>> shares the same mm. Hence if the request is alive, there's always a
>> thread with the same mm alive as well.
>>
>> Now if FMODE_NOWAIT was set, then the original task could exit. I'd need
>> to dig a bit deeper to verify that would always be safe and there's not
>> a of time today with a few days off in the US looming, so I'll defer
>> that to next week. It certainly would be fine with an mm ref grabbed.
> 
> Ah, thanks for looking into it! I missed this implication of not
> setting FMODE_NOWAIT.
> 
> Anyway, what you said sounds like it would be cleaner for bcachefs to
> grab its own extra reference, maybe by initially grabbing an mm
> reference with mmgrab() in bch2_direct_write(), and then use
> mmget_not_zero() in bch2_dio_write_continue() to ensure the MM is
> stable.

Yep I think that would definitely make it more sturdy, and also less
headscratchy in terms of being able to verify it's actually safe.

> What do other file systems do for this? I think they normally grab
> page references so that they don't need the MM anymore when
> asynchronously fulfilling the request, right? Like in
> iomap_dio_bio_iter(), which uses bio_iov_iter_get_pages() to grab
> references to the pages corresponding to the userspace regions in
> dio->submit.iter?

Not aware of anything else doing it like this, where it's punted to a
kthread and then the mm used from there. The upfront page
getting/mapping is the common approach, like you described. Which does
seem like a much better choice, rather than needing to rely on the mm in
a kworker.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 20:01     ` Jens Axboe
@ 2024-11-27 20:31       ` Kent Overstreet
  0 siblings, 0 replies; 17+ messages in thread
From: Kent Overstreet @ 2024-11-27 20:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jann Horn, linux-bcachefs, kernel list, Pavel Begunkov, io-uring

On Wed, Nov 27, 2024 at 01:01:31PM -0700, Jens Axboe wrote:
> On 11/27/24 12:43 PM, Jann Horn wrote:
> > On Wed, Nov 27, 2024 at 7:09?PM Jens Axboe <[email protected]> wrote:
> >> On 11/27/24 9:57 AM, Jann Horn wrote:
> >>> Hi!
> >>>
> >>> In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer
> >>> to an mm_struct. This pointer is grabbed in bch2_direct_write()
> >>> (without any kind of refcount increment), and used in
> >>> bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm()
> >>> which are used to enable userspace memory access from kthread context.
> >>> I believe kthread_use_mm()/kthread_unuse_mm() require that the caller
> >>> guarantees that the MM hasn't gone through exit_mmap() yet (normally
> >>> by holding an mmget() reference).
> >>>
> >>> If we reach this codepath via io_uring, do we have a guarantee that
> >>> the mm_struct that called bch2_direct_write() is still alive and
> >>> hasn't yet gone through exit_mmap() when it is accessed from
> >>> bch2_dio_write_continue()?
> >>>
> >>> I don't know the async direct I/O codepath particularly well, so I
> >>> cc'ed the uring maintainers, who probably know this better than me.
> >>
> >> I _think_ this is fine as-is, even if it does look dubious and bcachefs
> >> arguably should grab an mm ref for this just for safety to avoid future
> >> problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which
> >> means that on the io_uring side it cannot do non-blocking issue of
> >> requests. This is slower as it always punts to an io-wq thread, which
> >> shares the same mm. Hence if the request is alive, there's always a
> >> thread with the same mm alive as well.
> >>
> >> Now if FMODE_NOWAIT was set, then the original task could exit. I'd need
> >> to dig a bit deeper to verify that would always be safe and there's not
> >> a of time today with a few days off in the US looming, so I'll defer
> >> that to next week. It certainly would be fine with an mm ref grabbed.
> > 
> > Ah, thanks for looking into it! I missed this implication of not
> > setting FMODE_NOWAIT.
> > 
> > Anyway, what you said sounds like it would be cleaner for bcachefs to
> > grab its own extra reference, maybe by initially grabbing an mm
> > reference with mmgrab() in bch2_direct_write(), and then use
> > mmget_not_zero() in bch2_dio_write_continue() to ensure the MM is
> > stable.
> 
> Yep I think that would definitely make it more sturdy, and also less
> headscratchy in terms of being able to verify it's actually safe.
> 
> > What do other file systems do for this? I think they normally grab
> > page references so that they don't need the MM anymore when
> > asynchronously fulfilling the request, right? Like in
> > iomap_dio_bio_iter(), which uses bio_iov_iter_get_pages() to grab
> > references to the pages corresponding to the userspace regions in
> > dio->submit.iter?
> 
> Not aware of anything else doing it like this, where it's punted to a
> kthread and then the mm used from there. The upfront page
> getting/mapping is the common approach, like you described. Which does
> seem like a much better choice, rather than needing to rely on the mm in
> a kworker.

More common, but not necessarily better: the "pin everything up front"
approach had the disadvantage that - well, you pinned everything up
front: if userspace requests a single multi-gigabyte IO, that is likely
not what you want.

The old dio code didn't pin everything up front for this reason (but
IIRC wasn't fully asynchronous, and also had a silly baked in 64 page
limit).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 18:09 ` Jens Axboe
  2024-11-27 19:43   ` Jann Horn
@ 2024-11-27 20:25   ` Kent Overstreet
  2024-11-27 20:44     ` Jann Horn
  1 sibling, 1 reply; 17+ messages in thread
From: Kent Overstreet @ 2024-11-27 20:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jann Horn, linux-bcachefs, kernel list, Pavel Begunkov, io-uring

On Wed, Nov 27, 2024 at 11:09:14AM -0700, Jens Axboe wrote:
> On 11/27/24 9:57 AM, Jann Horn wrote:
> > Hi!
> > 
> > In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer
> > to an mm_struct. This pointer is grabbed in bch2_direct_write()
> > (without any kind of refcount increment), and used in
> > bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm()
> > which are used to enable userspace memory access from kthread context.
> > I believe kthread_use_mm()/kthread_unuse_mm() require that the caller
> > guarantees that the MM hasn't gone through exit_mmap() yet (normally
> > by holding an mmget() reference).
> > 
> > If we reach this codepath via io_uring, do we have a guarantee that
> > the mm_struct that called bch2_direct_write() is still alive and
> > hasn't yet gone through exit_mmap() when it is accessed from
> > bch2_dio_write_continue()?
> > 
> > I don't know the async direct I/O codepath particularly well, so I
> > cc'ed the uring maintainers, who probably know this better than me.
> 
> I _think_ this is fine as-is, even if it does look dubious and bcachefs
> arguably should grab an mm ref for this just for safety to avoid future
> problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which
> means that on the io_uring side it cannot do non-blocking issue of
> requests. This is slower as it always punts to an io-wq thread, which
> shares the same mm. Hence if the request is alive, there's always a
> thread with the same mm alive as well.
> 
> Now if FMODE_NOWAIT was set, then the original task could exit. I'd need
> to dig a bit deeper to verify that would always be safe and there's not
> a of time today with a few days off in the US looming, so I'll defer
> that to next week. It certainly would be fine with an mm ref grabbed.

Wouldn't delivery of completions be tied to an address space (not a
process) like it is for aio?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 20:25   ` Kent Overstreet
@ 2024-11-27 20:44     ` Jann Horn
  2024-11-27 21:08       ` Kent Overstreet
  0 siblings, 1 reply; 17+ messages in thread
From: Jann Horn @ 2024-11-27 20:44 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, linux-bcachefs, kernel list, Pavel Begunkov, io-uring

On Wed, Nov 27, 2024 at 9:25 PM Kent Overstreet
<[email protected]> wrote:
> On Wed, Nov 27, 2024 at 11:09:14AM -0700, Jens Axboe wrote:
> > On 11/27/24 9:57 AM, Jann Horn wrote:
> > > Hi!
> > >
> > > In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer
> > > to an mm_struct. This pointer is grabbed in bch2_direct_write()
> > > (without any kind of refcount increment), and used in
> > > bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm()
> > > which are used to enable userspace memory access from kthread context.
> > > I believe kthread_use_mm()/kthread_unuse_mm() require that the caller
> > > guarantees that the MM hasn't gone through exit_mmap() yet (normally
> > > by holding an mmget() reference).
> > >
> > > If we reach this codepath via io_uring, do we have a guarantee that
> > > the mm_struct that called bch2_direct_write() is still alive and
> > > hasn't yet gone through exit_mmap() when it is accessed from
> > > bch2_dio_write_continue()?
> > >
> > > I don't know the async direct I/O codepath particularly well, so I
> > > cc'ed the uring maintainers, who probably know this better than me.
> >
> > I _think_ this is fine as-is, even if it does look dubious and bcachefs
> > arguably should grab an mm ref for this just for safety to avoid future
> > problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which
> > means that on the io_uring side it cannot do non-blocking issue of
> > requests. This is slower as it always punts to an io-wq thread, which
> > shares the same mm. Hence if the request is alive, there's always a
> > thread with the same mm alive as well.
> >
> > Now if FMODE_NOWAIT was set, then the original task could exit. I'd need
> > to dig a bit deeper to verify that would always be safe and there's not
> > a of time today with a few days off in the US looming, so I'll defer
> > that to next week. It certainly would be fine with an mm ref grabbed.
>
> Wouldn't delivery of completions be tied to an address space (not a
> process) like it is for aio?

An io_uring instance is primarily exposed to userspace as a file
descriptor, so AFAIK it is possible for the io_uring instance to live
beyond when the last mmput() happens. io_uring initially only holds an
mmgrab() reference on the MM (a comment above that explains: "This is
just grabbed for accounting purposes"), which I think is not enough to
make it stable enough for kthread_use_mm(); I think in io_uring, only
the worker threads actually keep the MM fully alive (and AFAIK the
uring worker threads can exit before the uring instance itself is torn
down).

To receive io_uring completions, there are multiple ways, but they
don't use a pointer from the io_uring instance to the MM to access
userspace memory. Instead, you can have a VMA that points to the
io_uring instance, created by calling mmap() on the io_uring fd; or
alternatively, with IORING_SETUP_NO_MMAP, you can have io_uring grab
references to userspace-provided pages.

On top of that, I think it might currently be possible to use the
io_uring file descriptor from another task to submit work. (That would
probably be fairly nonsensical, but I think the kernel does not
currently prevent it.)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 20:44     ` Jann Horn
@ 2024-11-27 21:08       ` Kent Overstreet
  2024-11-27 21:16         ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Kent Overstreet @ 2024-11-27 21:08 UTC (permalink / raw)
  To: Jann Horn
  Cc: Jens Axboe, linux-bcachefs, kernel list, Pavel Begunkov, io-uring

On Wed, Nov 27, 2024 at 09:44:21PM +0100, Jann Horn wrote:
> On Wed, Nov 27, 2024 at 9:25 PM Kent Overstreet
> <[email protected]> wrote:
> > On Wed, Nov 27, 2024 at 11:09:14AM -0700, Jens Axboe wrote:
> > > On 11/27/24 9:57 AM, Jann Horn wrote:
> > > > Hi!
> > > >
> > > > In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer
> > > > to an mm_struct. This pointer is grabbed in bch2_direct_write()
> > > > (without any kind of refcount increment), and used in
> > > > bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm()
> > > > which are used to enable userspace memory access from kthread context.
> > > > I believe kthread_use_mm()/kthread_unuse_mm() require that the caller
> > > > guarantees that the MM hasn't gone through exit_mmap() yet (normally
> > > > by holding an mmget() reference).
> > > >
> > > > If we reach this codepath via io_uring, do we have a guarantee that
> > > > the mm_struct that called bch2_direct_write() is still alive and
> > > > hasn't yet gone through exit_mmap() when it is accessed from
> > > > bch2_dio_write_continue()?
> > > >
> > > > I don't know the async direct I/O codepath particularly well, so I
> > > > cc'ed the uring maintainers, who probably know this better than me.
> > >
> > > I _think_ this is fine as-is, even if it does look dubious and bcachefs
> > > arguably should grab an mm ref for this just for safety to avoid future
> > > problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which
> > > means that on the io_uring side it cannot do non-blocking issue of
> > > requests. This is slower as it always punts to an io-wq thread, which
> > > shares the same mm. Hence if the request is alive, there's always a
> > > thread with the same mm alive as well.
> > >
> > > Now if FMODE_NOWAIT was set, then the original task could exit. I'd need
> > > to dig a bit deeper to verify that would always be safe and there's not
> > > a of time today with a few days off in the US looming, so I'll defer
> > > that to next week. It certainly would be fine with an mm ref grabbed.
> >
> > Wouldn't delivery of completions be tied to an address space (not a
> > process) like it is for aio?
> 
> An io_uring instance is primarily exposed to userspace as a file
> descriptor, so AFAIK it is possible for the io_uring instance to live
> beyond when the last mmput() happens. io_uring initially only holds an
> mmgrab() reference on the MM (a comment above that explains: "This is
> just grabbed for accounting purposes"), which I think is not enough to
> make it stable enough for kthread_use_mm(); I think in io_uring, only
> the worker threads actually keep the MM fully alive (and AFAIK the
> uring worker threads can exit before the uring instance itself is torn
> down).
> 
> To receive io_uring completions, there are multiple ways, but they
> don't use a pointer from the io_uring instance to the MM to access
> userspace memory. Instead, you can have a VMA that points to the
> io_uring instance, created by calling mmap() on the io_uring fd; or
> alternatively, with IORING_SETUP_NO_MMAP, you can have io_uring grab
> references to userspace-provided pages.
> 
> On top of that, I think it might currently be possible to use the
> io_uring file descriptor from another task to submit work. (That would
> probably be fairly nonsensical, but I think the kernel does not
> currently prevent it.)

Ok, that's a wrinkle.

Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A
very cursory glance leads me to suspect "no", it seems like this is a
bug if io_uring is allowed on bcachefs at all.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 21:08       ` Kent Overstreet
@ 2024-11-27 21:16         ` Jens Axboe
  2024-11-27 21:27           ` Kent Overstreet
  2024-11-27 21:39           ` Jann Horn
  0 siblings, 2 replies; 17+ messages in thread
From: Jens Axboe @ 2024-11-27 21:16 UTC (permalink / raw)
  To: Kent Overstreet, Jann Horn
  Cc: linux-bcachefs, kernel list, Pavel Begunkov, io-uring

On 11/27/24 2:08 PM, Kent Overstreet wrote:
> On Wed, Nov 27, 2024 at 09:44:21PM +0100, Jann Horn wrote:
>> On Wed, Nov 27, 2024 at 9:25?PM Kent Overstreet
>> <[email protected]> wrote:
>>> On Wed, Nov 27, 2024 at 11:09:14AM -0700, Jens Axboe wrote:
>>>> On 11/27/24 9:57 AM, Jann Horn wrote:
>>>>> Hi!
>>>>>
>>>>> In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer
>>>>> to an mm_struct. This pointer is grabbed in bch2_direct_write()
>>>>> (without any kind of refcount increment), and used in
>>>>> bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm()
>>>>> which are used to enable userspace memory access from kthread context.
>>>>> I believe kthread_use_mm()/kthread_unuse_mm() require that the caller
>>>>> guarantees that the MM hasn't gone through exit_mmap() yet (normally
>>>>> by holding an mmget() reference).
>>>>>
>>>>> If we reach this codepath via io_uring, do we have a guarantee that
>>>>> the mm_struct that called bch2_direct_write() is still alive and
>>>>> hasn't yet gone through exit_mmap() when it is accessed from
>>>>> bch2_dio_write_continue()?
>>>>>
>>>>> I don't know the async direct I/O codepath particularly well, so I
>>>>> cc'ed the uring maintainers, who probably know this better than me.
>>>>
>>>> I _think_ this is fine as-is, even if it does look dubious and bcachefs
>>>> arguably should grab an mm ref for this just for safety to avoid future
>>>> problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which
>>>> means that on the io_uring side it cannot do non-blocking issue of
>>>> requests. This is slower as it always punts to an io-wq thread, which
>>>> shares the same mm. Hence if the request is alive, there's always a
>>>> thread with the same mm alive as well.
>>>>
>>>> Now if FMODE_NOWAIT was set, then the original task could exit. I'd need
>>>> to dig a bit deeper to verify that would always be safe and there's not
>>>> a of time today with a few days off in the US looming, so I'll defer
>>>> that to next week. It certainly would be fine with an mm ref grabbed.
>>>
>>> Wouldn't delivery of completions be tied to an address space (not a
>>> process) like it is for aio?
>>
>> An io_uring instance is primarily exposed to userspace as a file
>> descriptor, so AFAIK it is possible for the io_uring instance to live
>> beyond when the last mmput() happens. io_uring initially only holds an
>> mmgrab() reference on the MM (a comment above that explains: "This is
>> just grabbed for accounting purposes"), which I think is not enough to
>> make it stable enough for kthread_use_mm(); I think in io_uring, only
>> the worker threads actually keep the MM fully alive (and AFAIK the
>> uring worker threads can exit before the uring instance itself is torn
>> down).
>>
>> To receive io_uring completions, there are multiple ways, but they
>> don't use a pointer from the io_uring instance to the MM to access
>> userspace memory. Instead, you can have a VMA that points to the
>> io_uring instance, created by calling mmap() on the io_uring fd; or
>> alternatively, with IORING_SETUP_NO_MMAP, you can have io_uring grab
>> references to userspace-provided pages.
>>
>> On top of that, I think it might currently be possible to use the
>> io_uring file descriptor from another task to submit work. (That would
>> probably be fairly nonsensical, but I think the kernel does not
>> currently prevent it.)
> 
> Ok, that's a wrinkle.

I'd argue the fact that you are using an mm from a different process
without grabbing a reference is the wrinkle. I just don't think it's a
problem right now, but it could be... aio is tied to the mm because of
how it does completions, potentially, and hence needs this exit_aio()
hack because of that. aio also doesn't care, because it doesn't care
about blocking - it'll happily block during issue.

> Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A
> very cursory glance leads me to suspect "no", it seems like this is a
> bug if io_uring is allowed on bcachefs at all.

I just looked at bcachefs dio writes, which look to be the only case of
this. And yes, for writes, if FMODE_NOWAIT isn't set, then io-wq is
always involved for the IO.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 21:16         ` Jens Axboe
@ 2024-11-27 21:27           ` Kent Overstreet
  2024-11-27 21:51             ` Jens Axboe
  2024-11-27 21:39           ` Jann Horn
  1 sibling, 1 reply; 17+ messages in thread
From: Kent Overstreet @ 2024-11-27 21:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jann Horn, linux-bcachefs, kernel list, Pavel Begunkov, io-uring

On Wed, Nov 27, 2024 at 02:16:24PM -0700, Jens Axboe wrote:
> I'd argue the fact that you are using an mm from a different process
> without grabbing a reference is the wrinkle. I just don't think it's a
> problem right now, but it could be... aio is tied to the mm because of
> how it does completions, potentially, and hence needs this exit_aio()
> hack because of that. aio also doesn't care, because it doesn't care
> about blocking - it'll happily block during issue.

I'm not trying to debate who's bug it is, I'm just checking if I need to
backport a security fix.

> > Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A
> > very cursory glance leads me to suspect "no", it seems like this is a
> > bug if io_uring is allowed on bcachefs at all.
> 
> I just looked at bcachefs dio writes, which look to be the only case of
> this. And yes, for writes, if FMODE_NOWAIT isn't set, then io-wq is
> always involved for the IO.

Ok, sounds like we're in the clear. I already started writing the patch,
so it'll just be a "now we can turn on FMODE_NOWAIT" instead of a
bugfix.

By the way, did the lifetime issue that was causing umount/remount to
fail ever get resolved? I've currently got no test coverage for
io_uring, would be nice to flip that back on.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 21:27           ` Kent Overstreet
@ 2024-11-27 21:51             ` Jens Axboe
  2024-11-27 21:58               ` Kent Overstreet
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2024-11-27 21:51 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jann Horn, linux-bcachefs, kernel list, Pavel Begunkov, io-uring

On Wed, Nov 27, 2024 at 2:27?PM Kent Overstreet <[email protected]> wrote:
>
> On Wed, Nov 27, 2024 at 02:16:24PM -0700, Jens Axboe wrote:
> > I'd argue the fact that you are using an mm from a different process
> > without grabbing a reference is the wrinkle. I just don't think it's a
> > problem right now, but it could be... aio is tied to the mm because of
> > how it does completions, potentially, and hence needs this exit_aio()
> > hack because of that. aio also doesn't care, because it doesn't care
> > about blocking - it'll happily block during issue.
>
> I'm not trying to debate who's bug it is, I'm just checking if I need to
> backport a security fix.

Not trying to place blame.

> > > Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A
> > > very cursory glance leads me to suspect "no", it seems like this is a
> > > bug if io_uring is allowed on bcachefs at all.
> >
> > I just looked at bcachefs dio writes, which look to be the only case of
> > this. And yes, for writes, if FMODE_NOWAIT isn't set, then io-wq is
> > always involved for the IO.
>
> Ok, sounds like we're in the clear. I already started writing the
> patch, so it'll just be a "now we can turn on FMODE_NOWAIT" instead of
> a bugfix.

That sounds good - and FMODE_NOWAIT will be a good addition. It'll make
RWF_NOWAIT work, and things like io_uring will also work better as it
won't need to needlessly punt to an io-wq worker to complete this IO.

> By the way, did the lifetime issue that was causing umount/remount to
> fail ever get resolved? I've currently got no test coverage for
> io_uring, would be nice to flip that back on.

Nope, I do have an updated branch since then, but it's still sitting
waiting on getting a bit more love. I suspect it'll be done for 6.14.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 21:51             ` Jens Axboe
@ 2024-11-27 21:58               ` Kent Overstreet
  2024-11-27 21:59                 ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Kent Overstreet @ 2024-11-27 21:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jann Horn, linux-bcachefs, kernel list, Pavel Begunkov, io-uring

On Wed, Nov 27, 2024 at 02:51:26PM -0700, Jens Axboe wrote:
> On Wed, Nov 27, 2024 at 2:27?PM Kent Overstreet <[email protected]> wrote:
> >
> > On Wed, Nov 27, 2024 at 02:16:24PM -0700, Jens Axboe wrote:
> > > I'd argue the fact that you are using an mm from a different process
> > > without grabbing a reference is the wrinkle. I just don't think it's a
> > > problem right now, but it could be... aio is tied to the mm because of
> > > how it does completions, potentially, and hence needs this exit_aio()
> > > hack because of that. aio also doesn't care, because it doesn't care
> > > about blocking - it'll happily block during issue.
> >
> > I'm not trying to debate who's bug it is, I'm just checking if I need to
> > backport a security fix.
> 
> Not trying to place blame.
> 
> > > > Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A
> > > > very cursory glance leads me to suspect "no", it seems like this is a
> > > > bug if io_uring is allowed on bcachefs at all.
> > >
> > > I just looked at bcachefs dio writes, which look to be the only case of
> > > this. And yes, for writes, if FMODE_NOWAIT isn't set, then io-wq is
> > > always involved for the IO.
> >
> > Ok, sounds like we're in the clear. I already started writing the
> > patch, so it'll just be a "now we can turn on FMODE_NOWAIT" instead of
> > a bugfix.
> 
> That sounds good - and FMODE_NOWAIT will be a good addition. It'll make
> RWF_NOWAIT work, and things like io_uring will also work better as it
> won't need to needlessly punt to an io-wq worker to complete this IO.
> 
> > By the way, did the lifetime issue that was causing umount/remount to
> > fail ever get resolved? I've currently got no test coverage for
> > io_uring, would be nice to flip that back on.
> 
> Nope, I do have an updated branch since then, but it's still sitting
> waiting on getting a bit more love. I suspect it'll be done for 6.14.

Alright - if you want to ping me when that's ready, along with any other
knobs I should flip on for io_uring support, I'll flip io_uring back on
in my test infrastructure at that time.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 21:58               ` Kent Overstreet
@ 2024-11-27 21:59                 ` Jens Axboe
  0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2024-11-27 21:59 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jann Horn, linux-bcachefs, kernel list, Pavel Begunkov, io-uring

On 11/27/24 2:58 PM, Kent Overstreet wrote:
>>> By the way, did the lifetime issue that was causing umount/remount to
>>> fail ever get resolved? I've currently got no test coverage for
>>> io_uring, would be nice to flip that back on.
>>
>> Nope, I do have an updated branch since then, but it's still sitting
>> waiting on getting a bit more love. I suspect it'll be done for 6.14.
> 
> Alright - if you want to ping me when that's ready, along with any other
> knobs I should flip on for io_uring support, I'll flip io_uring back on
> in my test infrastructure at that time.

Will do, thanks.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 21:16         ` Jens Axboe
  2024-11-27 21:27           ` Kent Overstreet
@ 2024-11-27 21:39           ` Jann Horn
  2024-11-27 21:52             ` Jens Axboe
  1 sibling, 1 reply; 17+ messages in thread
From: Jann Horn @ 2024-11-27 21:39 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Kent Overstreet, linux-bcachefs, kernel list, Pavel Begunkov,
	io-uring

On Wed, Nov 27, 2024 at 10:16 PM Jens Axboe <[email protected]> wrote:
> On 11/27/24 2:08 PM, Kent Overstreet wrote:
> > On Wed, Nov 27, 2024 at 09:44:21PM +0100, Jann Horn wrote:
> >> On Wed, Nov 27, 2024 at 9:25?PM Kent Overstreet
> >> <[email protected]> wrote:
> >>> On Wed, Nov 27, 2024 at 11:09:14AM -0700, Jens Axboe wrote:
> >>>> On 11/27/24 9:57 AM, Jann Horn wrote:
> >>>>> Hi!
> >>>>>
> >>>>> In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer
> >>>>> to an mm_struct. This pointer is grabbed in bch2_direct_write()
> >>>>> (without any kind of refcount increment), and used in
> >>>>> bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm()
> >>>>> which are used to enable userspace memory access from kthread context.
> >>>>> I believe kthread_use_mm()/kthread_unuse_mm() require that the caller
> >>>>> guarantees that the MM hasn't gone through exit_mmap() yet (normally
> >>>>> by holding an mmget() reference).
> >>>>>
> >>>>> If we reach this codepath via io_uring, do we have a guarantee that
> >>>>> the mm_struct that called bch2_direct_write() is still alive and
> >>>>> hasn't yet gone through exit_mmap() when it is accessed from
> >>>>> bch2_dio_write_continue()?
> >>>>>
> >>>>> I don't know the async direct I/O codepath particularly well, so I
> >>>>> cc'ed the uring maintainers, who probably know this better than me.
> >>>>
> >>>> I _think_ this is fine as-is, even if it does look dubious and bcachefs
> >>>> arguably should grab an mm ref for this just for safety to avoid future
> >>>> problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which
> >>>> means that on the io_uring side it cannot do non-blocking issue of
> >>>> requests. This is slower as it always punts to an io-wq thread, which
> >>>> shares the same mm. Hence if the request is alive, there's always a
> >>>> thread with the same mm alive as well.
> >>>>
> >>>> Now if FMODE_NOWAIT was set, then the original task could exit. I'd need
> >>>> to dig a bit deeper to verify that would always be safe and there's not
> >>>> a of time today with a few days off in the US looming, so I'll defer
> >>>> that to next week. It certainly would be fine with an mm ref grabbed.
> >>>
> >>> Wouldn't delivery of completions be tied to an address space (not a
> >>> process) like it is for aio?
> >>
> >> An io_uring instance is primarily exposed to userspace as a file
> >> descriptor, so AFAIK it is possible for the io_uring instance to live
> >> beyond when the last mmput() happens. io_uring initially only holds an
> >> mmgrab() reference on the MM (a comment above that explains: "This is
> >> just grabbed for accounting purposes"), which I think is not enough to
> >> make it stable enough for kthread_use_mm(); I think in io_uring, only
> >> the worker threads actually keep the MM fully alive (and AFAIK the
> >> uring worker threads can exit before the uring instance itself is torn
> >> down).
> >>
> >> To receive io_uring completions, there are multiple ways, but they
> >> don't use a pointer from the io_uring instance to the MM to access
> >> userspace memory. Instead, you can have a VMA that points to the
> >> io_uring instance, created by calling mmap() on the io_uring fd; or
> >> alternatively, with IORING_SETUP_NO_MMAP, you can have io_uring grab
> >> references to userspace-provided pages.
> >>
> >> On top of that, I think it might currently be possible to use the
> >> io_uring file descriptor from another task to submit work. (That would
> >> probably be fairly nonsensical, but I think the kernel does not
> >> currently prevent it.)
> >
> > Ok, that's a wrinkle.
>
> I'd argue the fact that you are using an mm from a different process
> without grabbing a reference is the wrinkle. I just don't think it's a
> problem right now, but it could be... aio is tied to the mm because of
> how it does completions, potentially, and hence needs this exit_aio()
> hack because of that. aio also doesn't care, because it doesn't care
> about blocking - it'll happily block during issue.
>
> > Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A
> > very cursory glance leads me to suspect "no", it seems like this is a
> > bug if io_uring is allowed on bcachefs at all.
>
> I just looked at bcachefs dio writes, which look to be the only case of
> this. And yes, for writes, if FMODE_NOWAIT isn't set, then io-wq is
> always involved for the IO.

I guess it could be an issue if the iocb can outlive the io-wq thread?
Like, a userspace task creates a uring instance and starts a write;
the write will be punted to a uring worker because of missing
FMODE_NOWAIT; then the uring worker enters io_write() and starts a
write on a kiocb. Can this write initiated from the worker be async?
And could the uring worker exit (for example, because the userspace
task exited) while the kiocb is still in flight?

Anyway, I see Kent just posted a patch to make this more robust.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 21:39           ` Jann Horn
@ 2024-11-27 21:52             ` Jens Axboe
  2024-11-27 21:53               ` Jann Horn
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2024-11-27 21:52 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kent Overstreet, linux-bcachefs, kernel list, Pavel Begunkov,
	io-uring

> On Wed, Nov 27, 2024 at 10:16?PM Jens Axboe <[email protected]> wrote:
> > On 11/27/24 2:08 PM, Kent Overstreet wrote:
> > > On Wed, Nov 27, 2024 at 09:44:21PM +0100, Jann Horn wrote:
> > >> On Wed, Nov 27, 2024 at 9:25?PM Kent Overstreet
> > >> <[email protected]> wrote:
> > >>> On Wed, Nov 27, 2024 at 11:09:14AM -0700, Jens Axboe wrote:
> > >>>> On 11/27/24 9:57 AM, Jann Horn wrote:
> > >>>>> Hi!
> > >>>>>
> > >>>>> In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer
> > >>>>> to an mm_struct. This pointer is grabbed in bch2_direct_write()
> > >>>>> (without any kind of refcount increment), and used in
> > >>>>> bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm()
> > >>>>> which are used to enable userspace memory access from kthread context.
> > >>>>> I believe kthread_use_mm()/kthread_unuse_mm() require that the caller
> > >>>>> guarantees that the MM hasn't gone through exit_mmap() yet (normally
> > >>>>> by holding an mmget() reference).
> > >>>>>
> > >>>>> If we reach this codepath via io_uring, do we have a guarantee that
> > >>>>> the mm_struct that called bch2_direct_write() is still alive and
> > >>>>> hasn't yet gone through exit_mmap() when it is accessed from
> > >>>>> bch2_dio_write_continue()?
> > >>>>>
> > >>>>> I don't know the async direct I/O codepath particularly well, so I
> > >>>>> cc'ed the uring maintainers, who probably know this better than me.
> > >>>>
> > >>>> I _think_ this is fine as-is, even if it does look dubious and bcachefs
> > >>>> arguably should grab an mm ref for this just for safety to avoid future
> > >>>> problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which
> > >>>> means that on the io_uring side it cannot do non-blocking issue of
> > >>>> requests. This is slower as it always punts to an io-wq thread, which
> > >>>> shares the same mm. Hence if the request is alive, there's always a
> > >>>> thread with the same mm alive as well.
> > >>>>
> > >>>> Now if FMODE_NOWAIT was set, then the original task could exit. I'd need
> > >>>> to dig a bit deeper to verify that would always be safe and there's not
> > >>>> a of time today with a few days off in the US looming, so I'll defer
> > >>>> that to next week. It certainly would be fine with an mm ref grabbed.
> > >>>
> > >>> Wouldn't delivery of completions be tied to an address space (not a
> > >>> process) like it is for aio?
> > >>
> > >> An io_uring instance is primarily exposed to userspace as a file
> > >> descriptor, so AFAIK it is possible for the io_uring instance to live
> > >> beyond when the last mmput() happens. io_uring initially only holds an
> > >> mmgrab() reference on the MM (a comment above that explains: "This is
> > >> just grabbed for accounting purposes"), which I think is not enough to
> > >> make it stable enough for kthread_use_mm(); I think in io_uring, only
> > >> the worker threads actually keep the MM fully alive (and AFAIK the
> > >> uring worker threads can exit before the uring instance itself is torn
> > >> down).
> > >>
> > >> To receive io_uring completions, there are multiple ways, but they
> > >> don't use a pointer from the io_uring instance to the MM to access
> > >> userspace memory. Instead, you can have a VMA that points to the
> > >> io_uring instance, created by calling mmap() on the io_uring fd; or
> > >> alternatively, with IORING_SETUP_NO_MMAP, you can have io_uring grab
> > >> references to userspace-provided pages.
> > >>
> > >> On top of that, I think it might currently be possible to use the
> > >> io_uring file descriptor from another task to submit work. (That would
> > >> probably be fairly nonsensical, but I think the kernel does not
> > >> currently prevent it.)
> > >
> > > Ok, that's a wrinkle.
> >
> > I'd argue the fact that you are using an mm from a different process
> > without grabbing a reference is the wrinkle. I just don't think it's a
> > problem right now, but it could be... aio is tied to the mm because of
> > how it does completions, potentially, and hence needs this exit_aio()
> > hack because of that. aio also doesn't care, because it doesn't care
> > about blocking - it'll happily block during issue.
> >
> > > Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A
> > > very cursory glance leads me to suspect "no", it seems like this is a
> > > bug if io_uring is allowed on bcachefs at all.
> >
> > I just looked at bcachefs dio writes, which look to be the only case of
> > this. And yes, for writes, if FMODE_NOWAIT isn't set, then io-wq is
> > always involved for the IO.
>
> I guess it could be an issue if the iocb can outlive the io-wq thread?
> Like, a userspace task creates a uring instance and starts a write;
> the write will be punted to a uring worker because of missing
> FMODE_NOWAIT; then the uring worker enters io_write() and starts a
> write on a kiocb. Can this write initiated from the worker be async?
> And could the uring worker exit (for example, because the userspace
> task exited) while the kiocb is still in flight?

No, any write (or read, whatever) from an io-wq worker is always
blocking / sync. That's the main reason for them existing, to be able to
do blocking issue. And that's what they always do.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 21:52             ` Jens Axboe
@ 2024-11-27 21:53               ` Jann Horn
  0 siblings, 0 replies; 17+ messages in thread
From: Jann Horn @ 2024-11-27 21:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Kent Overstreet, linux-bcachefs, kernel list, Pavel Begunkov,
	io-uring

On Wed, Nov 27, 2024 at 10:52 PM Jens Axboe <[email protected]> wrote:
> > On Wed, Nov 27, 2024 at 10:16?PM Jens Axboe <[email protected]> wrote:
> > > On 11/27/24 2:08 PM, Kent Overstreet wrote:
> > > > On Wed, Nov 27, 2024 at 09:44:21PM +0100, Jann Horn wrote:
> > > >> On Wed, Nov 27, 2024 at 9:25?PM Kent Overstreet
> > > >> <[email protected]> wrote:
> > > >>> On Wed, Nov 27, 2024 at 11:09:14AM -0700, Jens Axboe wrote:
> > > >>>> On 11/27/24 9:57 AM, Jann Horn wrote:
> > > >>>>> Hi!
> > > >>>>>
> > > >>>>> In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer
> > > >>>>> to an mm_struct. This pointer is grabbed in bch2_direct_write()
> > > >>>>> (without any kind of refcount increment), and used in
> > > >>>>> bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm()
> > > >>>>> which are used to enable userspace memory access from kthread context.
> > > >>>>> I believe kthread_use_mm()/kthread_unuse_mm() require that the caller
> > > >>>>> guarantees that the MM hasn't gone through exit_mmap() yet (normally
> > > >>>>> by holding an mmget() reference).
> > > >>>>>
> > > >>>>> If we reach this codepath via io_uring, do we have a guarantee that
> > > >>>>> the mm_struct that called bch2_direct_write() is still alive and
> > > >>>>> hasn't yet gone through exit_mmap() when it is accessed from
> > > >>>>> bch2_dio_write_continue()?
> > > >>>>>
> > > >>>>> I don't know the async direct I/O codepath particularly well, so I
> > > >>>>> cc'ed the uring maintainers, who probably know this better than me.
> > > >>>>
> > > >>>> I _think_ this is fine as-is, even if it does look dubious and bcachefs
> > > >>>> arguably should grab an mm ref for this just for safety to avoid future
> > > >>>> problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which
> > > >>>> means that on the io_uring side it cannot do non-blocking issue of
> > > >>>> requests. This is slower as it always punts to an io-wq thread, which
> > > >>>> shares the same mm. Hence if the request is alive, there's always a
> > > >>>> thread with the same mm alive as well.
> > > >>>>
> > > >>>> Now if FMODE_NOWAIT was set, then the original task could exit. I'd need
> > > >>>> to dig a bit deeper to verify that would always be safe and there's not
> > > >>>> a of time today with a few days off in the US looming, so I'll defer
> > > >>>> that to next week. It certainly would be fine with an mm ref grabbed.
> > > >>>
> > > >>> Wouldn't delivery of completions be tied to an address space (not a
> > > >>> process) like it is for aio?
> > > >>
> > > >> An io_uring instance is primarily exposed to userspace as a file
> > > >> descriptor, so AFAIK it is possible for the io_uring instance to live
> > > >> beyond when the last mmput() happens. io_uring initially only holds an
> > > >> mmgrab() reference on the MM (a comment above that explains: "This is
> > > >> just grabbed for accounting purposes"), which I think is not enough to
> > > >> make it stable enough for kthread_use_mm(); I think in io_uring, only
> > > >> the worker threads actually keep the MM fully alive (and AFAIK the
> > > >> uring worker threads can exit before the uring instance itself is torn
> > > >> down).
> > > >>
> > > >> To receive io_uring completions, there are multiple ways, but they
> > > >> don't use a pointer from the io_uring instance to the MM to access
> > > >> userspace memory. Instead, you can have a VMA that points to the
> > > >> io_uring instance, created by calling mmap() on the io_uring fd; or
> > > >> alternatively, with IORING_SETUP_NO_MMAP, you can have io_uring grab
> > > >> references to userspace-provided pages.
> > > >>
> > > >> On top of that, I think it might currently be possible to use the
> > > >> io_uring file descriptor from another task to submit work. (That would
> > > >> probably be fairly nonsensical, but I think the kernel does not
> > > >> currently prevent it.)
> > > >
> > > > Ok, that's a wrinkle.
> > >
> > > I'd argue the fact that you are using an mm from a different process
> > > without grabbing a reference is the wrinkle. I just don't think it's a
> > > problem right now, but it could be... aio is tied to the mm because of
> > > how it does completions, potentially, and hence needs this exit_aio()
> > > hack because of that. aio also doesn't care, because it doesn't care
> > > about blocking - it'll happily block during issue.
> > >
> > > > Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A
> > > > very cursory glance leads me to suspect "no", it seems like this is a
> > > > bug if io_uring is allowed on bcachefs at all.
> > >
> > > I just looked at bcachefs dio writes, which look to be the only case of
> > > this. And yes, for writes, if FMODE_NOWAIT isn't set, then io-wq is
> > > always involved for the IO.
> >
> > I guess it could be an issue if the iocb can outlive the io-wq thread?
> > Like, a userspace task creates a uring instance and starts a write;
> > the write will be punted to a uring worker because of missing
> > FMODE_NOWAIT; then the uring worker enters io_write() and starts a
> > write on a kiocb. Can this write initiated from the worker be async?
> > And could the uring worker exit (for example, because the userspace
> > task exited) while the kiocb is still in flight?
>
> No, any write (or read, whatever) from an io-wq worker is always
> blocking / sync. That's the main reason for them existing, to be able to
> do blocking issue. And that's what they always do.

Aaah, ok, thanks for the explanation.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: bcachefs: suspicious mm pointer in struct dio_write
  2024-11-27 16:57 bcachefs: suspicious mm pointer in struct dio_write Jann Horn
  2024-11-27 18:09 ` Jens Axboe
@ 2024-11-27 20:23 ` Kent Overstreet
  1 sibling, 0 replies; 17+ messages in thread
From: Kent Overstreet @ 2024-11-27 20:23 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-bcachefs, kernel list, Jens Axboe, Pavel Begunkov, io-uring

On Wed, Nov 27, 2024 at 05:57:03PM +0100, Jann Horn wrote:
> Hi!
> 
> In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer
> to an mm_struct. This pointer is grabbed in bch2_direct_write()
> (without any kind of refcount increment), and used in
> bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm()
> which are used to enable userspace memory access from kthread context.
> I believe kthread_use_mm()/kthread_unuse_mm() require that the caller
> guarantees that the MM hasn't gone through exit_mmap() yet (normally
> by holding an mmget() reference).
> 
> If we reach this codepath via io_uring, do we have a guarantee that
> the mm_struct that called bch2_direct_write() is still alive and
> hasn't yet gone through exit_mmap() when it is accessed from
> bch2_dio_write_continue()?
> 
> I don't know the async direct I/O codepath particularly well, so I
> cc'ed the uring maintainers, who probably know this better than me.

I don't know about io_uring but aio guarantees that outstanding kiocbs
are completed before exiting the mm_struct - I would expect some sort of
similar guarantee to hold, because otherwise where are we going to
deliver the completion?

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-11-27 21:59 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-27 16:57 bcachefs: suspicious mm pointer in struct dio_write Jann Horn
2024-11-27 18:09 ` Jens Axboe
2024-11-27 19:43   ` Jann Horn
2024-11-27 20:01     ` Jens Axboe
2024-11-27 20:31       ` Kent Overstreet
2024-11-27 20:25   ` Kent Overstreet
2024-11-27 20:44     ` Jann Horn
2024-11-27 21:08       ` Kent Overstreet
2024-11-27 21:16         ` Jens Axboe
2024-11-27 21:27           ` Kent Overstreet
2024-11-27 21:51             ` Jens Axboe
2024-11-27 21:58               ` Kent Overstreet
2024-11-27 21:59                 ` Jens Axboe
2024-11-27 21:39           ` Jann Horn
2024-11-27 21:52             ` Jens Axboe
2024-11-27 21:53               ` Jann Horn
2024-11-27 20:23 ` Kent Overstreet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox