* bcachefs: suspicious mm pointer in struct dio_write @ 2024-11-27 16:57 Jann Horn 2024-11-27 18:09 ` Jens Axboe 2024-11-27 20:23 ` Kent Overstreet 0 siblings, 2 replies; 17+ messages in thread From: Jann Horn @ 2024-11-27 16:57 UTC (permalink / raw) To: Kent Overstreet, linux-bcachefs Cc: kernel list, Jens Axboe, Pavel Begunkov, io-uring Hi! In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer to an mm_struct. This pointer is grabbed in bch2_direct_write() (without any kind of refcount increment), and used in bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm() which are used to enable userspace memory access from kthread context. I believe kthread_use_mm()/kthread_unuse_mm() require that the caller guarantees that the MM hasn't gone through exit_mmap() yet (normally by holding an mmget() reference). If we reach this codepath via io_uring, do we have a guarantee that the mm_struct that called bch2_direct_write() is still alive and hasn't yet gone through exit_mmap() when it is accessed from bch2_dio_write_continue()? I don't know the async direct I/O codepath particularly well, so I cc'ed the uring maintainers, who probably know this better than me. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 16:57 bcachefs: suspicious mm pointer in struct dio_write Jann Horn @ 2024-11-27 18:09 ` Jens Axboe 2024-11-27 19:43 ` Jann Horn 2024-11-27 20:25 ` Kent Overstreet 2024-11-27 20:23 ` Kent Overstreet 1 sibling, 2 replies; 17+ messages in thread From: Jens Axboe @ 2024-11-27 18:09 UTC (permalink / raw) To: Jann Horn, Kent Overstreet, linux-bcachefs Cc: kernel list, Pavel Begunkov, io-uring On 11/27/24 9:57 AM, Jann Horn wrote: > Hi! > > In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer > to an mm_struct. This pointer is grabbed in bch2_direct_write() > (without any kind of refcount increment), and used in > bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm() > which are used to enable userspace memory access from kthread context. > I believe kthread_use_mm()/kthread_unuse_mm() require that the caller > guarantees that the MM hasn't gone through exit_mmap() yet (normally > by holding an mmget() reference). > > If we reach this codepath via io_uring, do we have a guarantee that > the mm_struct that called bch2_direct_write() is still alive and > hasn't yet gone through exit_mmap() when it is accessed from > bch2_dio_write_continue()? > > I don't know the async direct I/O codepath particularly well, so I > cc'ed the uring maintainers, who probably know this better than me. I _think_ this is fine as-is, even if it does look dubious and bcachefs arguably should grab an mm ref for this just for safety to avoid future problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which means that on the io_uring side it cannot do non-blocking issue of requests. This is slower as it always punts to an io-wq thread, which shares the same mm. Hence if the request is alive, there's always a thread with the same mm alive as well. Now if FMODE_NOWAIT was set, then the original task could exit. I'd need to dig a bit deeper to verify that would always be safe and there's not a of time today with a few days off in the US looming, so I'll defer that to next week. It certainly would be fine with an mm ref grabbed. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 18:09 ` Jens Axboe @ 2024-11-27 19:43 ` Jann Horn 2024-11-27 20:01 ` Jens Axboe 2024-11-27 20:25 ` Kent Overstreet 1 sibling, 1 reply; 17+ messages in thread From: Jann Horn @ 2024-11-27 19:43 UTC (permalink / raw) To: Jens Axboe Cc: Kent Overstreet, linux-bcachefs, kernel list, Pavel Begunkov, io-uring On Wed, Nov 27, 2024 at 7:09 PM Jens Axboe <[email protected]> wrote: > On 11/27/24 9:57 AM, Jann Horn wrote: > > Hi! > > > > In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer > > to an mm_struct. This pointer is grabbed in bch2_direct_write() > > (without any kind of refcount increment), and used in > > bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm() > > which are used to enable userspace memory access from kthread context. > > I believe kthread_use_mm()/kthread_unuse_mm() require that the caller > > guarantees that the MM hasn't gone through exit_mmap() yet (normally > > by holding an mmget() reference). > > > > If we reach this codepath via io_uring, do we have a guarantee that > > the mm_struct that called bch2_direct_write() is still alive and > > hasn't yet gone through exit_mmap() when it is accessed from > > bch2_dio_write_continue()? > > > > I don't know the async direct I/O codepath particularly well, so I > > cc'ed the uring maintainers, who probably know this better than me. > > I _think_ this is fine as-is, even if it does look dubious and bcachefs > arguably should grab an mm ref for this just for safety to avoid future > problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which > means that on the io_uring side it cannot do non-blocking issue of > requests. This is slower as it always punts to an io-wq thread, which > shares the same mm. Hence if the request is alive, there's always a > thread with the same mm alive as well. > > Now if FMODE_NOWAIT was set, then the original task could exit. I'd need > to dig a bit deeper to verify that would always be safe and there's not > a of time today with a few days off in the US looming, so I'll defer > that to next week. It certainly would be fine with an mm ref grabbed. Ah, thanks for looking into it! I missed this implication of not setting FMODE_NOWAIT. Anyway, what you said sounds like it would be cleaner for bcachefs to grab its own extra reference, maybe by initially grabbing an mm reference with mmgrab() in bch2_direct_write(), and then use mmget_not_zero() in bch2_dio_write_continue() to ensure the MM is stable. What do other file systems do for this? I think they normally grab page references so that they don't need the MM anymore when asynchronously fulfilling the request, right? Like in iomap_dio_bio_iter(), which uses bio_iov_iter_get_pages() to grab references to the pages corresponding to the userspace regions in dio->submit.iter? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 19:43 ` Jann Horn @ 2024-11-27 20:01 ` Jens Axboe 2024-11-27 20:31 ` Kent Overstreet 0 siblings, 1 reply; 17+ messages in thread From: Jens Axboe @ 2024-11-27 20:01 UTC (permalink / raw) To: Jann Horn Cc: Kent Overstreet, linux-bcachefs, kernel list, Pavel Begunkov, io-uring On 11/27/24 12:43 PM, Jann Horn wrote: > On Wed, Nov 27, 2024 at 7:09?PM Jens Axboe <[email protected]> wrote: >> On 11/27/24 9:57 AM, Jann Horn wrote: >>> Hi! >>> >>> In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer >>> to an mm_struct. This pointer is grabbed in bch2_direct_write() >>> (without any kind of refcount increment), and used in >>> bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm() >>> which are used to enable userspace memory access from kthread context. >>> I believe kthread_use_mm()/kthread_unuse_mm() require that the caller >>> guarantees that the MM hasn't gone through exit_mmap() yet (normally >>> by holding an mmget() reference). >>> >>> If we reach this codepath via io_uring, do we have a guarantee that >>> the mm_struct that called bch2_direct_write() is still alive and >>> hasn't yet gone through exit_mmap() when it is accessed from >>> bch2_dio_write_continue()? >>> >>> I don't know the async direct I/O codepath particularly well, so I >>> cc'ed the uring maintainers, who probably know this better than me. >> >> I _think_ this is fine as-is, even if it does look dubious and bcachefs >> arguably should grab an mm ref for this just for safety to avoid future >> problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which >> means that on the io_uring side it cannot do non-blocking issue of >> requests. This is slower as it always punts to an io-wq thread, which >> shares the same mm. Hence if the request is alive, there's always a >> thread with the same mm alive as well. >> >> Now if FMODE_NOWAIT was set, then the original task could exit. I'd need >> to dig a bit deeper to verify that would always be safe and there's not >> a of time today with a few days off in the US looming, so I'll defer >> that to next week. It certainly would be fine with an mm ref grabbed. > > Ah, thanks for looking into it! I missed this implication of not > setting FMODE_NOWAIT. > > Anyway, what you said sounds like it would be cleaner for bcachefs to > grab its own extra reference, maybe by initially grabbing an mm > reference with mmgrab() in bch2_direct_write(), and then use > mmget_not_zero() in bch2_dio_write_continue() to ensure the MM is > stable. Yep I think that would definitely make it more sturdy, and also less headscratchy in terms of being able to verify it's actually safe. > What do other file systems do for this? I think they normally grab > page references so that they don't need the MM anymore when > asynchronously fulfilling the request, right? Like in > iomap_dio_bio_iter(), which uses bio_iov_iter_get_pages() to grab > references to the pages corresponding to the userspace regions in > dio->submit.iter? Not aware of anything else doing it like this, where it's punted to a kthread and then the mm used from there. The upfront page getting/mapping is the common approach, like you described. Which does seem like a much better choice, rather than needing to rely on the mm in a kworker. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 20:01 ` Jens Axboe @ 2024-11-27 20:31 ` Kent Overstreet 0 siblings, 0 replies; 17+ messages in thread From: Kent Overstreet @ 2024-11-27 20:31 UTC (permalink / raw) To: Jens Axboe Cc: Jann Horn, linux-bcachefs, kernel list, Pavel Begunkov, io-uring On Wed, Nov 27, 2024 at 01:01:31PM -0700, Jens Axboe wrote: > On 11/27/24 12:43 PM, Jann Horn wrote: > > On Wed, Nov 27, 2024 at 7:09?PM Jens Axboe <[email protected]> wrote: > >> On 11/27/24 9:57 AM, Jann Horn wrote: > >>> Hi! > >>> > >>> In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer > >>> to an mm_struct. This pointer is grabbed in bch2_direct_write() > >>> (without any kind of refcount increment), and used in > >>> bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm() > >>> which are used to enable userspace memory access from kthread context. > >>> I believe kthread_use_mm()/kthread_unuse_mm() require that the caller > >>> guarantees that the MM hasn't gone through exit_mmap() yet (normally > >>> by holding an mmget() reference). > >>> > >>> If we reach this codepath via io_uring, do we have a guarantee that > >>> the mm_struct that called bch2_direct_write() is still alive and > >>> hasn't yet gone through exit_mmap() when it is accessed from > >>> bch2_dio_write_continue()? > >>> > >>> I don't know the async direct I/O codepath particularly well, so I > >>> cc'ed the uring maintainers, who probably know this better than me. > >> > >> I _think_ this is fine as-is, even if it does look dubious and bcachefs > >> arguably should grab an mm ref for this just for safety to avoid future > >> problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which > >> means that on the io_uring side it cannot do non-blocking issue of > >> requests. This is slower as it always punts to an io-wq thread, which > >> shares the same mm. Hence if the request is alive, there's always a > >> thread with the same mm alive as well. > >> > >> Now if FMODE_NOWAIT was set, then the original task could exit. I'd need > >> to dig a bit deeper to verify that would always be safe and there's not > >> a of time today with a few days off in the US looming, so I'll defer > >> that to next week. It certainly would be fine with an mm ref grabbed. > > > > Ah, thanks for looking into it! I missed this implication of not > > setting FMODE_NOWAIT. > > > > Anyway, what you said sounds like it would be cleaner for bcachefs to > > grab its own extra reference, maybe by initially grabbing an mm > > reference with mmgrab() in bch2_direct_write(), and then use > > mmget_not_zero() in bch2_dio_write_continue() to ensure the MM is > > stable. > > Yep I think that would definitely make it more sturdy, and also less > headscratchy in terms of being able to verify it's actually safe. > > > What do other file systems do for this? I think they normally grab > > page references so that they don't need the MM anymore when > > asynchronously fulfilling the request, right? Like in > > iomap_dio_bio_iter(), which uses bio_iov_iter_get_pages() to grab > > references to the pages corresponding to the userspace regions in > > dio->submit.iter? > > Not aware of anything else doing it like this, where it's punted to a > kthread and then the mm used from there. The upfront page > getting/mapping is the common approach, like you described. Which does > seem like a much better choice, rather than needing to rely on the mm in > a kworker. More common, but not necessarily better: the "pin everything up front" approach had the disadvantage that - well, you pinned everything up front: if userspace requests a single multi-gigabyte IO, that is likely not what you want. The old dio code didn't pin everything up front for this reason (but IIRC wasn't fully asynchronous, and also had a silly baked in 64 page limit). ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 18:09 ` Jens Axboe 2024-11-27 19:43 ` Jann Horn @ 2024-11-27 20:25 ` Kent Overstreet 2024-11-27 20:44 ` Jann Horn 1 sibling, 1 reply; 17+ messages in thread From: Kent Overstreet @ 2024-11-27 20:25 UTC (permalink / raw) To: Jens Axboe Cc: Jann Horn, linux-bcachefs, kernel list, Pavel Begunkov, io-uring On Wed, Nov 27, 2024 at 11:09:14AM -0700, Jens Axboe wrote: > On 11/27/24 9:57 AM, Jann Horn wrote: > > Hi! > > > > In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer > > to an mm_struct. This pointer is grabbed in bch2_direct_write() > > (without any kind of refcount increment), and used in > > bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm() > > which are used to enable userspace memory access from kthread context. > > I believe kthread_use_mm()/kthread_unuse_mm() require that the caller > > guarantees that the MM hasn't gone through exit_mmap() yet (normally > > by holding an mmget() reference). > > > > If we reach this codepath via io_uring, do we have a guarantee that > > the mm_struct that called bch2_direct_write() is still alive and > > hasn't yet gone through exit_mmap() when it is accessed from > > bch2_dio_write_continue()? > > > > I don't know the async direct I/O codepath particularly well, so I > > cc'ed the uring maintainers, who probably know this better than me. > > I _think_ this is fine as-is, even if it does look dubious and bcachefs > arguably should grab an mm ref for this just for safety to avoid future > problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which > means that on the io_uring side it cannot do non-blocking issue of > requests. This is slower as it always punts to an io-wq thread, which > shares the same mm. Hence if the request is alive, there's always a > thread with the same mm alive as well. > > Now if FMODE_NOWAIT was set, then the original task could exit. I'd need > to dig a bit deeper to verify that would always be safe and there's not > a of time today with a few days off in the US looming, so I'll defer > that to next week. It certainly would be fine with an mm ref grabbed. Wouldn't delivery of completions be tied to an address space (not a process) like it is for aio? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 20:25 ` Kent Overstreet @ 2024-11-27 20:44 ` Jann Horn 2024-11-27 21:08 ` Kent Overstreet 0 siblings, 1 reply; 17+ messages in thread From: Jann Horn @ 2024-11-27 20:44 UTC (permalink / raw) To: Kent Overstreet Cc: Jens Axboe, linux-bcachefs, kernel list, Pavel Begunkov, io-uring On Wed, Nov 27, 2024 at 9:25 PM Kent Overstreet <[email protected]> wrote: > On Wed, Nov 27, 2024 at 11:09:14AM -0700, Jens Axboe wrote: > > On 11/27/24 9:57 AM, Jann Horn wrote: > > > Hi! > > > > > > In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer > > > to an mm_struct. This pointer is grabbed in bch2_direct_write() > > > (without any kind of refcount increment), and used in > > > bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm() > > > which are used to enable userspace memory access from kthread context. > > > I believe kthread_use_mm()/kthread_unuse_mm() require that the caller > > > guarantees that the MM hasn't gone through exit_mmap() yet (normally > > > by holding an mmget() reference). > > > > > > If we reach this codepath via io_uring, do we have a guarantee that > > > the mm_struct that called bch2_direct_write() is still alive and > > > hasn't yet gone through exit_mmap() when it is accessed from > > > bch2_dio_write_continue()? > > > > > > I don't know the async direct I/O codepath particularly well, so I > > > cc'ed the uring maintainers, who probably know this better than me. > > > > I _think_ this is fine as-is, even if it does look dubious and bcachefs > > arguably should grab an mm ref for this just for safety to avoid future > > problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which > > means that on the io_uring side it cannot do non-blocking issue of > > requests. This is slower as it always punts to an io-wq thread, which > > shares the same mm. Hence if the request is alive, there's always a > > thread with the same mm alive as well. > > > > Now if FMODE_NOWAIT was set, then the original task could exit. I'd need > > to dig a bit deeper to verify that would always be safe and there's not > > a of time today with a few days off in the US looming, so I'll defer > > that to next week. It certainly would be fine with an mm ref grabbed. > > Wouldn't delivery of completions be tied to an address space (not a > process) like it is for aio? An io_uring instance is primarily exposed to userspace as a file descriptor, so AFAIK it is possible for the io_uring instance to live beyond when the last mmput() happens. io_uring initially only holds an mmgrab() reference on the MM (a comment above that explains: "This is just grabbed for accounting purposes"), which I think is not enough to make it stable enough for kthread_use_mm(); I think in io_uring, only the worker threads actually keep the MM fully alive (and AFAIK the uring worker threads can exit before the uring instance itself is torn down). To receive io_uring completions, there are multiple ways, but they don't use a pointer from the io_uring instance to the MM to access userspace memory. Instead, you can have a VMA that points to the io_uring instance, created by calling mmap() on the io_uring fd; or alternatively, with IORING_SETUP_NO_MMAP, you can have io_uring grab references to userspace-provided pages. On top of that, I think it might currently be possible to use the io_uring file descriptor from another task to submit work. (That would probably be fairly nonsensical, but I think the kernel does not currently prevent it.) ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 20:44 ` Jann Horn @ 2024-11-27 21:08 ` Kent Overstreet 2024-11-27 21:16 ` Jens Axboe 0 siblings, 1 reply; 17+ messages in thread From: Kent Overstreet @ 2024-11-27 21:08 UTC (permalink / raw) To: Jann Horn Cc: Jens Axboe, linux-bcachefs, kernel list, Pavel Begunkov, io-uring On Wed, Nov 27, 2024 at 09:44:21PM +0100, Jann Horn wrote: > On Wed, Nov 27, 2024 at 9:25 PM Kent Overstreet > <[email protected]> wrote: > > On Wed, Nov 27, 2024 at 11:09:14AM -0700, Jens Axboe wrote: > > > On 11/27/24 9:57 AM, Jann Horn wrote: > > > > Hi! > > > > > > > > In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer > > > > to an mm_struct. This pointer is grabbed in bch2_direct_write() > > > > (without any kind of refcount increment), and used in > > > > bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm() > > > > which are used to enable userspace memory access from kthread context. > > > > I believe kthread_use_mm()/kthread_unuse_mm() require that the caller > > > > guarantees that the MM hasn't gone through exit_mmap() yet (normally > > > > by holding an mmget() reference). > > > > > > > > If we reach this codepath via io_uring, do we have a guarantee that > > > > the mm_struct that called bch2_direct_write() is still alive and > > > > hasn't yet gone through exit_mmap() when it is accessed from > > > > bch2_dio_write_continue()? > > > > > > > > I don't know the async direct I/O codepath particularly well, so I > > > > cc'ed the uring maintainers, who probably know this better than me. > > > > > > I _think_ this is fine as-is, even if it does look dubious and bcachefs > > > arguably should grab an mm ref for this just for safety to avoid future > > > problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which > > > means that on the io_uring side it cannot do non-blocking issue of > > > requests. This is slower as it always punts to an io-wq thread, which > > > shares the same mm. Hence if the request is alive, there's always a > > > thread with the same mm alive as well. > > > > > > Now if FMODE_NOWAIT was set, then the original task could exit. I'd need > > > to dig a bit deeper to verify that would always be safe and there's not > > > a of time today with a few days off in the US looming, so I'll defer > > > that to next week. It certainly would be fine with an mm ref grabbed. > > > > Wouldn't delivery of completions be tied to an address space (not a > > process) like it is for aio? > > An io_uring instance is primarily exposed to userspace as a file > descriptor, so AFAIK it is possible for the io_uring instance to live > beyond when the last mmput() happens. io_uring initially only holds an > mmgrab() reference on the MM (a comment above that explains: "This is > just grabbed for accounting purposes"), which I think is not enough to > make it stable enough for kthread_use_mm(); I think in io_uring, only > the worker threads actually keep the MM fully alive (and AFAIK the > uring worker threads can exit before the uring instance itself is torn > down). > > To receive io_uring completions, there are multiple ways, but they > don't use a pointer from the io_uring instance to the MM to access > userspace memory. Instead, you can have a VMA that points to the > io_uring instance, created by calling mmap() on the io_uring fd; or > alternatively, with IORING_SETUP_NO_MMAP, you can have io_uring grab > references to userspace-provided pages. > > On top of that, I think it might currently be possible to use the > io_uring file descriptor from another task to submit work. (That would > probably be fairly nonsensical, but I think the kernel does not > currently prevent it.) Ok, that's a wrinkle. Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A very cursory glance leads me to suspect "no", it seems like this is a bug if io_uring is allowed on bcachefs at all. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 21:08 ` Kent Overstreet @ 2024-11-27 21:16 ` Jens Axboe 2024-11-27 21:27 ` Kent Overstreet 2024-11-27 21:39 ` Jann Horn 0 siblings, 2 replies; 17+ messages in thread From: Jens Axboe @ 2024-11-27 21:16 UTC (permalink / raw) To: Kent Overstreet, Jann Horn Cc: linux-bcachefs, kernel list, Pavel Begunkov, io-uring On 11/27/24 2:08 PM, Kent Overstreet wrote: > On Wed, Nov 27, 2024 at 09:44:21PM +0100, Jann Horn wrote: >> On Wed, Nov 27, 2024 at 9:25?PM Kent Overstreet >> <[email protected]> wrote: >>> On Wed, Nov 27, 2024 at 11:09:14AM -0700, Jens Axboe wrote: >>>> On 11/27/24 9:57 AM, Jann Horn wrote: >>>>> Hi! >>>>> >>>>> In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer >>>>> to an mm_struct. This pointer is grabbed in bch2_direct_write() >>>>> (without any kind of refcount increment), and used in >>>>> bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm() >>>>> which are used to enable userspace memory access from kthread context. >>>>> I believe kthread_use_mm()/kthread_unuse_mm() require that the caller >>>>> guarantees that the MM hasn't gone through exit_mmap() yet (normally >>>>> by holding an mmget() reference). >>>>> >>>>> If we reach this codepath via io_uring, do we have a guarantee that >>>>> the mm_struct that called bch2_direct_write() is still alive and >>>>> hasn't yet gone through exit_mmap() when it is accessed from >>>>> bch2_dio_write_continue()? >>>>> >>>>> I don't know the async direct I/O codepath particularly well, so I >>>>> cc'ed the uring maintainers, who probably know this better than me. >>>> >>>> I _think_ this is fine as-is, even if it does look dubious and bcachefs >>>> arguably should grab an mm ref for this just for safety to avoid future >>>> problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which >>>> means that on the io_uring side it cannot do non-blocking issue of >>>> requests. This is slower as it always punts to an io-wq thread, which >>>> shares the same mm. Hence if the request is alive, there's always a >>>> thread with the same mm alive as well. >>>> >>>> Now if FMODE_NOWAIT was set, then the original task could exit. I'd need >>>> to dig a bit deeper to verify that would always be safe and there's not >>>> a of time today with a few days off in the US looming, so I'll defer >>>> that to next week. It certainly would be fine with an mm ref grabbed. >>> >>> Wouldn't delivery of completions be tied to an address space (not a >>> process) like it is for aio? >> >> An io_uring instance is primarily exposed to userspace as a file >> descriptor, so AFAIK it is possible for the io_uring instance to live >> beyond when the last mmput() happens. io_uring initially only holds an >> mmgrab() reference on the MM (a comment above that explains: "This is >> just grabbed for accounting purposes"), which I think is not enough to >> make it stable enough for kthread_use_mm(); I think in io_uring, only >> the worker threads actually keep the MM fully alive (and AFAIK the >> uring worker threads can exit before the uring instance itself is torn >> down). >> >> To receive io_uring completions, there are multiple ways, but they >> don't use a pointer from the io_uring instance to the MM to access >> userspace memory. Instead, you can have a VMA that points to the >> io_uring instance, created by calling mmap() on the io_uring fd; or >> alternatively, with IORING_SETUP_NO_MMAP, you can have io_uring grab >> references to userspace-provided pages. >> >> On top of that, I think it might currently be possible to use the >> io_uring file descriptor from another task to submit work. (That would >> probably be fairly nonsensical, but I think the kernel does not >> currently prevent it.) > > Ok, that's a wrinkle. I'd argue the fact that you are using an mm from a different process without grabbing a reference is the wrinkle. I just don't think it's a problem right now, but it could be... aio is tied to the mm because of how it does completions, potentially, and hence needs this exit_aio() hack because of that. aio also doesn't care, because it doesn't care about blocking - it'll happily block during issue. > Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A > very cursory glance leads me to suspect "no", it seems like this is a > bug if io_uring is allowed on bcachefs at all. I just looked at bcachefs dio writes, which look to be the only case of this. And yes, for writes, if FMODE_NOWAIT isn't set, then io-wq is always involved for the IO. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 21:16 ` Jens Axboe @ 2024-11-27 21:27 ` Kent Overstreet 2024-11-27 21:51 ` Jens Axboe 2024-11-27 21:39 ` Jann Horn 1 sibling, 1 reply; 17+ messages in thread From: Kent Overstreet @ 2024-11-27 21:27 UTC (permalink / raw) To: Jens Axboe Cc: Jann Horn, linux-bcachefs, kernel list, Pavel Begunkov, io-uring On Wed, Nov 27, 2024 at 02:16:24PM -0700, Jens Axboe wrote: > I'd argue the fact that you are using an mm from a different process > without grabbing a reference is the wrinkle. I just don't think it's a > problem right now, but it could be... aio is tied to the mm because of > how it does completions, potentially, and hence needs this exit_aio() > hack because of that. aio also doesn't care, because it doesn't care > about blocking - it'll happily block during issue. I'm not trying to debate who's bug it is, I'm just checking if I need to backport a security fix. > > Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A > > very cursory glance leads me to suspect "no", it seems like this is a > > bug if io_uring is allowed on bcachefs at all. > > I just looked at bcachefs dio writes, which look to be the only case of > this. And yes, for writes, if FMODE_NOWAIT isn't set, then io-wq is > always involved for the IO. Ok, sounds like we're in the clear. I already started writing the patch, so it'll just be a "now we can turn on FMODE_NOWAIT" instead of a bugfix. By the way, did the lifetime issue that was causing umount/remount to fail ever get resolved? I've currently got no test coverage for io_uring, would be nice to flip that back on. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 21:27 ` Kent Overstreet @ 2024-11-27 21:51 ` Jens Axboe 2024-11-27 21:58 ` Kent Overstreet 0 siblings, 1 reply; 17+ messages in thread From: Jens Axboe @ 2024-11-27 21:51 UTC (permalink / raw) To: Kent Overstreet Cc: Jann Horn, linux-bcachefs, kernel list, Pavel Begunkov, io-uring On Wed, Nov 27, 2024 at 2:27?PM Kent Overstreet <[email protected]> wrote: > > On Wed, Nov 27, 2024 at 02:16:24PM -0700, Jens Axboe wrote: > > I'd argue the fact that you are using an mm from a different process > > without grabbing a reference is the wrinkle. I just don't think it's a > > problem right now, but it could be... aio is tied to the mm because of > > how it does completions, potentially, and hence needs this exit_aio() > > hack because of that. aio also doesn't care, because it doesn't care > > about blocking - it'll happily block during issue. > > I'm not trying to debate who's bug it is, I'm just checking if I need to > backport a security fix. Not trying to place blame. > > > Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A > > > very cursory glance leads me to suspect "no", it seems like this is a > > > bug if io_uring is allowed on bcachefs at all. > > > > I just looked at bcachefs dio writes, which look to be the only case of > > this. And yes, for writes, if FMODE_NOWAIT isn't set, then io-wq is > > always involved for the IO. > > Ok, sounds like we're in the clear. I already started writing the > patch, so it'll just be a "now we can turn on FMODE_NOWAIT" instead of > a bugfix. That sounds good - and FMODE_NOWAIT will be a good addition. It'll make RWF_NOWAIT work, and things like io_uring will also work better as it won't need to needlessly punt to an io-wq worker to complete this IO. > By the way, did the lifetime issue that was causing umount/remount to > fail ever get resolved? I've currently got no test coverage for > io_uring, would be nice to flip that back on. Nope, I do have an updated branch since then, but it's still sitting waiting on getting a bit more love. I suspect it'll be done for 6.14. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 21:51 ` Jens Axboe @ 2024-11-27 21:58 ` Kent Overstreet 2024-11-27 21:59 ` Jens Axboe 0 siblings, 1 reply; 17+ messages in thread From: Kent Overstreet @ 2024-11-27 21:58 UTC (permalink / raw) To: Jens Axboe Cc: Jann Horn, linux-bcachefs, kernel list, Pavel Begunkov, io-uring On Wed, Nov 27, 2024 at 02:51:26PM -0700, Jens Axboe wrote: > On Wed, Nov 27, 2024 at 2:27?PM Kent Overstreet <[email protected]> wrote: > > > > On Wed, Nov 27, 2024 at 02:16:24PM -0700, Jens Axboe wrote: > > > I'd argue the fact that you are using an mm from a different process > > > without grabbing a reference is the wrinkle. I just don't think it's a > > > problem right now, but it could be... aio is tied to the mm because of > > > how it does completions, potentially, and hence needs this exit_aio() > > > hack because of that. aio also doesn't care, because it doesn't care > > > about blocking - it'll happily block during issue. > > > > I'm not trying to debate who's bug it is, I'm just checking if I need to > > backport a security fix. > > Not trying to place blame. > > > > > Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A > > > > very cursory glance leads me to suspect "no", it seems like this is a > > > > bug if io_uring is allowed on bcachefs at all. > > > > > > I just looked at bcachefs dio writes, which look to be the only case of > > > this. And yes, for writes, if FMODE_NOWAIT isn't set, then io-wq is > > > always involved for the IO. > > > > Ok, sounds like we're in the clear. I already started writing the > > patch, so it'll just be a "now we can turn on FMODE_NOWAIT" instead of > > a bugfix. > > That sounds good - and FMODE_NOWAIT will be a good addition. It'll make > RWF_NOWAIT work, and things like io_uring will also work better as it > won't need to needlessly punt to an io-wq worker to complete this IO. > > > By the way, did the lifetime issue that was causing umount/remount to > > fail ever get resolved? I've currently got no test coverage for > > io_uring, would be nice to flip that back on. > > Nope, I do have an updated branch since then, but it's still sitting > waiting on getting a bit more love. I suspect it'll be done for 6.14. Alright - if you want to ping me when that's ready, along with any other knobs I should flip on for io_uring support, I'll flip io_uring back on in my test infrastructure at that time. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 21:58 ` Kent Overstreet @ 2024-11-27 21:59 ` Jens Axboe 0 siblings, 0 replies; 17+ messages in thread From: Jens Axboe @ 2024-11-27 21:59 UTC (permalink / raw) To: Kent Overstreet Cc: Jann Horn, linux-bcachefs, kernel list, Pavel Begunkov, io-uring On 11/27/24 2:58 PM, Kent Overstreet wrote: >>> By the way, did the lifetime issue that was causing umount/remount to >>> fail ever get resolved? I've currently got no test coverage for >>> io_uring, would be nice to flip that back on. >> >> Nope, I do have an updated branch since then, but it's still sitting >> waiting on getting a bit more love. I suspect it'll be done for 6.14. > > Alright - if you want to ping me when that's ready, along with any other > knobs I should flip on for io_uring support, I'll flip io_uring back on > in my test infrastructure at that time. Will do, thanks. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 21:16 ` Jens Axboe 2024-11-27 21:27 ` Kent Overstreet @ 2024-11-27 21:39 ` Jann Horn 2024-11-27 21:52 ` Jens Axboe 1 sibling, 1 reply; 17+ messages in thread From: Jann Horn @ 2024-11-27 21:39 UTC (permalink / raw) To: Jens Axboe Cc: Kent Overstreet, linux-bcachefs, kernel list, Pavel Begunkov, io-uring On Wed, Nov 27, 2024 at 10:16 PM Jens Axboe <[email protected]> wrote: > On 11/27/24 2:08 PM, Kent Overstreet wrote: > > On Wed, Nov 27, 2024 at 09:44:21PM +0100, Jann Horn wrote: > >> On Wed, Nov 27, 2024 at 9:25?PM Kent Overstreet > >> <[email protected]> wrote: > >>> On Wed, Nov 27, 2024 at 11:09:14AM -0700, Jens Axboe wrote: > >>>> On 11/27/24 9:57 AM, Jann Horn wrote: > >>>>> Hi! > >>>>> > >>>>> In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer > >>>>> to an mm_struct. This pointer is grabbed in bch2_direct_write() > >>>>> (without any kind of refcount increment), and used in > >>>>> bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm() > >>>>> which are used to enable userspace memory access from kthread context. > >>>>> I believe kthread_use_mm()/kthread_unuse_mm() require that the caller > >>>>> guarantees that the MM hasn't gone through exit_mmap() yet (normally > >>>>> by holding an mmget() reference). > >>>>> > >>>>> If we reach this codepath via io_uring, do we have a guarantee that > >>>>> the mm_struct that called bch2_direct_write() is still alive and > >>>>> hasn't yet gone through exit_mmap() when it is accessed from > >>>>> bch2_dio_write_continue()? > >>>>> > >>>>> I don't know the async direct I/O codepath particularly well, so I > >>>>> cc'ed the uring maintainers, who probably know this better than me. > >>>> > >>>> I _think_ this is fine as-is, even if it does look dubious and bcachefs > >>>> arguably should grab an mm ref for this just for safety to avoid future > >>>> problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which > >>>> means that on the io_uring side it cannot do non-blocking issue of > >>>> requests. This is slower as it always punts to an io-wq thread, which > >>>> shares the same mm. Hence if the request is alive, there's always a > >>>> thread with the same mm alive as well. > >>>> > >>>> Now if FMODE_NOWAIT was set, then the original task could exit. I'd need > >>>> to dig a bit deeper to verify that would always be safe and there's not > >>>> a of time today with a few days off in the US looming, so I'll defer > >>>> that to next week. It certainly would be fine with an mm ref grabbed. > >>> > >>> Wouldn't delivery of completions be tied to an address space (not a > >>> process) like it is for aio? > >> > >> An io_uring instance is primarily exposed to userspace as a file > >> descriptor, so AFAIK it is possible for the io_uring instance to live > >> beyond when the last mmput() happens. io_uring initially only holds an > >> mmgrab() reference on the MM (a comment above that explains: "This is > >> just grabbed for accounting purposes"), which I think is not enough to > >> make it stable enough for kthread_use_mm(); I think in io_uring, only > >> the worker threads actually keep the MM fully alive (and AFAIK the > >> uring worker threads can exit before the uring instance itself is torn > >> down). > >> > >> To receive io_uring completions, there are multiple ways, but they > >> don't use a pointer from the io_uring instance to the MM to access > >> userspace memory. Instead, you can have a VMA that points to the > >> io_uring instance, created by calling mmap() on the io_uring fd; or > >> alternatively, with IORING_SETUP_NO_MMAP, you can have io_uring grab > >> references to userspace-provided pages. > >> > >> On top of that, I think it might currently be possible to use the > >> io_uring file descriptor from another task to submit work. (That would > >> probably be fairly nonsensical, but I think the kernel does not > >> currently prevent it.) > > > > Ok, that's a wrinkle. > > I'd argue the fact that you are using an mm from a different process > without grabbing a reference is the wrinkle. I just don't think it's a > problem right now, but it could be... aio is tied to the mm because of > how it does completions, potentially, and hence needs this exit_aio() > hack because of that. aio also doesn't care, because it doesn't care > about blocking - it'll happily block during issue. > > > Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A > > very cursory glance leads me to suspect "no", it seems like this is a > > bug if io_uring is allowed on bcachefs at all. > > I just looked at bcachefs dio writes, which look to be the only case of > this. And yes, for writes, if FMODE_NOWAIT isn't set, then io-wq is > always involved for the IO. I guess it could be an issue if the iocb can outlive the io-wq thread? Like, a userspace task creates a uring instance and starts a write; the write will be punted to a uring worker because of missing FMODE_NOWAIT; then the uring worker enters io_write() and starts a write on a kiocb. Can this write initiated from the worker be async? And could the uring worker exit (for example, because the userspace task exited) while the kiocb is still in flight? Anyway, I see Kent just posted a patch to make this more robust. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 21:39 ` Jann Horn @ 2024-11-27 21:52 ` Jens Axboe 2024-11-27 21:53 ` Jann Horn 0 siblings, 1 reply; 17+ messages in thread From: Jens Axboe @ 2024-11-27 21:52 UTC (permalink / raw) To: Jann Horn Cc: Kent Overstreet, linux-bcachefs, kernel list, Pavel Begunkov, io-uring > On Wed, Nov 27, 2024 at 10:16?PM Jens Axboe <[email protected]> wrote: > > On 11/27/24 2:08 PM, Kent Overstreet wrote: > > > On Wed, Nov 27, 2024 at 09:44:21PM +0100, Jann Horn wrote: > > >> On Wed, Nov 27, 2024 at 9:25?PM Kent Overstreet > > >> <[email protected]> wrote: > > >>> On Wed, Nov 27, 2024 at 11:09:14AM -0700, Jens Axboe wrote: > > >>>> On 11/27/24 9:57 AM, Jann Horn wrote: > > >>>>> Hi! > > >>>>> > > >>>>> In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer > > >>>>> to an mm_struct. This pointer is grabbed in bch2_direct_write() > > >>>>> (without any kind of refcount increment), and used in > > >>>>> bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm() > > >>>>> which are used to enable userspace memory access from kthread context. > > >>>>> I believe kthread_use_mm()/kthread_unuse_mm() require that the caller > > >>>>> guarantees that the MM hasn't gone through exit_mmap() yet (normally > > >>>>> by holding an mmget() reference). > > >>>>> > > >>>>> If we reach this codepath via io_uring, do we have a guarantee that > > >>>>> the mm_struct that called bch2_direct_write() is still alive and > > >>>>> hasn't yet gone through exit_mmap() when it is accessed from > > >>>>> bch2_dio_write_continue()? > > >>>>> > > >>>>> I don't know the async direct I/O codepath particularly well, so I > > >>>>> cc'ed the uring maintainers, who probably know this better than me. > > >>>> > > >>>> I _think_ this is fine as-is, even if it does look dubious and bcachefs > > >>>> arguably should grab an mm ref for this just for safety to avoid future > > >>>> problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which > > >>>> means that on the io_uring side it cannot do non-blocking issue of > > >>>> requests. This is slower as it always punts to an io-wq thread, which > > >>>> shares the same mm. Hence if the request is alive, there's always a > > >>>> thread with the same mm alive as well. > > >>>> > > >>>> Now if FMODE_NOWAIT was set, then the original task could exit. I'd need > > >>>> to dig a bit deeper to verify that would always be safe and there's not > > >>>> a of time today with a few days off in the US looming, so I'll defer > > >>>> that to next week. It certainly would be fine with an mm ref grabbed. > > >>> > > >>> Wouldn't delivery of completions be tied to an address space (not a > > >>> process) like it is for aio? > > >> > > >> An io_uring instance is primarily exposed to userspace as a file > > >> descriptor, so AFAIK it is possible for the io_uring instance to live > > >> beyond when the last mmput() happens. io_uring initially only holds an > > >> mmgrab() reference on the MM (a comment above that explains: "This is > > >> just grabbed for accounting purposes"), which I think is not enough to > > >> make it stable enough for kthread_use_mm(); I think in io_uring, only > > >> the worker threads actually keep the MM fully alive (and AFAIK the > > >> uring worker threads can exit before the uring instance itself is torn > > >> down). > > >> > > >> To receive io_uring completions, there are multiple ways, but they > > >> don't use a pointer from the io_uring instance to the MM to access > > >> userspace memory. Instead, you can have a VMA that points to the > > >> io_uring instance, created by calling mmap() on the io_uring fd; or > > >> alternatively, with IORING_SETUP_NO_MMAP, you can have io_uring grab > > >> references to userspace-provided pages. > > >> > > >> On top of that, I think it might currently be possible to use the > > >> io_uring file descriptor from another task to submit work. (That would > > >> probably be fairly nonsensical, but I think the kernel does not > > >> currently prevent it.) > > > > > > Ok, that's a wrinkle. > > > > I'd argue the fact that you are using an mm from a different process > > without grabbing a reference is the wrinkle. I just don't think it's a > > problem right now, but it could be... aio is tied to the mm because of > > how it does completions, potentially, and hence needs this exit_aio() > > hack because of that. aio also doesn't care, because it doesn't care > > about blocking - it'll happily block during issue. > > > > > Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A > > > very cursory glance leads me to suspect "no", it seems like this is a > > > bug if io_uring is allowed on bcachefs at all. > > > > I just looked at bcachefs dio writes, which look to be the only case of > > this. And yes, for writes, if FMODE_NOWAIT isn't set, then io-wq is > > always involved for the IO. > > I guess it could be an issue if the iocb can outlive the io-wq thread? > Like, a userspace task creates a uring instance and starts a write; > the write will be punted to a uring worker because of missing > FMODE_NOWAIT; then the uring worker enters io_write() and starts a > write on a kiocb. Can this write initiated from the worker be async? > And could the uring worker exit (for example, because the userspace > task exited) while the kiocb is still in flight? No, any write (or read, whatever) from an io-wq worker is always blocking / sync. That's the main reason for them existing, to be able to do blocking issue. And that's what they always do. -- Jens Axboe ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 21:52 ` Jens Axboe @ 2024-11-27 21:53 ` Jann Horn 0 siblings, 0 replies; 17+ messages in thread From: Jann Horn @ 2024-11-27 21:53 UTC (permalink / raw) To: Jens Axboe Cc: Kent Overstreet, linux-bcachefs, kernel list, Pavel Begunkov, io-uring On Wed, Nov 27, 2024 at 10:52 PM Jens Axboe <[email protected]> wrote: > > On Wed, Nov 27, 2024 at 10:16?PM Jens Axboe <[email protected]> wrote: > > > On 11/27/24 2:08 PM, Kent Overstreet wrote: > > > > On Wed, Nov 27, 2024 at 09:44:21PM +0100, Jann Horn wrote: > > > >> On Wed, Nov 27, 2024 at 9:25?PM Kent Overstreet > > > >> <[email protected]> wrote: > > > >>> On Wed, Nov 27, 2024 at 11:09:14AM -0700, Jens Axboe wrote: > > > >>>> On 11/27/24 9:57 AM, Jann Horn wrote: > > > >>>>> Hi! > > > >>>>> > > > >>>>> In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer > > > >>>>> to an mm_struct. This pointer is grabbed in bch2_direct_write() > > > >>>>> (without any kind of refcount increment), and used in > > > >>>>> bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm() > > > >>>>> which are used to enable userspace memory access from kthread context. > > > >>>>> I believe kthread_use_mm()/kthread_unuse_mm() require that the caller > > > >>>>> guarantees that the MM hasn't gone through exit_mmap() yet (normally > > > >>>>> by holding an mmget() reference). > > > >>>>> > > > >>>>> If we reach this codepath via io_uring, do we have a guarantee that > > > >>>>> the mm_struct that called bch2_direct_write() is still alive and > > > >>>>> hasn't yet gone through exit_mmap() when it is accessed from > > > >>>>> bch2_dio_write_continue()? > > > >>>>> > > > >>>>> I don't know the async direct I/O codepath particularly well, so I > > > >>>>> cc'ed the uring maintainers, who probably know this better than me. > > > >>>> > > > >>>> I _think_ this is fine as-is, even if it does look dubious and bcachefs > > > >>>> arguably should grab an mm ref for this just for safety to avoid future > > > >>>> problems. The reason is that bcachefs doesn't set FMODE_NOWAIT, which > > > >>>> means that on the io_uring side it cannot do non-blocking issue of > > > >>>> requests. This is slower as it always punts to an io-wq thread, which > > > >>>> shares the same mm. Hence if the request is alive, there's always a > > > >>>> thread with the same mm alive as well. > > > >>>> > > > >>>> Now if FMODE_NOWAIT was set, then the original task could exit. I'd need > > > >>>> to dig a bit deeper to verify that would always be safe and there's not > > > >>>> a of time today with a few days off in the US looming, so I'll defer > > > >>>> that to next week. It certainly would be fine with an mm ref grabbed. > > > >>> > > > >>> Wouldn't delivery of completions be tied to an address space (not a > > > >>> process) like it is for aio? > > > >> > > > >> An io_uring instance is primarily exposed to userspace as a file > > > >> descriptor, so AFAIK it is possible for the io_uring instance to live > > > >> beyond when the last mmput() happens. io_uring initially only holds an > > > >> mmgrab() reference on the MM (a comment above that explains: "This is > > > >> just grabbed for accounting purposes"), which I think is not enough to > > > >> make it stable enough for kthread_use_mm(); I think in io_uring, only > > > >> the worker threads actually keep the MM fully alive (and AFAIK the > > > >> uring worker threads can exit before the uring instance itself is torn > > > >> down). > > > >> > > > >> To receive io_uring completions, there are multiple ways, but they > > > >> don't use a pointer from the io_uring instance to the MM to access > > > >> userspace memory. Instead, you can have a VMA that points to the > > > >> io_uring instance, created by calling mmap() on the io_uring fd; or > > > >> alternatively, with IORING_SETUP_NO_MMAP, you can have io_uring grab > > > >> references to userspace-provided pages. > > > >> > > > >> On top of that, I think it might currently be possible to use the > > > >> io_uring file descriptor from another task to submit work. (That would > > > >> probably be fairly nonsensical, but I think the kernel does not > > > >> currently prevent it.) > > > > > > > > Ok, that's a wrinkle. > > > > > > I'd argue the fact that you are using an mm from a different process > > > without grabbing a reference is the wrinkle. I just don't think it's a > > > problem right now, but it could be... aio is tied to the mm because of > > > how it does completions, potentially, and hence needs this exit_aio() > > > hack because of that. aio also doesn't care, because it doesn't care > > > about blocking - it'll happily block during issue. > > > > > > > Jens, is it really FMODE_NOWAIT that controls whether we can hit this? A > > > > very cursory glance leads me to suspect "no", it seems like this is a > > > > bug if io_uring is allowed on bcachefs at all. > > > > > > I just looked at bcachefs dio writes, which look to be the only case of > > > this. And yes, for writes, if FMODE_NOWAIT isn't set, then io-wq is > > > always involved for the IO. > > > > I guess it could be an issue if the iocb can outlive the io-wq thread? > > Like, a userspace task creates a uring instance and starts a write; > > the write will be punted to a uring worker because of missing > > FMODE_NOWAIT; then the uring worker enters io_write() and starts a > > write on a kiocb. Can this write initiated from the worker be async? > > And could the uring worker exit (for example, because the userspace > > task exited) while the kiocb is still in flight? > > No, any write (or read, whatever) from an io-wq worker is always > blocking / sync. That's the main reason for them existing, to be able to > do blocking issue. And that's what they always do. Aaah, ok, thanks for the explanation. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: bcachefs: suspicious mm pointer in struct dio_write 2024-11-27 16:57 bcachefs: suspicious mm pointer in struct dio_write Jann Horn 2024-11-27 18:09 ` Jens Axboe @ 2024-11-27 20:23 ` Kent Overstreet 1 sibling, 0 replies; 17+ messages in thread From: Kent Overstreet @ 2024-11-27 20:23 UTC (permalink / raw) To: Jann Horn Cc: linux-bcachefs, kernel list, Jens Axboe, Pavel Begunkov, io-uring On Wed, Nov 27, 2024 at 05:57:03PM +0100, Jann Horn wrote: > Hi! > > In fs/bcachefs/fs-io-direct.c, "struct dio_write" contains a pointer > to an mm_struct. This pointer is grabbed in bch2_direct_write() > (without any kind of refcount increment), and used in > bch2_dio_write_continue() for kthread_use_mm()/kthread_unuse_mm() > which are used to enable userspace memory access from kthread context. > I believe kthread_use_mm()/kthread_unuse_mm() require that the caller > guarantees that the MM hasn't gone through exit_mmap() yet (normally > by holding an mmget() reference). > > If we reach this codepath via io_uring, do we have a guarantee that > the mm_struct that called bch2_direct_write() is still alive and > hasn't yet gone through exit_mmap() when it is accessed from > bch2_dio_write_continue()? > > I don't know the async direct I/O codepath particularly well, so I > cc'ed the uring maintainers, who probably know this better than me. I don't know about io_uring but aio guarantees that outstanding kiocbs are completed before exiting the mm_struct - I would expect some sort of similar guarantee to hold, because otherwise where are we going to deliver the completion? ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2024-11-27 21:59 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-11-27 16:57 bcachefs: suspicious mm pointer in struct dio_write Jann Horn 2024-11-27 18:09 ` Jens Axboe 2024-11-27 19:43 ` Jann Horn 2024-11-27 20:01 ` Jens Axboe 2024-11-27 20:31 ` Kent Overstreet 2024-11-27 20:25 ` Kent Overstreet 2024-11-27 20:44 ` Jann Horn 2024-11-27 21:08 ` Kent Overstreet 2024-11-27 21:16 ` Jens Axboe 2024-11-27 21:27 ` Kent Overstreet 2024-11-27 21:51 ` Jens Axboe 2024-11-27 21:58 ` Kent Overstreet 2024-11-27 21:59 ` Jens Axboe 2024-11-27 21:39 ` Jann Horn 2024-11-27 21:52 ` Jens Axboe 2024-11-27 21:53 ` Jann Horn 2024-11-27 20:23 ` Kent Overstreet
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox