From: Daniel Vetter <[email protected]>
To: Christian Brauner <[email protected]>
Cc: Daniel Vetter <[email protected]>,
Linus Torvalds <[email protected]>,
Al Viro <[email protected]>,
[email protected], [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected],
[email protected], [email protected],
[email protected],
[email protected],
[email protected]
Subject: Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes
Date: Tue, 7 May 2024 12:58:14 +0200 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <20240506-zweisamkeit-zinsniveau-615a2e6d7c67@brauner>
On Mon, May 06, 2024 at 04:46:54PM +0200, Christian Brauner wrote:
> On Mon, May 06, 2024 at 02:47:23PM +0200, Daniel Vetter wrote:
> > On Sun, May 05, 2024 at 01:53:48PM -0700, Linus Torvalds wrote:
> > > On Sun, 5 May 2024 at 13:30, Al Viro <[email protected]> wrote:
> > > >
> > > > 0. special-cased ->f_count rule for ->poll() is a wart and it's
> > > > better to get rid of it.
> > > >
> > > > 1. fs/eventpoll.c is a steaming pile of shit and I'd be glad to see
> > > > git rm taken to it. Short of that, by all means, let's grab reference
> > > > in there around the call of vfs_poll() (see (0)).
> > >
> > > Agreed on 0/1.
> > >
> > > > 2. having ->poll() instances grab extra references to file passed
> > > > to them is not something that should be encouraged; there's a plenty
> > > > of potential problems, and "caller has it pinned, so we are fine with
> > > > grabbing extra refs" is nowhere near enough to eliminate those.
> > >
> > > So it's not clear why you hate it so much, since those extra
> > > references are totally normal in all the other VFS paths.
> > >
> > > I mean, they are perhaps not the *common* case, but we have a lot of
> > > random get_file() calls sprinkled around in various places when you
> > > end up passing a file descriptor off to some asynchronous operation
> > > thing.
> > >
> > > Yeah, I think most of them tend to be special operations (eg the tty
> > > TIOCCONS ioctl to redirect the console), but it's not like vfs_ioctl()
> > > is *that* different from vfs_poll. Different operation, not somehow
> > > "one is more special than the other".
> > >
> > > cachefiles and backing-file does it for regular IO, and drop it at IO
> > > completion - not that different from what dma-buf does. It's in
> > > ->read_iter() rather than ->poll(), but again: different operations,
> > > but not "one of them is somehow fundamentally different".
> > >
> > > > 3. dma-buf uses of get_file() are probably safe (epoll shite aside),
> > > > but they do look fishy. That has nothing to do with epoll.
> > >
> > > Now, what dma-buf basically seems to do is to avoid ref-counting its
> > > own fundamental data structure, and replaces that by refcounting the
> > > 'struct file' that *points* to it instead.
> > >
> > > And it is a bit odd, but it actually makes some amount of sense,
> > > because then what it passes around is that file pointer (and it allows
> > > passing it around from user space *as* that file).
> > >
> > > And honestly, if you look at why it then needs to add its refcount to
> > > it all, it actually makes sense. dma-bufs have this notion of
> > > "fences" that are basically completion points for the asynchronous
> > > DMA. Doing a "poll()" operation will add a note to the fence to get
> > > that wakeup when it's done.
> > >
> > > And yes, logically it takes a ref to the "struct dma_buf", but because
> > > of how the lifetime of the dma_buf is associated with the lifetime of
> > > the 'struct file', that then turns into taking a ref on the file.
> > >
> > > Unusual? Yes. But not illogical. Not obviously broken. Tying the
> > > lifetime of the dma_buf to the lifetime of a file that is passed along
> > > makes _sense_ for that use.
> > >
> > > I'm sure dma-bufs could add another level of refcounting on the
> > > 'struct dma_buf' itself, and not make it be 1:1 with the file, but
> > > it's not clear to me what the advantage would really be, or why it
> > > would be wrong to re-use a refcount that is already there.
> >
> > So there is generally another refcount, because dma_buf is just the
> > cross-driver interface to some kind of real underlying buffer object from
> > the various graphics related subsystems we have.
> >
> > And since it's a pure file based api thing that ceases to serve any
> > function once the fd/file is gone we tied all the dma_buf refcounting to
> > the refcount struct file already maintains. But the underlying buffer
> > object can easily outlive the dma_buf, and over the lifetime of an
> > underlying buffer object you might actually end up creating different
> > dma_buf api wrappers for it (but at least in drm we guarantee there's at
> > most one, hence why vmwgfx does the atomic_inc_unless_zero trick, which I
> > don't particularly like and isn't really needed).
> >
> > But we could add another refcount, it just means we have 3 of those then
> > when only really 2 are needed.
>
> Fwiw, the TTM thing described upthread and in the other thread really
> tries hard to work around the dma_buf == file lifetime choice by hooking
> into the dma-buf specific release function so it can access the dmabuf
> and then the file. All that seems like a pretty error prone thing to me.
> So a separate refcount for dma_buf wouldn't be the worst as that would
> allow that TTM thing to benefit and remove that nasty hacking into your
> generic dma_buf ops. But maybe I'm the only one who sees it that way and
> I'm certainly not familiar enough with dma-buf.
So the tricky part is the uniqueness requirement drm has for buffer
objects (and hence dma_buf wrappers), which together with the refcounting
makes dma_buf quite tricky:
- dma_buf needs to hold some reference onto the underlying object, or it
wont work
- but you're not allowed to just create a new dma_buf every time someone
exports an underlying object to a dma_buf, because that would break the
uniqueness requirement. Which means the underlying object must also hold
some kind of reference to its dma_buf, if it exists. So that on buffer
export it can just increment the refcount for that and return it,
instead of creating a new one.
Which would be a reference loop that never gets freed, so you need one of
two tricks:
- Either a weak reference, i.e. just a pointer plus
atomic_inc_unless_zero trickery like ttm does. Splitting that refcount
into more refcounts doesn't fundamentally solve the problem, it just
adds even more refcounts.
- Or you do what all other drm drivers do in drm_prime.c do and careful
clean up the dma_buf re-export cache when the userspace references (but
not all kernel internal ones) disappear, to unbreak that reference loop.
This needs to be done with extreme care and took a lot of screaming to
get right, because if you have a race you might end up breaking the
uniqueness requirement and have two dma_buf floating around.
So neither of these solutions really are simple, but I agree with you that
the atomic_inc_unless_zero trickery is less simple. It's definitely not
cool that it's done by digging around in struct file internals.
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
next prev parent reply other threads:[~2024-05-07 10:58 UTC|newest]
Thread overview: 87+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-04-08 8:26 [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove syzbot
2024-04-15 14:31 ` Jens Axboe
2024-04-15 14:57 ` Pavel Begunkov
2024-05-03 11:54 ` Bui Quang Minh
2024-05-03 18:26 ` get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove) Kees Cook
2024-05-03 18:49 ` Jens Axboe
2024-05-03 19:22 ` Kees Cook
2024-05-03 19:35 ` Jens Axboe
2024-05-03 19:59 ` Kees Cook
2024-05-03 20:28 ` Kees Cook
2024-05-03 21:11 ` Al Viro
2024-05-03 21:24 ` Linus Torvalds
2024-05-03 21:30 ` Al Viro
2024-05-06 17:46 ` Stefan Metzmacher
2024-05-06 18:17 ` Linus Torvalds
2024-05-08 8:47 ` David Laight
2024-05-03 21:36 ` Al Viro
2024-05-03 21:42 ` Linus Torvalds
2024-05-03 21:53 ` Al Viro
2024-05-06 12:23 ` Daniel Vetter
2024-05-04 9:59 ` Christian Brauner
2024-05-03 21:11 ` [PATCH] epoll: try to be a _bit_ better about file lifetimes Linus Torvalds
2024-05-03 21:24 ` Al Viro
2024-05-03 21:33 ` Linus Torvalds
2024-05-03 21:45 ` Al Viro
2024-05-03 21:52 ` Linus Torvalds
2024-05-03 22:01 ` Al Viro
2024-05-03 22:07 ` Al Viro
2024-05-03 23:16 ` Linus Torvalds
2024-05-03 23:39 ` Al Viro
2024-05-03 23:54 ` Linus Torvalds
2024-05-04 10:44 ` Christian Brauner
2024-05-03 22:46 ` Kees Cook
2024-05-03 23:03 ` Al Viro
2024-05-03 23:23 ` Kees Cook
2024-05-03 23:41 ` Linus Torvalds
2024-05-04 9:19 ` Christian Brauner
2024-05-06 12:37 ` Daniel Vetter
2024-05-04 9:37 ` Christian Brauner
2024-05-04 15:32 ` Linus Torvalds
2024-05-04 15:40 ` Linus Torvalds
2024-05-04 15:53 ` Linus Torvalds
2024-05-05 19:46 ` Al Viro
2024-05-05 20:03 ` Linus Torvalds
2024-05-05 20:30 ` Al Viro
2024-05-05 20:53 ` Linus Torvalds
2024-05-06 12:47 ` Daniel Vetter
2024-05-06 14:46 ` Christian Brauner
2024-05-07 10:58 ` Daniel Vetter [this message]
2024-05-06 16:15 ` Christian König
2024-05-05 10:50 ` Christian Brauner
2024-05-05 16:46 ` Linus Torvalds
2024-05-05 17:55 ` [PATCH v2] epoll: be " Linus Torvalds
2024-05-05 18:04 ` Jens Axboe
2024-05-05 20:01 ` David Laight
2024-05-05 20:16 ` Linus Torvalds
2024-05-05 20:12 ` [PATCH] epoll: try to be a _bit_ " Al Viro
2024-05-06 8:45 ` Christian Brauner
2024-05-06 9:26 ` Christian Brauner
2024-05-06 14:19 ` Christian Brauner
2024-05-07 21:02 ` David Laight
2024-05-04 18:20 ` Linus Torvalds
2024-05-06 14:29 ` [Linaro-mm-sig] " Christian König
2024-05-07 11:02 ` Daniel Vetter
2024-05-07 16:46 ` Linus Torvalds
2024-05-07 17:45 ` Christian König
2024-05-08 7:51 ` Michel Dänzer
2024-05-08 7:59 ` Christian König
2024-05-08 8:23 ` Christian Brauner
2024-05-08 9:10 ` Christian König
2024-05-07 18:04 ` Daniel Vetter
2024-05-07 19:07 ` Linus Torvalds
2024-05-08 5:55 ` Christian König
2024-05-08 8:32 ` Daniel Vetter
2024-05-08 10:16 ` Christian Brauner
2024-05-08 8:05 ` Christian Brauner
2024-05-08 16:19 ` Linus Torvalds
2024-05-08 17:14 ` Linus Torvalds
2024-05-09 11:38 ` Christian Brauner
2024-05-09 15:48 ` Linus Torvalds
2024-05-10 6:33 ` Christian Brauner
2024-05-08 10:08 ` Christian Brauner
2024-05-08 15:45 ` Daniel Vetter
2024-05-10 10:55 ` Christian Brauner
2024-05-11 18:25 ` David Laight
2024-05-05 17:31 ` Jens Axboe
2024-05-04 9:45 ` get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove) Christian Brauner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox