From: Linus Torvalds <[email protected]>
To: Stefan Metzmacher <[email protected]>, Jens Axboe <[email protected]>
Cc: linux-fsdevel <[email protected]>,
Linux API Mailing List <[email protected]>,
io-uring <[email protected]>,
"[email protected]" <[email protected]>,
Al Viro <[email protected]>,
Samba Technical <[email protected]>
Subject: Re: copy on write for splice() from file to pipe?
Date: Thu, 9 Feb 2023 08:41:02 -0800 [thread overview]
Message-ID: <CAHk-=wj8rthcQ9gQbvkMzeFt0iymq+CuOzmidx3Pm29Lg+W0gg@mail.gmail.com> (raw)
In-Reply-To: <[email protected]>
Adding Jens, because he's one of the main splice people. You do seem
to be stepping on his work ;)
Jens, see
https://lore.kernel.org/lkml/[email protected]
On Thu, Feb 9, 2023 at 5:56 AM Stefan Metzmacher <[email protected]> wrote:
>
> So we have two cases:
>
> 1. network -> socket -> splice -> pipe -> splice -> file -> storage
>
> 2. storage -> file -> splice -> pipe -> splice -> socket -> network
>
> With 1. I guess everything can work reliable [..]
>
> But with 2. there's a problem, as the pages from the file,
> which are spliced into the pipe are still shared without
> copy on write with the file(system).
Well, honestly, that's really the whole point of splice. It was
designed to be a way to share the storage data without having to go
through a copy.
> I'm wondering if there's a possible way out of this, maybe triggered by a new
> flag passed to splice.
Not really.
So basically, you cannot do "copy on write" on a page cache page,
because that breaks sharing.
You *want* the sharing to break, but that's because you're violating
what splice() was for, but think about all the cases where somebody is
just using mmap() and expects to see the file changes.
You also aren't thinking of the case where the page is already mapped
writably, and user processes may be changing the data at any time.
> I looked through the code and noticed the existence of IOMAP_F_SHARED.
Yeah, no. That's a hacky filesystem thing. It's not even a flag in
anything core like 'struct page', it's just entirely internal to the
filesystem itself.
> Is there any other way we could archive something like this?
I suspect you simply want to copy it at splice time, rather than push
the page itself into the pipe as we do in copy_page_to_iter_pipe().
Because the whole point of zero-copy really is that zero copy. And the
whole point of splice() was to *not* complicate the rest of the system
over-much, while allowing special cases.
Linux is not the heap of bad ideas that is Hurd that does various
versioning etc, and that made copy-on-write a first-class citizen
because it uses the concept of "immutable mapped data" for reads and
writes.
Now, I do see a couple of possible alternatives to "just create a stable copy".
For example, we very much have the notion of "confirm buffer data
before copying". It's used for things like "I started the IO on the
page, but the IO failed with an error, so even though I gave you a
splice buffer, it turns out you can't use it".
And I do wonder if we could introduce a notion of "optimistic splice",
where the splice works exactly the way it does now (you get a page
reference), but the "confirm" phase could check whether something has
changed in that mapping (using the file versioning or whatever - I'm
hand-waving) and simply fail the confirm.
That would mean that the "splice to socket" part would fail in your
chain, and you'd have to re-try it. But then the onus would be on
*you* as a splicer, not on the rest of the system to fix up your
special case.
That idea sounds fairly far out there, and complicated and maybe not
usable. So I'm just throwing it out as a "let's try to think of
alternative solutions".
Anybody?
Linus
next prev parent reply other threads:[~2023-02-09 16:41 UTC|newest]
Thread overview: 64+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-02-09 13:55 copy on write for splice() from file to pipe? Stefan Metzmacher
2023-02-09 14:11 ` Matthew Wilcox
2023-02-09 14:29 ` Stefan Metzmacher
2023-02-09 16:41 ` Linus Torvalds [this message]
2023-02-09 19:17 ` Stefan Metzmacher
2023-02-09 19:36 ` Linus Torvalds
2023-02-09 19:48 ` Linus Torvalds
2023-02-09 20:33 ` Jeremy Allison
2023-02-10 20:45 ` Stefan Metzmacher
2023-02-10 20:51 ` Linus Torvalds
2023-02-10 2:16 ` Dave Chinner
2023-02-10 4:06 ` Dave Chinner
2023-02-10 4:44 ` Matthew Wilcox
2023-02-10 6:57 ` Dave Chinner
2023-02-10 15:14 ` Andy Lutomirski
2023-02-10 16:33 ` Linus Torvalds
2023-02-10 17:57 ` Andy Lutomirski
2023-02-10 18:19 ` Jeremy Allison
2023-02-10 19:29 ` Stefan Metzmacher
2023-02-10 18:37 ` Linus Torvalds
2023-02-10 19:01 ` Andy Lutomirski
2023-02-10 19:18 ` Linus Torvalds
2023-02-10 19:27 ` Jeremy Allison
2023-02-10 19:42 ` Stefan Metzmacher
2023-02-10 19:42 ` Linus Torvalds
2023-02-10 19:54 ` Stefan Metzmacher
2023-02-10 19:29 ` Linus Torvalds
2023-02-13 9:07 ` Herbert Xu
2023-02-10 19:55 ` Andy Lutomirski
2023-02-10 20:27 ` Linus Torvalds
2023-02-10 20:32 ` Jens Axboe
2023-02-10 20:36 ` Linus Torvalds
2023-02-10 20:39 ` Jens Axboe
2023-02-10 20:44 ` Linus Torvalds
2023-02-10 20:50 ` Jens Axboe
2023-02-10 21:14 ` Andy Lutomirski
2023-02-10 21:27 ` Jens Axboe
2023-02-10 21:51 ` Jens Axboe
2023-02-10 22:08 ` Linus Torvalds
2023-02-10 22:16 ` Jens Axboe
2023-02-10 22:17 ` Linus Torvalds
2023-02-10 22:25 ` Jens Axboe
2023-02-10 22:35 ` Linus Torvalds
2023-02-10 22:51 ` Jens Axboe
2023-02-11 3:18 ` Ming Lei
2023-02-11 6:17 ` Ming Lei
2023-02-11 14:13 ` Jens Axboe
2023-02-11 15:05 ` Ming Lei
2023-02-11 15:33 ` Jens Axboe
2023-02-11 18:57 ` Linus Torvalds
2023-02-12 2:46 ` Jens Axboe
2023-02-10 4:47 ` Linus Torvalds
2023-02-10 6:19 ` Dave Chinner
2023-02-10 17:23 ` Linus Torvalds
2023-02-10 17:47 ` Linus Torvalds
2023-02-13 9:28 ` Herbert Xu
2023-02-10 22:41 ` David Laight
2023-02-10 22:51 ` Jens Axboe
2023-02-13 9:30 ` Herbert Xu
2023-02-13 9:25 ` Herbert Xu
2023-02-13 18:01 ` Andy Lutomirski
2023-02-14 1:22 ` Herbert Xu
2023-02-17 23:13 ` Andy Lutomirski
2023-02-20 4:54 ` Herbert Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAHk-=wj8rthcQ9gQbvkMzeFt0iymq+CuOzmidx3Pm29Lg+W0gg@mail.gmail.com' \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox