From: Andy Lutomirski <[email protected]>
To: Linus Torvalds <[email protected]>
Cc: Andy Lutomirski <[email protected]>,
Dave Chinner <[email protected]>,
Matthew Wilcox <[email protected]>,
Stefan Metzmacher <[email protected]>, Jens Axboe <[email protected]>,
linux-fsdevel <[email protected]>,
Linux API Mailing List <[email protected]>,
io-uring <[email protected]>,
"[email protected]" <[email protected]>,
Al Viro <[email protected]>,
Samba Technical <[email protected]>
Subject: Re: copy on write for splice() from file to pipe?
Date: Fri, 10 Feb 2023 09:57:20 -0800
Message-ID: <CALCETrU-9Wcb_zCsVWr24V=uCA0+c6x359UkJBOBgkbq+UHAMA@mail.gmail.com>
In-Reply-To: <CAHk-=wj66F6CdJUAAjqigXMBy7gHquFMzPNAwKCgkrb2mF6U7w@mail.gmail.com>
On Fri, Feb 10, 2023 at 8:34 AM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, Feb 10, 2023 at 7:15 AM Andy Lutomirski <[email protected]> wrote:
> >
> > Frankly, I really don't like having non-immutable data in a pipe.
>
> That statement is completely nonsensical.
I know what splice() is. I'm trying to make the point that it may not
be the right API for most (all?) of its use cases, that we can maybe
do better, and that we should maybe even consider deprecating (and
simplifying, at the cost of some performance) splice in the moderately
near future. And I think I agree with you on most of what you're saying.
> It was literally designed to be "look, we want zero-copy networking,
> and we could do 'sendfile()' by mmap'ing the file, but mmap - and
> particularly munmap - is too expensive, so we map things into kernel
> buffers instead".
Indeed. mmap() + send() + munmap() is extraordinarily expensive
and is not the right solution to zero-copy networking.
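(To be concrete about the pattern I mean -- a minimal sketch, error
and short-write handling omitted, helper names made up: the
per-request mmap()/munmap(), with its page-table setup and TLB
shootdown, is where the cost lives, and avoiding exactly that is what
sendfile() was invented for.)

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/sendfile.h>
#include <unistd.h>

/* Hypothetical helper: send the first 'len' bytes of 'fd' to 'sock'. */
static ssize_t send_file_mmap(int sock, int fd, size_t len)
{
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return -1;
        /* One pass over the mapping into the socket; the mapping setup
         * and the munmap() TLB shootdown are the expensive part. */
        ssize_t n = send(sock, p, len, 0);
        munmap(p, len);
        return n;
}

/* Same thing without ever mapping the file into userspace. */
static ssize_t send_file_sendfile(int sock, int fd, size_t len)
{
        off_t off = 0;
        return sendfile(sock, fd, &off, len);
}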
>
> So saying "I really don't like having non-immutable data in a pipe" is
> complete nonsense. It's syntactically correct English, but it makes no
> conceptual sense.
>
> You can say "I don't like 'splice()'". That's fine. I used to think
> splice was a really cool concept, but I kind of hate it these days.
> Not liking splice() makes a ton of sense.
>
> But given splice, saying "I don't like non-immutable data" really is
> complete nonsense.
I am saying exactly what I meant. Obviously mutable data exists. I'm
saying that *putting it in a pipe* *while it's still mutable* is not
good. Which implies that I don't think splice() is good. No offense.
I am *not* saying that the mere existence of mutable data is a problem.
> That's not something specific to "splice()". It's fundamental to the
> whole *concept* of zero-copy. If you don't want copies, and the source
> file changes, then you see those changes.
Of course! A user program copying data from a file to a network
fundamentally does this:
Step 1: start the process.
Step 2: data goes out to the actual wire or a buffer on the NIC or is
otherwise in a place other than page cache, and the kernel reports
completion.
There are many ways to make this happen. Step 1 could be starting
read() and step 2 could be send() returning. Step 1 could be
sticking something in an io_uring queue and step 2 could be reporting
completion. Step 1 could be splice()ing to a pipe and step 2 could be
a splice from the pipe to a socket completing (and maybe even later
when the data actually goes out).
*Obviously* any change to the file between steps 1 and 2 may change
the data that goes out the wire.
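(For reference, the splice() flavor of those two steps looks roughly
like this -- a minimal sketch for a single chunk, error and
short-transfer handling omitted:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* file_fd: the source file, sock_fd: the destination socket. */
static ssize_t splice_one_chunk(int file_fd, int sock_fd, size_t len)
{
        int pipefd[2];
        if (pipe(pipefd) < 0)
                return -1;

        /* Step 1: page-cache pages get attached to the pipe; the data
         * is still mutable through the file at this point. */
        ssize_t in = splice(file_fd, NULL, pipefd[1], NULL, len,
                            SPLICE_F_MOVE);

        /* Step 2: the pages are handed to the socket.  "Completion"
         * here only means the networking code took a reference, not
         * that the data actually hit the wire. */
        ssize_t out = in > 0 ?
                splice(pipefd[0], NULL, sock_fd, NULL, in, SPLICE_F_MOVE) :
                in;

        close(pipefd[0]);
        close(pipefd[1]);
        return out;
}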
> So the data lifetime - even just on just one side - can _easily_ be
> "multiple seconds" even when things are normal, and if you have actual
> network connectivity issues we are easily talking minutes.
True.
But splice is extra nasty: step 1 happens potentially arbitrarily long
before step 2, and the kernel doesn't even know which socket the data
is destined for in step 1. So step 1 can't usefully return
-EWOULDBLOCK, for example. And it's awkward for the kernel to report
errors, because steps 1 and 2 are so disconnected. And I'm not
convinced there's any corresponding benefit.
In any case, maybe io_uring gives an opportunity to do much better.
io_uring makes it *efficient* for largish numbers of long-running
operations to all be pending at once. Would an API like this work
better (very handwavy -- I make absolutely no promises that this is
compatible with existing users -- new opcodes might be needed):
Submit IORING_OP_SPLICE from a *file* to a socket: this tells the
kernel to kindly send data from the file in question to the network.
Writes to the file before submission will be reflected in the data
sent. Writes after submission may or may not be reflected. (This is
step 1 above.)
The operation completes (and is reported in the CQ) only after the
kernel knows that the data has been snapshotted (step 2 above). So
completion can be reported when the data is DMAed out or when it's
checksummed-and-copied or if the kernel decides to copy it for any
other reason *and* the kernel knows that it won't need to read the
data again for possible retransmission. As you said, this could
easily take minutes, but that seems maybe okay to me.
(And if Samba needs to make sure that future writes don't change the
outgoing data even two seconds later when the data has been sent but
not acked, then maybe a fancy API could be added to help, or maybe
Samba shouldn't be using zero copy IO in the first place!)
If the file is truncated or some other problem happens, the operation can fail.
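(In liburing terms, submission might look something like the sketch
below. I'm reusing the existing io_uring_prep_splice() helper purely
for illustration; passing a regular file as fd_in and a socket as
fd_out with no pipe in between is exactly the new semantics being
proposed, and today's kernel will just fail it.)

#include <stdint.h>
#include <liburing.h>

/* Queue a zero-copy send of 'len' bytes of 'file_fd', starting at
 * 'file_off', to 'sock_fd'.  Under the proposed semantics the CQE
 * would only show up once the kernel no longer needs the page-cache
 * pages (step 2 above). */
static int queue_file_to_socket(struct io_uring *ring, int file_fd,
                                int64_t file_off, int sock_fd,
                                unsigned int len)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (!sqe)
                return -1;
        /* -1 for the output offset means "no offset", as for a socket. */
        io_uring_prep_splice(sqe, file_fd, file_off, sock_fd, -1, len, 0);
        return io_uring_submit(ring);
}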
I don't know how easy or hard this is to implement, but it seems like
it would be quite pleasant to *use* from user code, it ought to be
even faster than splice-to-pipe-then-splice-to-socket (simply because
there is less bookkeeping), and it doesn't seem like any file change
tracking would be needed in the kernel.
If this works and becomes popular enough, splice-from-file-to-pipe
could *maybe* be replaced in the kernel with a plain copy.
--Andy