From: Kumar Kartikeya Dwivedi <[email protected]>
To: Pavel Begunkov <[email protected]>
Cc: Alexei Starovoitov <[email protected]>,
bpf <[email protected]>, Jens Axboe <[email protected]>,
[email protected], Alexei Starovoitov <[email protected]>,
Daniel Borkmann <[email protected]>,
Andrii Nakryiko <[email protected]>,
Pavel Emelyanov <[email protected]>,
Alexander Mihalicyn <[email protected]>,
Andrei Vagin <[email protected]>,
[email protected], Linux-Fsdevel <[email protected]>
Subject: Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
Date: Sat, 4 Dec 2021 04:46:24 +0530 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On Fri, Dec 03, 2021 at 09:22:54PM IST, Pavel Begunkov wrote:
> On 11/19/21 05:24, Alexei Starovoitov wrote:
> > [...]
> >
> > Also I'd like to hear what Jens and Pavel have to say about
> > applicability of CRIU to io_uring in general.
>
Hi Pavel, thanks for taking a look!
> First, we have no way to know what requests are in flight, without it
> CR doesn't make much sense. The most compelling way for me is to add
> a feature to fail all in-flights as it does when is closed. But maybe,
> you already did solve it somehow?
Indeed, as you note, there is currently no way to inspect in-flight requests; what
we can do is wait on a barrier operation that synchronizes against all previous
requests.
So for now, my idea is to drain the ring (waiting for all in-flight requests to
complete) by submitting an IORING_OP_NOP with IOSQE_IO_DRAIN set, and then waiting
with a fixed timeout, so that if forward progress depends on a blocked task (possibly
ourselves), we can fail the dump. This is of course best effort, but it has worked
well for many of the cases I tested so far.
This might have some other issues, e.g. not being able to accommodate all posted
completions in the CQ ring due to unreaped completions from the time it was
checkpointed. In that case we can simply give up, since recreating the ring as it
was becomes very hard if we let it trample over unread items (it is unclear how I
could post completions at restore time that were sitting in the overflow list at
dump time).
One idea I had in mind was to add support for posting a dummy CQE (by submitting
e.g. an IORING_OP_NOP) whose CQE fields are set at submission time. This allows
faking a completed request; at restore we can then push all these into the overflow
list and project the state as if the CQ ring had been full. At dump time it allows
us to continually reap completion items. If we detect that the kernel doesn't
support overflow, we fail.
Adjusting the kernel-side tail is not as hard (we can use IORING_OP_NOP
completions to fill it up, then rewrite the entries).
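A rough userspace model of that tail adjustment (purely illustrative: the real
restore would operate on the mmap'ed CQ ring shared with the kernel, and the
struct below just mirrors the UAPI io_uring_cqe layout): first advance the tail by
letting the kernel post NOP completions, then rewrite the now-valid slots with the
CQEs saved at dump time.

```c
#include <stdint.h>
#include <string.h>

/* Same field layout as the UAPI struct io_uring_cqe. */
struct cqe {
	uint64_t user_data;
	int32_t  res;
	uint32_t flags;
};

#define CQ_ENTRIES 8
#define CQ_MASK (CQ_ENTRIES - 1)

/* Model of the shared CQ ring: the entry array plus head/tail indices. */
struct cq_ring {
	struct cqe cqes[CQ_ENTRIES];
	uint32_t head, tail;
};

/* Step 1: advance the tail with NOP completions (in the real flow these
 * come from submitting IORING_OP_NOPs and letting the kernel post them). */
static void fill_with_nops(struct cq_ring *cq, uint32_t count)
{
	for (uint32_t i = 0; i < count; i++) {
		struct cqe *c = &cq->cqes[cq->tail & CQ_MASK];
		memset(c, 0, sizeof(*c));
		cq->tail++;
	}
}

/* Step 2: rewrite the now-allocated slots with the CQEs saved at dump. */
static void rewrite_saved(struct cq_ring *cq, const struct cqe *saved,
			  uint32_t count)
{
	for (uint32_t i = 0; i < count; i++)
		cq->cqes[(cq->head + i) & CQ_MASK] = saved[i];
}
```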
There were other operations (like registering buffers) that had a similar side
effect of synchronizing ring state (waiting for it to become idle) before
returning to userspace, but that was pre-5.13.
Also, we have to do this ring synchronization fairly early during the dump, since
it can add other resources (e.g. fds) to the task that then need to be dumped as
well.
> There is probably a way to restore registered buffers and files, though
> it may be tough considering that files may not have corresponding fds in
> the userspace, buffers may be unmapped, buffers may come from
> shmem/etc. and other corner cases.
See [0] for some explanation of all that. CRIU also knows whether a given VMA
comes from shmem or not (and its restoration is already handled separately).
>
> There are also not covered here pieces of state, SELECT_BUFFER
> buffers, personalities (aka creds), registered eventfd, io-wq
> configuration, etc. I'm assuming you'll be checking them and
> failing CR if any of them is there.
Personalities are not as hard (IIUC), because all the required state is
available through fdinfo. In the PR linked in this thread, there is code to
parse it and restore using the saved credentials (though we might want to
implement UID mapping options, or let the user do image rewriting for that,
which is a separate concern).
Ideally I'd like to be able to grab this state from the iterator as well, but that
needs at least a bpf_xa_for_each helper, since io_uring's show_fdinfo skips some
crucial data when it detects contention on uring_lock (and doesn't indicate
this at all) :(. See the conditional printing on 'has_lock'.
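The fdinfo-based fallback amounts to text scraping; a minimal sketch of pulling
the personality ids out of the fdinfo dump is below. Note the sample layout in
the test (one tab-indented id per line under "Personalities:") is my
approximation of what show_fdinfo prints, not a verified format; the real parser
also has to handle the per-id credential lines (Uid/Gid/Groups/CapEff).

```c
#include <stdio.h>
#include <string.h>

/* Scan io_uring fdinfo text for the "Personalities:" section and collect
 * the registered personality ids (assumed format: one indented id per
 * line following the section header). Returns the number of ids found. */
static int parse_personalities(const char *fdinfo, unsigned *ids, int max)
{
	const char *p = strstr(fdinfo, "Personalities:");
	int n = 0;

	if (!p)
		return 0;
	p = strchr(p, '\n');
	while (p && n < max) {
		unsigned id;

		/* Each entry starts on its own indented line with the id;
		 * stop at the first line that isn't a bare number. */
		if (sscanf(++p, " %u", &id) != 1)
			break;
		ids[n++] = id;
		p = strchr(p, '\n');
	}
	return n;
}
```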
SELECT_BUFFER is indeed unhandled right now. I'm contemplating ways to extend
the iterator so that it can loop over all items of generic structures like the
XArray, while taking the locks relevant to the specific hook. Both personality
registration and IORING_OP_PROVIDE_BUFFERS insertion use an XArray, so it might
make more sense to add a bpf_xa_for_each than to introduce another iterator, and
only mark it as safe in this iterator context (where the appropriate locks,
e.g. ctx->uring_lock, are held).
For the registered eventfd and io-wq, you can look at [0] to see how I am solving
that; TL;DR, I just map the underlying structure to an open fd in the task. eBPF
is flexible enough to also allow state inspection in case e.g. the corresponding
eventfd has been closed, so that we can recreate it, register it, and close it
again when restoring. The same goes for files added directly to the fixed file
set; the whole idea of bringing in eBPF was to make dumping these resources
possible when they are "hidden" from normal view.
[0]: https://lore.kernel.org/bpf/[email protected]
>
> And the last point, there will be some stuff CR of which is
> likely to be a bad idea. E.g. registered dmabuf's,
> pre-registered DMA mappings, zerocopy contexts and so on.
>
Yes, we can just fail the dump for these cases. There are many other cases (in
general) where we just have to give up.
> IOW, if the first point is solved, there may be a subset of ring
> setups that can probably be CR. That should cover a good amount
> of cases. I don't have a strong opinion on the whole thing,
> I guess it depends on the amount of problems to implement
> in-flight cancellations.
>
> --
> Pavel Begunkov
--
Kartikeya