From: Stanislav Fomichev <stfomichev@gmail.com>
To: David Laight <david.laight.linux@gmail.com>
Cc: "Stefan Metzmacher" <metze@samba.org>,
"Breno Leitao" <leitao@debian.org>,
"Linus Torvalds" <torvalds@linux-foundation.org>,
"Jens Axboe" <axboe@kernel.dk>,
"Pavel Begunkov" <asml.silence@gmail.com>,
"Jakub Kicinski" <kuba@kernel.org>,
"Christoph Hellwig" <hch@lst.de>,
"Karsten Keil" <isdn@linux-pingi.de>,
"Ayush Sawal" <ayush.sawal@chelsio.com>,
"Andrew Lunn" <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
"Eric Dumazet" <edumazet@google.com>,
"Paolo Abeni" <pabeni@redhat.com>,
"Simon Horman" <horms@kernel.org>,
"Kuniyuki Iwashima" <kuniyu@amazon.com>,
"Willem de Bruijn" <willemb@google.com>,
"David Ahern" <dsahern@kernel.org>,
"Marcelo Ricardo Leitner" <marcelo.leitner@gmail.com>,
"Xin Long" <lucien.xin@gmail.com>,
"Neal Cardwell" <ncardwell@google.com>,
"Joerg Reuter" <jreuter@yaina.de>,
"Marcel Holtmann" <marcel@holtmann.org>,
"Johan Hedberg" <johan.hedberg@gmail.com>,
"Luiz Augusto von Dentz" <luiz.dentz@gmail.com>,
"Oliver Hartkopp" <socketcan@hartkopp.net>,
"Marc Kleine-Budde" <mkl@pengutronix.de>,
"Robin van der Gracht" <robin@protonic.nl>,
"Oleksij Rempel" <o.rempel@pengutronix.de>,
kernel@pengutronix.de, "Alexander Aring" <alex.aring@gmail.com>,
"Stefan Schmidt" <stefan@datenfreihafen.org>,
"Miquel Raynal" <miquel.raynal@bootlin.com>,
"Alexandra Winter" <wintera@linux.ibm.com>,
"Thorsten Winkler" <twinkler@linux.ibm.com>,
"James Chapman" <jchapman@katalix.com>,
"Jeremy Kerr" <jk@codeconstruct.com.au>,
"Matt Johnston" <matt@codeconstruct.com.au>,
"Matthieu Baerts" <matttbe@kernel.org>,
"Mat Martineau" <martineau@kernel.org>,
"Geliang Tang" <geliang@kernel.org>,
"Krzysztof Kozlowski" <krzk@kernel.org>,
"Remi Denis-Courmont" <courmisch@gmail.com>,
"Allison Henderson" <allison.henderson@oracle.com>,
"David Howells" <dhowells@redhat.com>,
"Marc Dionne" <marc.dionne@auristor.com>,
"Wenjia Zhang" <wenjia@linux.ibm.com>,
"Jan Karcher" <jaka@linux.ibm.com>,
"D. Wythe" <alibuda@linux.alibaba.com>,
"Tony Lu" <tonylu@linux.alibaba.com>,
"Wen Gu" <guwen@linux.alibaba.com>,
"Jon Maloy" <jmaloy@redhat.com>,
"Boris Pismenny" <borisp@nvidia.com>,
"John Fastabend" <john.fastabend@gmail.com>,
"Stefano Garzarella" <sgarzare@redhat.com>,
"Martin Schiller" <ms@dev.tdt.de>,
"Björn Töpel" <bjorn@kernel.org>,
"Magnus Karlsson" <magnus.karlsson@intel.com>,
"Maciej Fijalkowski" <maciej.fijalkowski@intel.com>,
"Jonathan Lemon" <jonathan.lemon@gmail.com>,
"Alexei Starovoitov" <ast@kernel.org>,
"Daniel Borkmann" <daniel@iogearbox.net>,
"Jesper Dangaard Brouer" <hawk@kernel.org>,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-sctp@vger.kernel.org, linux-hams@vger.kernel.org,
linux-bluetooth@vger.kernel.org, linux-can@vger.kernel.org,
dccp@vger.kernel.org, linux-wpan@vger.kernel.org,
linux-s390@vger.kernel.org, mptcp@lists.linux.dev,
linux-rdma@vger.kernel.org, rds-devel@oss.oracle.com,
linux-afs@lists.infradead.org,
tipc-discussion@lists.sourceforge.net,
virtualization@lists.linux.dev, linux-x25@vger.kernel.org,
bpf@vger.kernel.org, isdn4linux@listserv.isdn4linux.de,
io-uring@vger.kernel.org
Subject: Re: [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt()
Date: Wed, 2 Apr 2025 14:21:35 -0700 [thread overview]
Message-ID: <Z-2qX_N2-jpMYSIy@mini-arch> (raw)
In-Reply-To: <20250402214638.0b5eed55@pumpkin>
On 04/02, David Laight wrote:
> On Wed, 2 Apr 2025 07:19:46 -0700
> Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> > On 04/02, David Laight wrote:
> > > On Wed, 2 Apr 2025 00:53:58 +0200
> > > Stefan Metzmacher <metze@samba.org> wrote:
> > >
> > > > Am 02.04.25 um 00:04 schrieb Stanislav Fomichev:
> > > > > On 04/01, Stefan Metzmacher wrote:
> > > > >> Am 01.04.25 um 17:45 schrieb Stanislav Fomichev:
> > > > >>> On 04/01, Breno Leitao wrote:
> > > > >>>> On Tue, Apr 01, 2025 at 03:48:58PM +0200, Stefan Metzmacher wrote:
> > > > >>>>> Am 01.04.25 um 15:37 schrieb Stefan Metzmacher:
> > > > >>>>>> Am 01.04.25 um 10:19 schrieb Stefan Metzmacher:
> > > > >>>>>>> Am 31.03.25 um 23:04 schrieb Stanislav Fomichev:
> > > > >>>>>>>> On 03/31, Stefan Metzmacher wrote:
> > > > >>>>>>>>> The motivation for this is to remove the SOL_SOCKET limitation
> > > > >>>>>>>>> from io_uring_cmd_getsockopt().
> > > > >>>>>>>>>
> > > > >>>>>>>>> The reason for this limitation is that io_uring_cmd_getsockopt()
> > > > >>>>>>>>> passes a kernel pointer as optlen to do_sock_getsockopt()
> > > > >>>>>>>>> and can't reach the ops->getsockopt() path.
> > > > >>>>>>>>>
> > > > >>>>>>>>> The first idea would be to change the optval and optlen arguments
> > > > >>>>>>>>> to the protocol specific hooks also to sockptr_t, as that
> > > > >>>>>>>>> is already used for setsockopt() and also by do_sock_getsockopt()
> > > > >>>>>>>>> sk_getsockopt() and BPF_CGROUP_RUN_PROG_GETSOCKOPT().
> > > > >>>>>>>>>
> > > > >>>>>>>>> But as Linus don't like 'sockptr_t' I used a different approach.
> > > > >>>>>>>>>
> > > > >>>>>>>>> @Linus, would that optlen_t approach fit better for you?
> > > > >>>>>>>>
> > > > >>>>>>>> [..]
> > > > >>>>>>>>
> > > > >>>>>>>>> Instead of passing the optlen as user or kernel pointer,
> > > > >>>>>>>>> we only ever pass a kernel pointer and do the
> > > > >>>>>>>>> translation from/to userspace in do_sock_getsockopt().
> > > > >>>>>>>>
> > > > >>>>>>>> At this point why not just fully embrace iov_iter? You have the size
> > > > >>>>>>>> now + the user (or kernel) pointer. Might as well do
> > > > >>>>>>>> s/sockptr_t/iov_iter/ conversion?
> > > > >>>>>>>
> > > > >>>>>>> I think that would only be possible if we introduce
> > > > >>>>>>> proto[_ops].getsockopt_iter() and then convert the implementations
> > > > >>>>>>> step by step. Doing it all in one go has a lot of potential to break
> > > > >>>>>>> the uapi. I could try to convert things like socket, ip and tcp myself, but
> > > > >>>>>>> the rest needs to be converted by the maintainer of the specific protocol,
> > > > >>>>>>> as it needs to be tested. As there are crazy things happening in the existing
> > > > >>>>>>> implementations, e.g. some getsockopt() implementations use optval as in and out
> > > > >>>>>>> buffer.
> > > > >>>>>>>
> > > > >>>>>>> I first tried to convert both optval and optlen of getsockopt to sockptr_t,
> > > > >>>>>>> and that showed that touching the optval part starts to get complex very soon,
> > > > >>>>>>> see https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=141912166473bf8843ec6ace76dc9c6945adafd1
> > > > >>>>>>> (note it didn't converted everything, I gave up after hitting
> > > > >>>>>>> sctp_getsockopt_peer_addrs and sctp_getsockopt_local_addrs.
> > > > >>>>>>> sctp_getsockopt_context, sctp_getsockopt_maxseg, sctp_getsockopt_associnfo and maybe
> > > > >>>>>>> more are the ones also doing both copy_from_user and copy_to_user on optval)
> > > > >>>>>>>
> > > > >>>>>>> I come also across one implementation that returned -ERANGE because *optlen was
> > > > >>>>>>> too short and put the required length into *optlen, which means the returned
> > > > >>>>>>> *optlen is larger than the optval buffer given from userspace.
> > > > >>>>>>>
> > > > >>>>>>> Because of all these strange things I tried to do a minimal change
> > > > >>>>>>> in order to get rid of the io_uring limitation and only converted
> > > > >>>>>>> optlen and leave optval as is.
> > > > >>>>>>>
> > > > >>>>>>> In order to have a patchset that has a low risk to cause regressions.
> > > > >>>>>>>
> > > > >>>>>>> But as alternative introducing a prototype like this:
> > > > >>>>>>>
> > > > >>>>>>> int (*getsockopt_iter)(struct socket *sock, int level, int optname,
> > > > >>>>>>> struct iov_iter *optval_iter);
> > > > >>>>>>>
> > > > >>>>>>> That returns a non-negative value which can be placed into *optlen
> > > > >>>>>>> or negative value as error and *optlen will not be changed on error.
> > > > >>>>>>> optval_iter will get direction ITER_DEST, so it can only be written to.
> > > > >>>>>>>
> > > > >>>>>>> Implementations could then opt in for the new interface and
> > > > >>>>>>> allow do_sock_getsockopt() work also for the io_uring case,
> > > > >>>>>>> while all others would still get -EOPNOTSUPP.
> > > > >>>>>>>
> > > > >>>>>>> So what should be the way to go?
> > > > >>>>>>
> > > > >>>>>> Ok, I've added the infrastructure for getsockopt_iter, see below,
> > > > >>>>>> but the first part I wanted to convert was
> > > > >>>>>> tcp_ao_copy_mkts_to_user() and that also reads from userspace before
> > > > >>>>>> writing.
> > > > >>>>>>
> > > > >>>>>> So we could go with the optlen_t approach, or we need
> > > > >>>>>> logic for ITER_BOTH or pass two iov_iters one with ITER_SRC and one
> > > > >>>>>> with ITER_DEST...
> > > > >>>>>>
> > > > >>>>>> So who wants to decide?
> > > > >>>>>
> > > > >>>>> I just noticed that it's even possible in same cases
> > > > >>>>> to pass in a short buffer to optval, but have a longer value in optlen,
> > > > >>>>> hci_sock_getsockopt() with SOL_BLUETOOTH completely ignores optlen.
> > > > >>>>>
> > > > >>>>> This makes it really hard to believe that trying to use iov_iter for this
> > > > >>>>> is a good idea :-(
> > > > >>>>
> > > > >>>> That was my finding as well a while ago, when I was planning to get the
> > > > >>>> __user pointers converted to iov_iter. There are some weird ways of
> > > > >>>> using optlen and optval, which makes them non-trivial to covert to
> > > > >>>> iov_iter.
> > > > >>>
> > > > >>> Can we ignore all non-ip/tcp/udp cases for now? This should cover +90%
> > > > >>> of useful socket opts. See if there are any obvious problems with them
> > > > >>> and if not, try converting. The rest we can cover separately when/if
> > > > >>> needed.
> > > > >>
> > > > >> That's what I tried, but it fails with
> > > > >> tcp_getsockopt ->
> > > > >> do_tcp_getsockopt ->
> > > > >> tcp_ao_get_mkts ->
> > > > >> tcp_ao_copy_mkts_to_user ->
> > > > >> copy_struct_from_sockptr
> > > > >> tcp_ao_get_sock_info ->
> > > > >> copy_struct_from_sockptr
> > > > >>
> > > > >> That's not possible with a ITER_DEST iov_iter.
> > > > >>
> > > > >> metze
> > > > >
> > > > > Can we create two iterators over the same memory? One for ITER_SOURCE and
> > > > > another for ITER_DEST. And then make getsockopt_iter accept optval_in and
> > > > > optval_out. We can also use optval_out position (iov_offset) as optlen output
> > > > > value. Don't see why it won't work, but I agree that's gonna be a messy
> > > > > conversion so let's see if someone else has better suggestions.
> > > >
> > > > Yes, that might work, but it would be good to get some feedback
> > > > if this would be the way to go:
> > > >
> > > > int (*getsockopt_iter)(struct socket *sock,
> > > > int level, int optname,
> > > > struct iov_iter *optval_in,
> > > > struct iov_iter *optval_out);
> > > >
> > > > And *optlen = optval_out->iov_offset;
> > > >
> > > > Any objection or better ideas? Linus would that be what you had in mind?
> > >
> > > I'd worry about performance - yes I know 'iter' are used elsewhere but...
> > > Also look at the SCTP code.
> >
> > Performance usually does not matter for set/getsockopts, there
> > are a few exceptions that I know (TCP_ZEROCOPY_RECEIVE)
>
> That might be the one that is really horrid and completely abuses
> the 'length' parameter.
It is reading and writing, yes, but it's not a huge problem. And it
does enforce the optlen (to copy back the same amount of bytes). It's
not that bad, it's just an example of where we need to be extra
careful.
> > and maybe recent
> > devmem sockopts; we can special-case these if needed, or keep sockptr_t,
> > idk. I'm skeptical we can convert everything though, that's why the
> > suggestion to start with sk/ip/tcp/udp.
> >
> > > How do you handle code that wants to return an updated length (often longer
> > > than the one provided) and an error code (eg ERRSIZE or similar).
> > >
> > > There is also a very strange use (I think it is a sockopt rather than an ioctl)
> > > where the buffer length the application provides is only that of the header.
> > > The actual buffer length is contained in the header.
> > > The return length is the amount written into the full buffer.
> >
> > Let's discuss these special cases as they come up? Worst case these
> > places can always re-init iov_iter with a comment on why it is ok.
> > But I do agree in general that there are a few places that do wild
> > stuff.
>
> The problem is that the generic code has to deal with all the 'wild stuff'.
getsockopt_iter will have optval_in for the minority of socket options
(like TCP_ZEROCOPY_RECEIVE) that want to read user's value as well
as optval_out. The latter is what the majority of socket options
will use to write their value. That doesn't seem too complicated to
handle?
> It is also common to do non-sequential accesses - so iov_iter doesn't match
> at all.
I disagree that it's 'common'. Searching for copy_from_sockptr_offset
returns a few cases and they are mostly using read-with-offset because
there is no sequential read (iterator) semantics with sockptr_t.
> There also isn't a requirement for scatter-gather.
>
> For 'normal' getsockopt (and setsockopt) with short lengths it actually makes
> sense for the syscall wrapper to do the user copies.
> But it would need to pass the user ptr+len as well as the kernel ptr+len
> to give the required flexibilty.
> Then you have to work out whether the final copy to user is needed or not.
> (not that hard, but it all adds complication).
Not sure I understand what's the problem. The user vs kernel part will
be abstracted by iov_iter. The callers will have to write the optlen
back. And there are two call sites we care about: io_uring and regular
system call. What's your suggestion? Maybe I'm missing something. Do you
prefer get_optlen/put_optlen?
next prev parent reply other threads:[~2025-04-02 21:21 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-03-31 20:10 [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt() Stefan Metzmacher
2025-03-31 20:10 ` [RFC PATCH 1/4] net: introduce get_optlen() and put_optlen() helpers Stefan Metzmacher
2025-04-01 12:17 ` Breno Leitao
2025-04-01 12:22 ` Stefan Metzmacher
2025-03-31 20:10 ` [RFC PATCH 2/4] net: pass 'optlen_t' to proto[ops].getsockopt() hooks Stefan Metzmacher
2025-03-31 20:27 ` Stefan Metzmacher
2025-03-31 20:10 ` [RFC PATCH 3/4] net: pass a kernel pointer via " Stefan Metzmacher
2025-03-31 21:49 ` David Laight
2025-04-01 8:24 ` Stefan Metzmacher
2025-03-31 20:10 ` [RFC PATCH 4/4] io_uring: let io_uring_cmd_getsockopt() allow level other than SOL_SOCKET Stefan Metzmacher
2025-03-31 21:04 ` [RFC PATCH 0/4] net/io_uring: pass a kernel pointer via optlen_t to proto[_ops].getsockopt() Stanislav Fomichev
2025-04-01 8:19 ` Stefan Metzmacher
2025-04-01 13:37 ` Stefan Metzmacher
2025-04-01 13:48 ` Stefan Metzmacher
2025-04-01 15:35 ` Breno Leitao
2025-04-01 15:45 ` Stanislav Fomichev
2025-04-01 21:20 ` Stefan Metzmacher
2025-04-01 22:04 ` Stanislav Fomichev
2025-04-01 22:53 ` Stefan Metzmacher
2025-04-02 12:29 ` David Laight
2025-04-02 14:19 ` Stanislav Fomichev
2025-04-02 20:46 ` David Laight
2025-04-02 21:07 ` Linus Torvalds
2025-04-02 21:21 ` Stanislav Fomichev [this message]
2025-04-02 22:38 ` David Laight
2025-04-02 23:39 ` Stanislav Fomichev
2025-04-02 0:40 ` Linus Torvalds
2025-04-02 12:35 ` David Laight
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z-2qX_N2-jpMYSIy@mini-arch \
--to=stfomichev@gmail.com \
--cc=alex.aring@gmail.com \
--cc=alibuda@linux.alibaba.com \
--cc=allison.henderson@oracle.com \
--cc=andrew+netdev@lunn.ch \
--cc=asml.silence@gmail.com \
--cc=ast@kernel.org \
--cc=axboe@kernel.dk \
--cc=ayush.sawal@chelsio.com \
--cc=bjorn@kernel.org \
--cc=borisp@nvidia.com \
--cc=bpf@vger.kernel.org \
--cc=courmisch@gmail.com \
--cc=daniel@iogearbox.net \
--cc=davem@davemloft.net \
--cc=david.laight.linux@gmail.com \
--cc=dccp@vger.kernel.org \
--cc=dhowells@redhat.com \
--cc=dsahern@kernel.org \
--cc=edumazet@google.com \
--cc=geliang@kernel.org \
--cc=guwen@linux.alibaba.com \
--cc=hawk@kernel.org \
--cc=hch@lst.de \
--cc=horms@kernel.org \
--cc=io-uring@vger.kernel.org \
--cc=isdn4linux@listserv.isdn4linux.de \
--cc=isdn@linux-pingi.de \
--cc=jaka@linux.ibm.com \
--cc=jchapman@katalix.com \
--cc=jk@codeconstruct.com.au \
--cc=jmaloy@redhat.com \
--cc=johan.hedberg@gmail.com \
--cc=john.fastabend@gmail.com \
--cc=jonathan.lemon@gmail.com \
--cc=jreuter@yaina.de \
--cc=kernel@pengutronix.de \
--cc=krzk@kernel.org \
--cc=kuba@kernel.org \
--cc=kuniyu@amazon.com \
--cc=leitao@debian.org \
--cc=linux-afs@lists.infradead.org \
--cc=linux-bluetooth@vger.kernel.org \
--cc=linux-can@vger.kernel.org \
--cc=linux-hams@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=linux-s390@vger.kernel.org \
--cc=linux-sctp@vger.kernel.org \
--cc=linux-wpan@vger.kernel.org \
--cc=linux-x25@vger.kernel.org \
--cc=lucien.xin@gmail.com \
--cc=luiz.dentz@gmail.com \
--cc=maciej.fijalkowski@intel.com \
--cc=magnus.karlsson@intel.com \
--cc=marc.dionne@auristor.com \
--cc=marcel@holtmann.org \
--cc=marcelo.leitner@gmail.com \
--cc=martineau@kernel.org \
--cc=matt@codeconstruct.com.au \
--cc=matttbe@kernel.org \
--cc=metze@samba.org \
--cc=miquel.raynal@bootlin.com \
--cc=mkl@pengutronix.de \
--cc=mptcp@lists.linux.dev \
--cc=ms@dev.tdt.de \
--cc=ncardwell@google.com \
--cc=netdev@vger.kernel.org \
--cc=o.rempel@pengutronix.de \
--cc=pabeni@redhat.com \
--cc=rds-devel@oss.oracle.com \
--cc=robin@protonic.nl \
--cc=sgarzare@redhat.com \
--cc=socketcan@hartkopp.net \
--cc=stefan@datenfreihafen.org \
--cc=tipc-discussion@lists.sourceforge.net \
--cc=tonylu@linux.alibaba.com \
--cc=torvalds@linux-foundation.org \
--cc=twinkler@linux.ibm.com \
--cc=virtualization@lists.linux.dev \
--cc=wenjia@linux.ibm.com \
--cc=willemb@google.com \
--cc=wintera@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox