public inbox for io-uring@vger.kernel.org

* [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers
@ 2026-01-30 18:46 Breno Leitao
  2026-01-30 18:46 ` [PATCH net-next RFC 1/3] net: add getsockopt_iter callback to proto_ops Breno Leitao
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Breno Leitao @ 2026-01-30 18:46 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, metze, axboe,
	Stanislav Fomichev
  Cc: io-uring, bpf, netdev, Linus Torvalds, linux-kernel, kernel-team,
	Breno Leitao

Currently, .getsockopt callback cannot be called with kernel buffers
because it requires userspace addresses:

  int (*getsockopt)(struct socket *sock, int level,
		    int optname, char __user *optval, int __user *optlen);

This prevents kernel callers (io_uring, BPF, etc) from using getsockopt
on levels other than SOL_SOCKET, since they pass kernel pointers rather
than __user pointers.

Following Linus' suggestion [0], this series introduces a wrapper
around iov_iter (sockopt_t) and a temporary getsockopt_iter callback:

  typedef struct sockopt {
	  struct iov_iter iter;
	  int optlen;
  } sockopt_t;

Note: optlen was not part of Linus' suggestion, but I believe it is needed,
given that protocols can pass arbitrary length values back to userspace.

And the callback becomes:

  int (*getsockopt_iter)(struct socket *sock, int level,
			 int optname, sockopt_t *opt);

The sockopt_t structure encapsulates:
- An iov_iter for reading/writing option data (works with both user
  and kernel buffers)
- An optlen field for buffer size (input) and returned data size
  (output)
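
As a rough illustration of that in/out contract, here is a minimal
userspace sketch. Everything prefixed model_ is hypothetical and only
stands in for the kernel types (struct model_iter for iov_iter,
model_copy_to_iter() for copy_to_iter()); it is not code from this
series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct iov_iter: one destination buffer
 * plus the number of bytes still free in it. */
struct model_iter {
	unsigned char *buf;
	size_t count;
};

/* Model of the proposed sockopt_t: iterator plus in/out length. */
struct model_sockopt {
	struct model_iter iter;
	int optlen;	/* in: buffer size, out: returned data size */
};

/* Stand-in for copy_to_iter(): copies at most iter->count bytes and
 * returns how many bytes were actually copied. */
static size_t model_copy_to_iter(const void *src, size_t len,
				 struct model_iter *iter)
{
	size_t n = len < iter->count ? len : iter->count;

	assert(src && iter);
	memcpy(iter->buf, src, n);
	iter->buf += n;
	iter->count -= n;
	return n;
}

/* A callback in the new style that returns one int-sized option. */
static int model_get_int_option(struct model_sockopt *opt, int value)
{
	if (opt->optlen < 0)
		return -EINVAL;
	if (model_copy_to_iter(&value, sizeof(value), &opt->iter) !=
	    sizeof(value))
		return -EFAULT;
	opt->optlen = sizeof(value);	/* report the size actually used */
	return 0;
}
```

The callback never sees whether the destination is a user or a kernel
buffer; it only copies through the iterator and updates optlen.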

The plan is to enable getsockopt to leverage kernel buffers initially,
but then move .setsockopt from sockptr_t into this as well.

This series:

 1. Adds the sockopt_t type and getsockopt_iter callback to proto_ops
 2. Adds do_sock_getsockopt_iter() helper that prefers getsockopt_iter
 3. Converts one protocol (netlink) to use getsockopt_iter as a proof of
    concept

This is what I have in mind for this work stream, to make it more
digestible:

 * Keep the temporary getsockopt_iter callback so that protocols can
   migrate gradually.
 * Once all protocols have been converted, getsockopt can be removed and
   getsockopt_iter renamed back to getsockopt with the new API.
 * Once the protocols are converted, the SOL_SOCKET limitation in
   io_uring_cmd_getsockopt() will be removed.
 * Convert setsockopt() to also use a similar strategy, moving it away
   from sockptr_t.
 * Remove sockptr_t in the front end (do_sock_getsockopt(),
   io_uring_cmd_getsockopt()) and start with sockopt_t (instead of
   sockptr_t) in __sys_getsockopt() and io_uring_cmd_getsockopt()

Link: https://lore.kernel.org/all/CAHk-=whmzrO-BMU=uSVXbuoLi-3tJsO=0kHj1BCPBE3F2kVhTA@mail.gmail.com/ [0]
---
Breno Leitao (3):
      net: add getsockopt_iter callback to proto_ops
      net: prefer getsockopt_iter in do_sock_getsockopt
      netlink: convert to getsockopt_iter

 include/linux/net.h      | 19 +++++++++++++++++++
 net/netlink/af_netlink.c | 22 ++++++++++++----------
 net/socket.c             | 42 +++++++++++++++++++++++++++++++++++++++---
 3 files changed, 70 insertions(+), 13 deletions(-)
---
base-commit: 4d310797262f0ddf129e76c2aad2b950adaf1fda
change-id: 20260130-getsockopt-9f36625eedcb

Best regards,
--  
Breno Leitao <leitao@debian.org>


* [PATCH net-next RFC 1/3] net: add getsockopt_iter callback to proto_ops
  2026-01-30 18:46 [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers Breno Leitao
@ 2026-01-30 18:46 ` Breno Leitao
  2026-01-30 18:46 ` [PATCH net-next RFC 2/3] net: prefer getsockopt_iter in do_sock_getsockopt Breno Leitao
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Breno Leitao @ 2026-01-30 18:46 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, metze, axboe,
	Stanislav Fomichev
  Cc: io-uring, bpf, netdev, Linus Torvalds, linux-kernel, kernel-team,
	Breno Leitao

Add a new getsockopt_iter callback to struct proto_ops that uses
sockopt_t, a type-safe wrapper around iov_iter. This provides a clean
interface for socket option operations that works with both user and
kernel buffers.

The sockopt_t type encapsulates an iov_iter and an optlen field.

The optlen field, although not suggested by Linus, serves as both input
(buffer size) and output (returned data size). It allows callbacks to
return an arbitrary length independent of the bytes written via
copy_to_iter(), so it is kept separate from iov_iter.count.

This is preparatory work for removing the SOL_SOCKET level restriction
from io_uring getsockopt operations.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 include/linux/net.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/include/linux/net.h b/include/linux/net.h
index f58b38ab37f8a..94f6c86769afc 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -23,9 +23,26 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/sockptr.h>
+#include <linux/uio.h>
 
 #include <uapi/linux/net.h>
 
+/**
+ * struct sockopt - socket option value container
+ * @iter: iov_iter for reading/writing option data
+ * @optlen: buffer size on input; set by callback to the returned data size
+ *
+ * Type-safe wrapper for socket option data that works with both
+ * user and kernel buffers.
+ *
+ * The optlen field allows callbacks to return a specific length value
+ * independent of the bytes written via copy_to_iter().
+ */
+typedef struct sockopt {
+	struct iov_iter iter;
+	int optlen;
+} sockopt_t;
+
 struct poll_table_struct;
 struct pipe_inode_info;
 struct inode;
@@ -192,6 +209,8 @@ struct proto_ops {
 				      unsigned int optlen);
 	int		(*getsockopt)(struct socket *sock, int level,
 				      int optname, char __user *optval, int __user *optlen);
+	int		(*getsockopt_iter)(struct socket *sock, int level,
+				      int optname, sockopt_t *opt);
 	void		(*show_fdinfo)(struct seq_file *m, struct socket *sock);
 	int		(*sendmsg)   (struct socket *sock, struct msghdr *m,
 				      size_t total_len);

-- 
2.47.3



* [PATCH net-next RFC 2/3] net: prefer getsockopt_iter in do_sock_getsockopt
  2026-01-30 18:46 [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers Breno Leitao
  2026-01-30 18:46 ` [PATCH net-next RFC 1/3] net: add getsockopt_iter callback to proto_ops Breno Leitao
@ 2026-01-30 18:46 ` Breno Leitao
  2026-01-30 18:46 ` [PATCH net-next RFC 3/3] netlink: convert to getsockopt_iter Breno Leitao
  2026-01-30 20:52 ` [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers David Laight
  3 siblings, 0 replies; 10+ messages in thread
From: Breno Leitao @ 2026-01-30 18:46 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, metze, axboe,
	Stanislav Fomichev
  Cc: io-uring, bpf, netdev, Linus Torvalds, linux-kernel, kernel-team,
	Breno Leitao

Update do_sock_getsockopt() to use the new getsockopt_iter callback
when available. Add do_sock_getsockopt_iter() helper that:

1. Reads optlen from user/kernel space
2. Initializes a sockopt_t with the appropriate iov_iter (kvec for
   kernel, ubuf for user buffers) and sets opt.optlen
3. Calls the protocol's getsockopt_iter callback
4. Writes opt.optlen back to user/kernel space

The callback is responsible for setting opt.optlen to indicate the
returned data size.
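
The four steps can be modelled in userspace roughly as follows. This is
only a sketch: the model_ names are hypothetical, a plain pointer stands
in for sockptr_t, and the up-front check for a negative length is an
assumption of the sketch (in the posted helper that check is left to the
protocol callback).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct model_iter {
	unsigned char *buf;
	size_t count;
};

struct model_sockopt {
	struct model_iter iter;
	int optlen;
};

typedef int (*model_getsockopt_iter_t)(int level, int optname,
				       struct model_sockopt *opt);

/* Stand-in for copy_to_iter(): returns the number of bytes copied. */
static size_t model_copy_to_iter(const void *src, size_t len,
				 struct model_iter *iter)
{
	size_t n = len < iter->count ? len : iter->count;

	memcpy(iter->buf, src, n);
	iter->buf += n;
	iter->count -= n;
	return n;
}

/* Example protocol callback: always reports the int value 42. */
static int model_proto_getsockopt(int level, int optname,
				  struct model_sockopt *opt)
{
	int val = 42;

	(void)level;
	(void)optname;
	if (opt->optlen < (int)sizeof(val))
		return -EINVAL;
	if (model_copy_to_iter(&val, sizeof(val), &opt->iter) != sizeof(val))
		return -EFAULT;
	opt->optlen = sizeof(val);
	return 0;
}

static int model_do_getsockopt_iter(model_getsockopt_iter_t cb, int level,
				    int optname, void *optval, int *optlen)
{
	struct model_sockopt opt;
	int koptlen;
	int err;

	assert(optlen);
	if (!cb)
		return -EOPNOTSUPP;

	koptlen = *optlen;			/* 1. read optlen */
	if (koptlen < 0)			/* sketch-only guard */
		return -EINVAL;

	opt.iter.buf = optval;			/* 2. set up the iterator */
	opt.iter.count = (size_t)koptlen;
	opt.optlen = koptlen;

	err = cb(level, optname, &opt);		/* 3. call the protocol */
	if (err)
		return err;

	*optlen = opt.optlen;			/* 4. write optlen back */
	return 0;
}
```

On error the caller's optlen is left untouched, matching the early
return in the helper above.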

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 net/socket.c | 42 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index 136b98c54fb37..2d830262b1be5 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -77,6 +77,7 @@
 #include <linux/mount.h>
 #include <linux/pseudo_fs.h>
 #include <linux/security.h>
+#include <linux/uio.h>
 #include <linux/syscalls.h>
 #include <linux/compat.h>
 #include <linux/kmod.h>
@@ -2356,6 +2357,38 @@ SYSCALL_DEFINE5(setsockopt, int, fd, int, level, int, optname,
 INDIRECT_CALLABLE_DECLARE(bool tcp_bpf_bypass_getsockopt(int level,
 							 int optname));
 
+static int do_sock_getsockopt_iter(struct socket *sock,
+				   const struct proto_ops *ops, int level,
+				   int optname, sockptr_t optval,
+				   sockptr_t optlen)
+{
+	struct kvec kvec;
+	sockopt_t opt;
+	int koptlen;
+	int err;
+
+	if (copy_from_sockptr(&koptlen, optlen, sizeof(int)))
+		return -EFAULT;
+
+	if (optval.is_kernel) {
+		kvec.iov_base = optval.kernel;
+		kvec.iov_len = koptlen;
+		iov_iter_kvec(&opt.iter, ITER_DEST, &kvec, 1, koptlen);
+	} else {
+		iov_iter_ubuf(&opt.iter, ITER_DEST, optval.user, koptlen);
+	}
+	opt.optlen = koptlen;
+
+	err = ops->getsockopt_iter(sock, level, optname, &opt);
+	if (err)
+		return err;
+
+	if (copy_to_sockptr(optlen, &opt.optlen, sizeof(int)))
+		return -EFAULT;
+
+	return 0;
+}
+
 int do_sock_getsockopt(struct socket *sock, bool compat, int level,
 		       int optname, sockptr_t optval, sockptr_t optlen)
 {
@@ -2373,15 +2406,18 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level,
 	ops = READ_ONCE(sock->ops);
 	if (level == SOL_SOCKET) {
 		err = sk_getsockopt(sock->sk, level, optname, optval, optlen);
-	} else if (unlikely(!ops->getsockopt)) {
-		err = -EOPNOTSUPP;
-	} else {
+	} else if (ops->getsockopt_iter) {
+		err = do_sock_getsockopt_iter(sock, ops, level, optname,
+					      optval, optlen);
+	} else if (ops->getsockopt) {
 		if (WARN_ONCE(optval.is_kernel || optlen.is_kernel,
 			      "Invalid argument type"))
 			return -EOPNOTSUPP;
 
 		err = ops->getsockopt(sock, level, optname, optval.user,
 				      optlen.user);
+	} else {
+		err = -EOPNOTSUPP;
 	}
 
 	if (!compat)

-- 
2.47.3



* [PATCH net-next RFC 3/3] netlink: convert to getsockopt_iter
  2026-01-30 18:46 [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers Breno Leitao
  2026-01-30 18:46 ` [PATCH net-next RFC 1/3] net: add getsockopt_iter callback to proto_ops Breno Leitao
  2026-01-30 18:46 ` [PATCH net-next RFC 2/3] net: prefer getsockopt_iter in do_sock_getsockopt Breno Leitao
@ 2026-01-30 18:46 ` Breno Leitao
  2026-01-30 20:52 ` [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers David Laight
  3 siblings, 0 replies; 10+ messages in thread
From: Breno Leitao @ 2026-01-30 18:46 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, metze, axboe,
	Stanislav Fomichev
  Cc: io-uring, bpf, netdev, Linus Torvalds, linux-kernel, kernel-team,
	Breno Leitao

Convert netlink's getsockopt implementation to use the new
getsockopt_iter callback with sockopt_t.

Key changes:
- Replace (char __user *optval, int __user *optlen) with sockopt_t *opt
- Use opt->optlen for buffer length (input) and returned size (output)
- Use copy_to_iter() instead of put_user()/copy_to_user()

The optlen field allows callbacks to return a specific length value
independent of the bytes written via copy_to_iter().

This enables io_uring to call netlink's getsockopt with kernel buffers.
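
To illustrate why optlen is tracked separately from the bytes copied,
here is a userspace sketch of the NETLINK_LIST_MEMBERSHIPS loop. The
model_ names are hypothetical stand-ins for the kernel helpers, and the
"stop when fewer than 4 bytes remain" check is an assumption of the
sketch rather than something visible in the diff context below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct model_iter {
	unsigned char *buf;
	size_t count;
};

/* Stand-in for copy_to_iter(): returns the number of bytes copied. */
static size_t model_copy_to_iter(const void *src, size_t len,
				 struct model_iter *iter)
{
	size_t n = len < iter->count ? len : iter->count;

	memcpy(iter->buf, src, n);
	iter->buf += n;
	iter->count -= n;
	return n;
}

/* Equivalent of ALIGN(BITS_TO_BYTES(ngroups), sizeof(u32)). */
static int model_membership_bytes(unsigned int ngroups)
{
	unsigned int bytes = (ngroups + 7) / 8;

	return (int)((bytes + 3u) & ~3u);
}

/* Emit the group bitmap as 32-bit words, as many as fit in the buffer,
 * but always report the full size needed in *optlen. */
static void model_list_memberships(const unsigned long *groups,
				   unsigned int ngroups,
				   struct model_iter *iter, int *optlen)
{
	size_t pos;

	for (pos = 0; pos * 8 < ngroups; pos += sizeof(uint32_t)) {
		size_t idx = pos / sizeof(unsigned long);
		unsigned int shift =
			(unsigned int)(pos % sizeof(unsigned long)) * 8;
		uint32_t word = (uint32_t)(groups[idx] >> shift);
		size_t copied;

		if (iter->count < sizeof(word))
			break;
		copied = model_copy_to_iter(&word, sizeof(word), iter);
		assert(copied == sizeof(word));
		(void)copied;
	}
	*optlen = model_membership_bytes(ngroups);
}
```

With 40 groups and a 4-byte buffer, one word is copied but optlen still
comes back as 8, which tells the caller how large a buffer to retry
with.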

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 net/netlink/af_netlink.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 8e5151f0c6e46..8a195eb1ef761 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -39,6 +39,7 @@
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
+#include <linux/uio.h>
 #include <linux/skbuff.h>
 #include <linux/netdevice.h>
 #include <linux/rtnetlink.h>
@@ -1716,7 +1717,7 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
 }
 
 static int netlink_getsockopt(struct socket *sock, int level, int optname,
-			      char __user *optval, int __user *optlen)
+			      sockopt_t *opt)
 {
 	struct sock *sk = sock->sk;
 	struct netlink_sock *nlk = nlk_sk(sk);
@@ -1726,8 +1727,7 @@ static int netlink_getsockopt(struct socket *sock, int level, int optname,
 	if (level != SOL_NETLINK)
 		return -ENOPROTOOPT;
 
-	if (get_user(len, optlen))
-		return -EFAULT;
+	len = opt->optlen;
 	if (len < 0)
 		return -EINVAL;
 
@@ -1743,6 +1743,8 @@ static int netlink_getsockopt(struct socket *sock, int level, int optname,
 		break;
 	case NETLINK_LIST_MEMBERSHIPS: {
 		int pos, idx, shift, err = 0;
+		u32 group_val;
+		size_t size;
 
 		netlink_lock_table();
 		for (pos = 0; pos * 8 < nlk->ngroups; pos += sizeof(u32)) {
@@ -1751,14 +1753,14 @@ static int netlink_getsockopt(struct socket *sock, int level, int optname,
 
 			idx = pos / sizeof(unsigned long);
 			shift = (pos % sizeof(unsigned long)) * 8;
-			if (put_user((u32)(nlk->groups[idx] >> shift),
-				     (u32 __user *)(optval + pos))) {
+			group_val = (u32)(nlk->groups[idx] >> shift);
+			size = copy_to_iter(&group_val, sizeof(group_val), &opt->iter);
+			if (size != sizeof(group_val)) {
 				err = -EFAULT;
 				break;
 			}
 		}
-		if (put_user(ALIGN(BITS_TO_BYTES(nlk->ngroups), sizeof(u32)), optlen))
-			err = -EFAULT;
+		opt->optlen = ALIGN(BITS_TO_BYTES(nlk->ngroups), sizeof(u32));
 		netlink_unlock_table();
 		return err;
 	}
@@ -1784,10 +1786,10 @@ static int netlink_getsockopt(struct socket *sock, int level, int optname,
 	len = sizeof(int);
 	val = test_bit(flag, &nlk->flags);
 
-	if (put_user(len, optlen) ||
-	    copy_to_user(optval, &val, len))
+	if (copy_to_iter(&val, len, &opt->iter) != len)
 		return -EFAULT;
 
+	opt->optlen = sizeof(int);
 	return 0;
 }
 
@@ -2813,7 +2815,7 @@ static const struct proto_ops netlink_ops = {
 	.listen =	sock_no_listen,
 	.shutdown =	sock_no_shutdown,
 	.setsockopt =	netlink_setsockopt,
-	.getsockopt =	netlink_getsockopt,
+	.getsockopt_iter =	netlink_getsockopt,
 	.sendmsg =	netlink_sendmsg,
 	.recvmsg =	netlink_recvmsg,
 	.mmap =		sock_no_mmap,

-- 
2.47.3



* Re: [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers
  2026-01-30 18:46 [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers Breno Leitao
                   ` (2 preceding siblings ...)
  2026-01-30 18:46 ` [PATCH net-next RFC 3/3] netlink: convert to getsockopt_iter Breno Leitao
@ 2026-01-30 20:52 ` David Laight
  2026-01-31  1:19   ` Linus Torvalds
  2026-02-02 12:32   ` Breno Leitao
  3 siblings, 2 replies; 10+ messages in thread
From: David Laight @ 2026-01-30 20:52 UTC (permalink / raw)
  To: Breno Leitao
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, metze, axboe,
	Stanislav Fomichev, io-uring, bpf, netdev, Linus Torvalds,
	linux-kernel, kernel-team

On Fri, 30 Jan 2026 10:46:16 -0800
Breno Leitao <leitao@debian.org> wrote:

> Currently, .getsockopt callback cannot be called with kernel buffers
> because it requires userspace addresses:
> 
>   int (*getsockopt)(struct socket *sock, int level,
> 		    int optname, char __user *optval, int __user *optlen);
> 
> This prevents kernel callers (io_uring, BPF, etc) from using getsockopt
> on levels other than SOL_SOCKET, since they pass kernel pointers rather
> than __user pointers.

I had thoughts about this as well.
I think using iov_iter is over the top and may have measurable performance
impact for some paths.

I think the first thing to do is sort out 'optlen'.
There is absolutely no reason for the user pointer being passed into
all the per-protocol functions.
(and the code changes that use sockptr_t are just stupid...)
The system call wrapper can do the user copies, it can also suppress
the write if the value is unchanged (which matters with clac/stac).
The obvious change would be to pass the length itself and make the
return value -ERRNO or the size.

The annoyance is the few places that want to return an error and
change optlen.
That might be best addressed by something like:
#define GETSOCKOPT_RVAL(errval, size) (1 << 31 | (errval) << 20 | (size))
which would get picked in the rval < 0 path.
It would also let 'return 0' mean 'don't change the size' requiring
a special return for the one (or two?) places that want to set the
size to zero and return success.

The length passed should also be 'unsigned int' - with a check for
negative values in the system call wrapper.
(There are many broken drivers that treat negative lengths as 4.)

There is not much point making the 'optval' parameter more than
a structure of a user and kernel address - one of which will be NULL.
(This is safer than sockptr_t's discriminated union.)
You can't police the length because it is sometimes only the length
of a header (and in some recent code as well).

I have looked at some of this change - it is enormous.

	David



* Re: [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers
  2026-01-30 20:52 ` [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers David Laight
@ 2026-01-31  1:19   ` Linus Torvalds
  2026-01-31 15:37     ` David Laight
  2026-02-02 12:32   ` Breno Leitao
  1 sibling, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2026-01-31  1:19 UTC (permalink / raw)
  To: David Laight
  Cc: Breno Leitao, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
	metze, axboe, Stanislav Fomichev, io-uring, bpf, netdev,
	linux-kernel, kernel-team

On Fri, 30 Jan 2026 at 14:40, David Laight <david.laight.linux@gmail.com> wrote:
>
> There is not much point making the 'optval' parameter more than
> a structure of a user and kernel address - one of which will be NULL.

That's exactly what we do *NOT* want. Because people will get it
wrong, and then we're back to the bad old days where trivial bugs
result in security issues.

Can you point to an actual case where setsockopt / getsockopt would be
performance-critical? Typically you do it once or twice.

              Linus


* Re: [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers
  2026-01-31  1:19   ` Linus Torvalds
@ 2026-01-31 15:37     ` David Laight
  2026-01-31 15:53       ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: David Laight @ 2026-01-31 15:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Breno Leitao, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
	metze, axboe, Stanislav Fomichev, io-uring, bpf, netdev,
	linux-kernel, kernel-team

On Fri, 30 Jan 2026 17:19:55 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 30 Jan 2026 at 14:40, David Laight <david.laight.linux@gmail.com> wrote:
> >
> > There is not much point making the 'optval' parameter more than
> > a structure of a user and kernel address - one of which will be NULL.  
> 
> That's exactly what we do *NOT* want. Because people will get it
> wrong, and then we're back to the bad old days where trivial bugs
> result in security issues.

It can still be a (semi-)transparent structure that code isn't allowed to change.
That is no different from using iov_iter.

> Can you point to an actual case where setsockopt / getsockopt would be
> performance-critical? Typically you do it once or twice.

IIRC a really horrid one - I think for async io.
That is also one of the few where the supplied length is a lie.

	David

> 
>               Linus
> 



* Re: [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers
  2026-01-31 15:37     ` David Laight
@ 2026-01-31 15:53       ` Jens Axboe
  0 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2026-01-31 15:53 UTC (permalink / raw)
  To: David Laight, Linus Torvalds
  Cc: Breno Leitao, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
	metze, Stanislav Fomichev, io-uring, bpf, netdev, linux-kernel,
	kernel-team

On 1/31/26 8:37 AM, David Laight wrote:
> On Fri, 30 Jan 2026 17:19:55 -0800
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Fri, 30 Jan 2026 at 14:40, David Laight <david.laight.linux@gmail.com> wrote:
>>>
>>> There is not much point making the 'optval' parameter more than
>>> a structure of a user and kernel address - one of which will be NULL.  
>>
>> That's exactly what we do *NOT* want. Because people will get it
>> wrong, and then we're back to the bad old days where trivial bugs
>> result in security issues.
> 
> It can still be a (semi-)transparent structure that code isn't allowed
> to change. That is no different from using iov_iter.

Then why not just use iov_iter?! FWIW, I fully agree with Linus on this
one. We have an existing abstraction, we should use it. We've previously
optimized common cases, like ITER_UBUF, if that ended up being
important. We're better off using iov_iter and improving that, rather
than some new mixed pointer abomination.

>> Can you point to an actual case where setsockopt / getsockopt would be
>> performance-critical? Typically you do it once or twice.
> 
> IIRC a really horrid one - I think for async io.
> That is also one of the few where the supplied length is a lie.

Huh?

-- 
Jens Axboe


* Re: [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers
  2026-01-30 20:52 ` [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers David Laight
  2026-01-31  1:19   ` Linus Torvalds
@ 2026-02-02 12:32   ` Breno Leitao
  2026-02-02 22:31     ` David Laight
  1 sibling, 1 reply; 10+ messages in thread
From: Breno Leitao @ 2026-02-02 12:32 UTC (permalink / raw)
  To: David Laight
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, metze, axboe,
	Stanislav Fomichev, io-uring, bpf, netdev, Linus Torvalds,
	linux-kernel, kernel-team

Hello David,

On Fri, Jan 30, 2026 at 08:52:27PM +0000, David Laight wrote:

> The system call wrapper can do the user copies, it can also suppress
> the write if the value is unchanged (which matters with clac/slac).

This aligns with my proposal: using an in-kernel optlen that protocol
functions can operate on directly:

	typedef struct sockopt {
		struct iov_iter iter;
		int optlen;
	} sockopt_t;

> The obvious change would be to pass the length itself and make the
> return value -ERRNO or the size.

I explored this approach to avoid embedding optlen in sockopt (which was
Linus' original suggestion). I attempted returning the length both via
iov_iter and as a return value, but neither proved ideal.

> #define GETSOCKOPT_RVAL(errval, size) (1 << 31 | (errval) << 20 | (size))
> which would get picked in the rval < 0 path.
> It would also let 'return 0' mean 'don't change the size' requiring
> a special return for the one (or two?) places that want to set the
> size to zero and return success.

My conclusion is that encoding both optlen and error in the return value
requires pointer manipulation that isn't justified for this slow path.
While technically feasible, the resulting "mixed pointer abomination"
won't be worth it.

> There is not much point making the 'optval' parameter more than
> a structure of a user and kernel address - one of which will be NULL.
> (This is safer than sockptr_t's discriminant union.)

This approach forces every protocol to distinguish between userspace and
kernelspace, then perform the appropriate copy:

  static inline int mgetsockopt(void *kernel_optlen, void *user_optlen, ..)
  {
	....
	if (kernel_optlen)
		memcpy(kernel_optlen, newoptlen, ...
	else
		copy_to_user(user_optlen, newoptlen, ...
  }

Additionally, you'd need safeguards ensuring callers never pass both user
and kernel pointers simultaneously. This seems significantly worse than
using sockptr.

--breno


* Re: [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers
  2026-02-02 12:32   ` Breno Leitao
@ 2026-02-02 22:31     ` David Laight
  0 siblings, 0 replies; 10+ messages in thread
From: David Laight @ 2026-02-02 22:31 UTC (permalink / raw)
  To: Breno Leitao
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, metze, axboe,
	Stanislav Fomichev, io-uring, bpf, netdev, Linus Torvalds,
	linux-kernel, kernel-team

On Mon, 2 Feb 2026 04:32:42 -0800
Breno Leitao <leitao@debian.org> wrote:

> Hello David,
> 
> On Fri, Jan 30, 2026 at 08:52:27PM +0000, David Laight wrote:
> 
> > The system call wrapper can do the user copies, it can also suppress
> > the write if the value is unchanged (which matters with clac/slac).  
> 
> This aligns with my proposal: using an in-kernel optlen that protocol
> functions can operate on directly:
> 
> 	typedef struct sockopt {
> 		struct iov_iter iter;
> 		int optlen;
> 	} sockopt_t;
> 
> > The obvious change would be to pass the length itself and make the
> > return value -ERRNO or the size.  
> 
> I explored this approach to avoid embedding optlen in sockopt (which was
> Linus' original suggestion). I attempted returning the length both via
> iov_iter and as a return value, but neither proved ideal.
> 
> > #define GETSOCKOPT_RVAL(errval, size) (1 << 31 | (errval) << 20 | (size))
> > which would get picked in the rval < 0 path.
> > It would also let 'return 0' mean 'don't change the size' requiring
> > a special return for the one (or two?) places that want to set the
> > size to zero and return success.  
> 
> My conclusion is that encoding both optlen and error in the return value
> requires pointer manipulation that isn't justified for this slow path.
> While technically feasible, the resulting "mixed pointer abomination"
> won't be worth it.

Not really, they are both just numbers.
99% of the protocol code can just do 'return -Exxxx' or 'return size'.
That is all simple and foolproof.
The calling code (not many copies) does:
	rval = foo->getsockopt(..., size_in);
	size_out = size_in;
	if (rval >= 0) {
		if (rval > 0)
			size_out = rval;
		rval = 0;
	} else {
		/* abnormal path */
		if (!(rval & (1 << 30))) {
			size_out = rval & 0xfffff;
			rval = -((rval & ~(1 << 31)) >> 20);
		}
	}
	if (size_out != size_in)
		put_user(size_out);
	return rval;
(Or something similar depending on exactly how the values are merged.)
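
A self-contained sketch of one way the packing could work, assuming
errno values below 1024 and sizes below 2^20. With those limits bit 30
is clear in a packed value and set in a plain -errno, which gives the
decoder something to test. The model_ names and field widths are
assumptions for illustration, not part of any posted patch.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

#define MODEL_SIZE_BITS	20
#define MODEL_SIZE_MASK	((1u << MODEL_SIZE_BITS) - 1)
#define MODEL_ERR_MASK	0x3ffu	/* 10 bits, so bit 30 stays clear */

struct model_result {
	int err;		/* 0 or a negative errno */
	unsigned int size_out;	/* length to report to the caller */
};

/* Pack a positive errno and a size into one negative int. */
static int model_rval_encode(int errval, unsigned int size)
{
	uint32_t packed;

	assert(errval > 0 && (unsigned int)errval <= MODEL_ERR_MASK);
	assert(size <= MODEL_SIZE_MASK);
	packed = (((uint32_t)errval & MODEL_ERR_MASK) << MODEL_SIZE_BITS) |
		 (size & MODEL_SIZE_MASK);
	/* Adding INT_MIN sets the sign bit without overflow. */
	return (int)packed + INT_MIN;
}

/* Split a callback return value into an error and an output length. */
static struct model_result model_rval_decode(int rval, unsigned int size_in)
{
	struct model_result res = { 0, size_in };
	uint32_t bits;

	if (rval >= 0) {
		if (rval > 0)
			res.size_out = (unsigned int)rval;
		return res;
	}

	bits = (uint32_t)rval;
	if (!(bits & (1u << 30))) {
		/* packed form: error and size travel together */
		res.size_out = bits & MODEL_SIZE_MASK;
		res.err = -(int)((bits >> MODEL_SIZE_BITS) & MODEL_ERR_MASK);
	} else {
		/* plain -errno: leave the size unchanged */
		res.err = rval;
	}
	return res;
}
```

A protocol that only ever does 'return -Exxxx' or 'return size' never
touches the encoder; only the rare "error plus new size" case does.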

> 
> > There is not much point making the 'optval' parameter more than
> > a structure of a user and kernel address - one of which will be NULL.
> > (This is safer than sockptr_t's discriminant union.)  
> 
> This approach forces every protocol to distinguish between userspace and
> kernelspace, then perform the appropriate copy:
> 
>   static inline int mgetsockopt(void *kernel_optlen, void *user_optlen, ..)
>   {
> 	....
> 	if (kernel_optlen)
> 		memcpy(kernel_optlen, newoptlen, ...
> 	else
> 		copy_to_user(user_optlen, newoptlen, ...
>   }

That is a function provided by the implementation.
It is no different from using the ones that act on iov_iter.
The real difficulty is stopping the usual culprits (bpf and io_uring)
from cheating and looking inside the structures.

> Additionally, you'd need safeguards ensuring callers never pass both user
> and kernel pointers simultaneously. This seems significantly worse than
> using sockptr.

Sockptr has the real disadvantage that it is very easy to mix up the
kernel and user pointers (there is some horrid code that looks inside).
If you have separate pointers that can't happen.
You might access NULL, but you are never going to use the wrong address.
Remember some systems (s390?) use the same numbers for user and kernel
addresses - you have to get it right.
In any case, if both addresses are set you can just have a rule that
one is used by preference - it isn't a problem.

There might be legitimate reasons for setting both pointers.
Consider setsockopt, the wrapper could copy small user structures
into an on-stack buffer.
The structure would then need to contain the address/length of the
kernel buffer as well as the actual user address in case the code
wants to read more than the expected data length.
For a kernel caller you also want the actual length of the buffer
as a separate field from the length of the [sg]etsockopt().

I'm not sure what fields you need for the address buffer.
Probably 'user address', 'kernel address' and 'kernel length',
what you don't need is support for scatter-gather, page list,
pipes etc.


> 
> --breno
> 

