* [PATCH net-next RFC 1/3] net: add getsockopt_iter callback to proto_ops
2026-01-30 18:46 [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers Breno Leitao
@ 2026-01-30 18:46 ` Breno Leitao
2026-01-30 18:46 ` [PATCH net-next RFC 2/3] net: prefer getsockopt_iter in do_sock_getsockopt Breno Leitao
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: Breno Leitao @ 2026-01-30 18:46 UTC
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, metze, axboe,
Stanislav Fomichev
Cc: io-uring, bpf, netdev, Linus Torvalds, linux-kernel, kernel-team,
Breno Leitao
Add a new getsockopt_iter callback to struct proto_ops that uses
sockopt_t, a type-safe wrapper around iov_iter. This provides a clean
interface for socket option operations that works with both user and
kernel buffers.
The sockopt_t type encapsulates an iov_iter and an optlen field.
The optlen field, although not suggested by Linus, serves as both input
(buffer size) and output (returned data size). It lets callbacks report
an arbitrary length, independent of the bytes written via
copy_to_iter(), so it is kept separate from iov_iter.count.
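To illustrate why optlen is kept apart from the iterator's byte count, here
is a minimal userspace sketch. It is not kernel code: struct sockopt_model,
model_copy_to() and model_getsockopt() are hypothetical stand-ins for
sockopt_t, copy_to_iter() and a protocol callback. The callback copies only
what fits but still reports the full length it wanted to return.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for sockopt_t: a bounded output cursor
 * (playing the role of iov_iter) plus a separate optlen field. */
struct sockopt_model {
	unsigned char *buf;	/* next byte to write */
	size_t remaining;	/* bytes still writable, like iov_iter.count */
	int optlen;		/* in: buffer size, out: length reported back */
};

/* Copy up to len bytes and return how many were copied, mimicking the
 * short-copy behaviour of copy_to_iter(). */
static size_t model_copy_to(struct sockopt_model *opt, const void *src,
			    size_t len)
{
	size_t n = len < opt->remaining ? len : opt->remaining;

	memcpy(opt->buf, src, n);
	opt->buf += n;
	opt->remaining -= n;
	return n;
}

/* Hypothetical callback: has 8 bytes to return, copies what fits and
 * always reports 8 through optlen. */
static int model_getsockopt(struct sockopt_model *opt)
{
	static const unsigned char value[8] = {1, 2, 3, 4, 5, 6, 7, 8};

	assert(opt != NULL);
	model_copy_to(opt, value, sizeof(value));
	opt->optlen = (int)sizeof(value);
	return 0;
}
```

With a 4-byte buffer, the cursor ends up exhausted after 4 bytes while
optlen reads 8, which a single shared counter could not express.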
This is preparatory work for removing the SOL_SOCKET level restriction
from io_uring getsockopt operations.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
include/linux/net.h | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/include/linux/net.h b/include/linux/net.h
index f58b38ab37f8a..94f6c86769afc 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -23,9 +23,26 @@
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/sockptr.h>
+#include <linux/uio.h>
#include <uapi/linux/net.h>
+/**
+ * struct sockopt - socket option value container
+ * @iter: iov_iter for reading/writing option data
+ * @optlen: buffer size on input; set by callback to the returned data size
+ *
+ * Type-safe wrapper for socket option data that works with both
+ * user and kernel buffers.
+ *
+ * The optlen field allows callbacks to return a specific length value
+ * independent of the bytes written via copy_to_iter().
+ */
+typedef struct sockopt {
+ struct iov_iter iter;
+ int optlen;
+} sockopt_t;
+
struct poll_table_struct;
struct pipe_inode_info;
struct inode;
@@ -192,6 +209,8 @@ struct proto_ops {
unsigned int optlen);
int (*getsockopt)(struct socket *sock, int level,
int optname, char __user *optval, int __user *optlen);
+ int (*getsockopt_iter)(struct socket *sock, int level,
+ int optname, sockopt_t *opt);
void (*show_fdinfo)(struct seq_file *m, struct socket *sock);
int (*sendmsg) (struct socket *sock, struct msghdr *m,
size_t total_len);
--
2.47.3
* [PATCH net-next RFC 2/3] net: prefer getsockopt_iter in do_sock_getsockopt
2026-01-30 18:46 [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers Breno Leitao
2026-01-30 18:46 ` [PATCH net-next RFC 1/3] net: add getsockopt_iter callback to proto_ops Breno Leitao
@ 2026-01-30 18:46 ` Breno Leitao
2026-01-30 18:46 ` [PATCH net-next RFC 3/3] netlink: convert to getsockopt_iter Breno Leitao
2026-01-30 20:52 ` [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers David Laight
3 siblings, 0 replies; 6+ messages in thread
From: Breno Leitao @ 2026-01-30 18:46 UTC
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, metze, axboe,
Stanislav Fomichev
Cc: io-uring, bpf, netdev, Linus Torvalds, linux-kernel, kernel-team,
Breno Leitao
Update do_sock_getsockopt() to use the new getsockopt_iter callback
when available. Add do_sock_getsockopt_iter() helper that:
1. Reads optlen from user/kernel space
2. Initializes a sockopt_t with the appropriate iov_iter (kvec for
kernel, ubuf for user buffers) and sets opt.optlen
3. Calls the protocol's getsockopt_iter callback
4. Writes opt.optlen back to user/kernel space
The callback is responsible for setting opt.optlen to indicate the
returned data size.
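The four steps above can be sketched in plain userspace C. This is only an
illustration of the control flow, not the kernel helper: struct opt_model,
model_do_getsockopt() and model_int_option() are hypothetical names, plain
pointers stand in for sockptr_t, and the up-front rejection of negative
lengths is an assumption of this sketch rather than something the posted
helper does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for sockopt_t. */
struct opt_model {
	unsigned char *buf;	/* destination buffer */
	size_t remaining;	/* writable bytes left */
	int optlen;		/* in: buffer size, out: returned size */
};

typedef int (*getsockopt_cb)(struct opt_model *opt);

static int model_do_getsockopt(getsockopt_cb cb, void *optval, int *optlen)
{
	struct opt_model opt;
	int len, err;

	assert(cb != NULL && optlen != NULL);

	len = *optlen;			/* 1. read the caller's length */
	if (len < 0)			/* assumption: reject early */
		return -EINVAL;

	opt.buf = optval;		/* 2. set up the container */
	opt.remaining = (size_t)len;
	opt.optlen = len;

	err = cb(&opt);			/* 3. call the protocol callback */
	if (err)
		return err;		/* length is left untouched on error */

	*optlen = opt.optlen;		/* 4. write the length back */
	return 0;
}

/* Hypothetical callback returning a single int option. */
static int model_int_option(struct opt_model *opt)
{
	int val = 1;

	if (opt->remaining < sizeof(val))
		return -EINVAL;
	memcpy(opt->buf, &val, sizeof(val));
	opt->optlen = (int)sizeof(val);
	return 0;
}
```

Note that, as in the posted helper, the length is written back only on
success, so a callback that wants to fail and still update the length has no
way to do so in this shape.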
Signed-off-by: Breno Leitao <leitao@debian.org>
---
net/socket.c | 42 +++++++++++++++++++++++++++++++++++++++---
1 file changed, 39 insertions(+), 3 deletions(-)
diff --git a/net/socket.c b/net/socket.c
index 136b98c54fb37..2d830262b1be5 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -77,6 +77,7 @@
#include <linux/mount.h>
#include <linux/pseudo_fs.h>
#include <linux/security.h>
+#include <linux/uio.h>
#include <linux/syscalls.h>
#include <linux/compat.h>
#include <linux/kmod.h>
@@ -2356,6 +2357,38 @@ SYSCALL_DEFINE5(setsockopt, int, fd, int, level, int, optname,
INDIRECT_CALLABLE_DECLARE(bool tcp_bpf_bypass_getsockopt(int level,
int optname));
+static int do_sock_getsockopt_iter(struct socket *sock,
+ const struct proto_ops *ops, int level,
+ int optname, sockptr_t optval,
+ sockptr_t optlen)
+{
+ struct kvec kvec;
+ sockopt_t opt;
+ int koptlen;
+ int err;
+
+ if (copy_from_sockptr(&koptlen, optlen, sizeof(int)))
+ return -EFAULT;
+
+ if (optval.is_kernel) {
+ kvec.iov_base = optval.kernel;
+ kvec.iov_len = koptlen;
+ iov_iter_kvec(&opt.iter, ITER_DEST, &kvec, 1, koptlen);
+ } else {
+ iov_iter_ubuf(&opt.iter, ITER_DEST, optval.user, koptlen);
+ }
+ opt.optlen = koptlen;
+
+ err = ops->getsockopt_iter(sock, level, optname, &opt);
+ if (err)
+ return err;
+
+ if (copy_to_sockptr(optlen, &opt.optlen, sizeof(int)))
+ return -EFAULT;
+
+ return 0;
+}
+
int do_sock_getsockopt(struct socket *sock, bool compat, int level,
int optname, sockptr_t optval, sockptr_t optlen)
{
@@ -2373,15 +2406,18 @@ int do_sock_getsockopt(struct socket *sock, bool compat, int level,
ops = READ_ONCE(sock->ops);
if (level == SOL_SOCKET) {
err = sk_getsockopt(sock->sk, level, optname, optval, optlen);
- } else if (unlikely(!ops->getsockopt)) {
- err = -EOPNOTSUPP;
- } else {
+ } else if (ops->getsockopt_iter) {
+ err = do_sock_getsockopt_iter(sock, ops, level, optname,
+ optval, optlen);
+ } else if (ops->getsockopt) {
if (WARN_ONCE(optval.is_kernel || optlen.is_kernel,
"Invalid argument type"))
return -EOPNOTSUPP;
err = ops->getsockopt(sock, level, optname, optval.user,
optlen.user);
+ } else {
+ err = -EOPNOTSUPP;
}
if (!compat)
--
2.47.3
* [PATCH net-next RFC 3/3] netlink: convert to getsockopt_iter
2026-01-30 18:46 [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers Breno Leitao
2026-01-30 18:46 ` [PATCH net-next RFC 1/3] net: add getsockopt_iter callback to proto_ops Breno Leitao
2026-01-30 18:46 ` [PATCH net-next RFC 2/3] net: prefer getsockopt_iter in do_sock_getsockopt Breno Leitao
@ 2026-01-30 18:46 ` Breno Leitao
2026-01-30 20:52 ` [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers David Laight
3 siblings, 0 replies; 6+ messages in thread
From: Breno Leitao @ 2026-01-30 18:46 UTC
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, metze, axboe,
Stanislav Fomichev
Cc: io-uring, bpf, netdev, Linus Torvalds, linux-kernel, kernel-team,
Breno Leitao
Convert netlink's getsockopt implementation to use the new
getsockopt_iter callback with sockopt_t.
Key changes:
- Replace (char __user *optval, int __user *optlen) with sockopt_t *opt
- Use opt->optlen for buffer length (input) and returned size (output)
- Use copy_to_iter() instead of put_user()/copy_to_user()
The optlen field allows callbacks to return a specific length value
independent of the bytes written via copy_to_iter().
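NETLINK_LIST_MEMBERSHIPS is the case where this matters: the reported
length is ALIGN(BITS_TO_BYTES(nlk->ngroups), sizeof(u32)), which depends
only on the number of groups and not on how many u32 words were copied. A
small userspace sketch of that arithmetic follows; bits_to_bytes(),
align_up() and membership_len() are hypothetical re-statements of the kernel
macros, written here only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace re-statement of BITS_TO_BYTES(): round bits up to whole bytes. */
static size_t bits_to_bytes(size_t bits)
{
	return (bits + 7) / 8;
}

/* Userspace re-statement of ALIGN() for a power-of-two alignment. */
static size_t align_up(size_t n, size_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (n + a - 1) & ~(a - 1);
}

/* Length reported for a socket with ngroups multicast groups. */
static size_t membership_len(size_t ngroups)
{
	return align_up(bits_to_bytes(ngroups), sizeof(uint32_t));
}
```

For example, 33 groups need 5 bytes of bitmap, which rounds up to 8, so a
caller that passed a 4-byte buffer learns it must retry with 8.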
This enables io_uring to call netlink's getsockopt with kernel buffers.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
net/netlink/af_netlink.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 8e5151f0c6e46..8a195eb1ef761 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -39,6 +39,7 @@
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
+#include <linux/uio.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/rtnetlink.h>
@@ -1716,7 +1717,7 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
}
static int netlink_getsockopt(struct socket *sock, int level, int optname,
- char __user *optval, int __user *optlen)
+ sockopt_t *opt)
{
struct sock *sk = sock->sk;
struct netlink_sock *nlk = nlk_sk(sk);
@@ -1726,8 +1727,7 @@ static int netlink_getsockopt(struct socket *sock, int level, int optname,
if (level != SOL_NETLINK)
return -ENOPROTOOPT;
- if (get_user(len, optlen))
- return -EFAULT;
+ len = opt->optlen;
if (len < 0)
return -EINVAL;
@@ -1743,6 +1743,8 @@ static int netlink_getsockopt(struct socket *sock, int level, int optname,
break;
case NETLINK_LIST_MEMBERSHIPS: {
int pos, idx, shift, err = 0;
+ u32 group_val;
+ size_t size;
netlink_lock_table();
for (pos = 0; pos * 8 < nlk->ngroups; pos += sizeof(u32)) {
@@ -1751,14 +1753,14 @@ static int netlink_getsockopt(struct socket *sock, int level, int optname,
idx = pos / sizeof(unsigned long);
shift = (pos % sizeof(unsigned long)) * 8;
- if (put_user((u32)(nlk->groups[idx] >> shift),
- (u32 __user *)(optval + pos))) {
+ group_val = (u32)(nlk->groups[idx] >> shift);
+ size = copy_to_iter(&group_val, sizeof(group_val), &opt->iter);
+ if (size != sizeof(group_val)) {
err = -EFAULT;
break;
}
}
- if (put_user(ALIGN(BITS_TO_BYTES(nlk->ngroups), sizeof(u32)), optlen))
- err = -EFAULT;
+ opt->optlen = ALIGN(BITS_TO_BYTES(nlk->ngroups), sizeof(u32));
netlink_unlock_table();
return err;
}
@@ -1784,10 +1786,10 @@ static int netlink_getsockopt(struct socket *sock, int level, int optname,
len = sizeof(int);
val = test_bit(flag, &nlk->flags);
- if (put_user(len, optlen) ||
- copy_to_user(optval, &val, len))
+ if (copy_to_iter(&val, len, &opt->iter) != len)
return -EFAULT;
+ opt->optlen = sizeof(int);
return 0;
}
@@ -2813,7 +2815,7 @@ static const struct proto_ops netlink_ops = {
.listen = sock_no_listen,
.shutdown = sock_no_shutdown,
.setsockopt = netlink_setsockopt,
- .getsockopt = netlink_getsockopt,
+ .getsockopt_iter = netlink_getsockopt,
.sendmsg = netlink_sendmsg,
.recvmsg = netlink_recvmsg,
.mmap = sock_no_mmap,
--
2.47.3
* Re: [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers
2026-01-30 18:46 [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers Breno Leitao
` (2 preceding siblings ...)
2026-01-30 18:46 ` [PATCH net-next RFC 3/3] netlink: convert to getsockopt_iter Breno Leitao
@ 2026-01-30 20:52 ` David Laight
2026-01-31 1:19 ` Linus Torvalds
3 siblings, 1 reply; 6+ messages in thread
From: David Laight @ 2026-01-30 20:52 UTC
To: Breno Leitao
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, metze, axboe,
Stanislav Fomichev, io-uring, bpf, netdev, Linus Torvalds,
linux-kernel, kernel-team
On Fri, 30 Jan 2026 10:46:16 -0800
Breno Leitao <leitao@debian.org> wrote:
> Currently, the .getsockopt callback cannot be called with kernel buffers
> because it requires userspace addresses:
>
> int (*getsockopt)(struct socket *sock, int level,
> int optname, char __user *optval, int __user *optlen);
>
> This prevents kernel callers (io_uring, BPF, etc) from using getsockopt
> on levels other than SOL_SOCKET, since they pass kernel pointers rather
> than __user pointers.
I had thoughts about this as well.
I think using iov_iter is over the top and may have measurable performance
impact for some paths.
I think the first thing to do is sort out 'optlen'.
There is absolutely no reason for the user pointer being passed into
all the per-protocol functions.
(and the code changes that use sockptr_t are just stupid...)
The system call wrapper can do the user copies, it can also suppress
the write if the value is unchanged (which matters with clac/stac).
The obvious change would be to pass the length itself and make the
return value -ERRNO or the size.
The annoyance is the few places that want to return an error and
change optlen.
That might be best addressed by something like:
#define GETSOCKOPT_RVAL(errval, size) (1 << 31 | (errval) << 20 | (size))
which would get picked in the rval < 0 path.
It would also let 'return 0' mean 'don't change the size' requiring
a special return for the one (or two?) places that want to set the
size to zero and return success.
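As a rough sketch of how such a packed return value could be built and taken
apart, the following uses unsigned arithmetic so that setting bit 31 is well
defined in C. The field widths (20 bits of size, 11 bits of error number) are
read off the shifts in the macro above; the helper names are hypothetical,
and how packed values would be told apart from plain -errno returns is not
specified in this message and is left out here.

```c
#include <assert.h>
#include <stdint.h>

#define RVAL_SIZE_BITS	20u
#define RVAL_SIZE_MASK	((UINT32_C(1) << RVAL_SIZE_BITS) - 1)
#define RVAL_ERR_MASK	UINT32_C(0x7ff)	/* bits 20..30 hold the error */
#define RVAL_FLAG	(UINT32_C(1) << 31)

/* Pack an error number and a size into one negative 32-bit value. */
static int32_t rval_pack(uint32_t err, uint32_t size)
{
	assert(err <= RVAL_ERR_MASK && size <= RVAL_SIZE_MASK);

	uint32_t raw = RVAL_FLAG | (err << RVAL_SIZE_BITS) | size;

	/* Bit 31 of raw is set, so ~raw fits in int32_t; -x - 1 then
	 * yields the two's complement value without a lossy cast. */
	return -(int32_t)(~raw) - 1;
}

/* Extract the error number from a packed value. */
static uint32_t rval_err(int32_t rval)
{
	return ((uint32_t)rval >> RVAL_SIZE_BITS) & RVAL_ERR_MASK;
}

/* Extract the size from a packed value. */
static uint32_t rval_size(int32_t rval)
{
	return (uint32_t)rval & RVAL_SIZE_MASK;
}
```

The 11-bit error field caps the error number at 2047 and the size at just
under 1 MiB, which is one of the trade-offs of squeezing both into an int.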
The length passed should also be 'unsigned int' - with a check for
negative values in the system call wrapper.
(There are many broken drivers that treat negative lengths as 4.)
There is not much point making the 'optval' parameter more than
a structure of a user and kernel address - one of which will be NULL.
(This is safer than sockptr_t's discriminated union.)
You can't police the length because it is sometimes only the length
of a header (and in some recent code as well).
I have looked at some of this change - it is enormous.
David
>
> Following Linus' suggestion [0], this series introduces a wrapper
> around iov_iter (sockopt_t) and a temporary getsockopt_iter callback:
>
> typedef struct sockopt {
> struct iov_iter iter;
> int optlen;
> } sockopt_t;
>
> Note: optlen was not suggested by Linus, but I believe it is needed, given
> that protocols may pass arbitrary length values back to userspace.
>
> And the callback becomes:
>
> int (*getsockopt_iter)(struct socket *sock, int level,
> int optname, sockopt_t *opt);
>
> The sockopt_t structure encapsulates:
> - An iov_iter for reading/writing option data (works with both user
> and kernel buffers)
> - An optlen field for buffer size (input) and returned data size
> (output)
>
> The plan is to enable getsockopt to leverage kernel buffers initially,
> but then move .setsockopt from sockptr_t into this as well.
>
> This series:
>
> 1. Adds the sockopt_t type and getsockopt_iter callback to proto_ops
> 2. Adds do_sock_getsockopt_iter() helper that prefers getsockopt_iter
> 3. Converts one protocol (netlink) to use getsockopt_iter as a proof of
> concept
>
> This is what I have in mind for this work stream, to make it more
> digestible:
>
> * Keep the temporary getsockopt_iter callback so that protocols can
> migrate gradually.
> * Once all protocols have been converted, getsockopt can be removed and
> getsockopt_iter renamed back to getsockopt with the new API.
> * Once the protocols are converted, the SOL_SOCKET limitation in
> io_uring_cmd_getsockopt() will be removed.
> * Convert setsockopt() to also use a similar strategy, moving it away
> from sockptr_t.
> * Remove sockptr_t in the front end (do_sock_getsockopt(),
> io_uring_cmd_getsockopt()) and start with sockopt_t (instead of
> sockptr_t) in __sys_getsockopt() and io_uring_cmd_getsockopt()
>
> Link: https://lore.kernel.org/all/CAHk-=whmzrO-BMU=uSVXbuoLi-3tJsO=0kHj1BCPBE3F2kVhTA@mail.gmail.com/ [0]
> ---
> Breno Leitao (3):
> net: add getsockopt_iter callback to proto_ops
> net: prefer getsockopt_iter in do_sock_getsockopt
> netlink: convert to getsockopt_iter
>
> include/linux/net.h | 19 +++++++++++++++++++
> net/netlink/af_netlink.c | 22 ++++++++++++----------
> net/socket.c | 42 +++++++++++++++++++++++++++++++++++++++---
> 3 files changed, 70 insertions(+), 13 deletions(-)
> ---
> base-commit: 4d310797262f0ddf129e76c2aad2b950adaf1fda
> change-id: 20260130-getsockopt-9f36625eedcb
>
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
>
>
* Re: [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers
2026-01-30 20:52 ` [PATCH net-next RFC 0/3] net: move .getsockopt away from __user buffers David Laight
@ 2026-01-31 1:19 ` Linus Torvalds
0 siblings, 0 replies; 6+ messages in thread
From: Linus Torvalds @ 2026-01-31 1:19 UTC
To: David Laight
Cc: Breno Leitao, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
metze, axboe, Stanislav Fomichev, io-uring, bpf, netdev,
linux-kernel, kernel-team
On Fri, 30 Jan 2026 at 14:40, David Laight <david.laight.linux@gmail.com> wrote:
>
> There is not much point making the 'optval' parameter more than
> a structure of a user and kernel address - one of which will be NULL.
That's exactly what we do *NOT* want. Because people will get it
wrong, and then we're back to the bad old days where trivial bugs
result in security issues.
Can you point to an actual case where setsockopt / getsockopt would be
performance-critical? Typically you do it once or twice.
Linus