public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed
From: Mina Almasry <almasrymina@google.com>
To: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, io-uring@vger.kernel.org,
	virtualization@lists.linux.dev, kvm@vger.kernel.org,
	linux-kselftest@vger.kernel.org,
	"David S. Miller" <davem@davemloft.net>,
	"Eric Dumazet" <edumazet@google.com>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Simon Horman" <horms@kernel.org>,
	"Donald Hunter" <donald.hunter@gmail.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Andrew Lunn" <andrew+netdev@lunn.ch>,
	"Jeroen de Borst" <jeroendb@google.com>,
	"Harshitha Ramamurthy" <hramamurthy@google.com>,
	"Kuniyuki Iwashima" <kuniyu@amazon.com>,
	"Willem de Bruijn" <willemb@google.com>,
	"Jens Axboe" <axboe@kernel.dk>,
	"Pavel Begunkov" <asml.silence@gmail.com>,
	"David Ahern" <dsahern@kernel.org>,
	"Neal Cardwell" <ncardwell@google.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"Jason Wang" <jasowang@redhat.com>,
	"Xuan Zhuo" <xuanzhuo@linux.alibaba.com>,
	"Eugenio Pérez" <eperezma@redhat.com>,
	"Stefan Hajnoczi" <stefanha@redhat.com>,
	"Stefano Garzarella" <sgarzare@redhat.com>,
	"Shuah Khan" <shuah@kernel.org>,
	sdf@fomichev.me, dw@davidwei.uk,
	"Jamal Hadi Salim" <jhs@mojatatu.com>,
	"Victor Nogueira" <victor@mojatatu.com>,
	"Pedro Tammela" <pctammela@mojatatu.com>,
	"Samiullah Khawaja" <skhawaja@google.com>,
	"Kaiyuan Zhang" <kaiyuanz@google.com>
Subject: Re: [PATCH net-next v13 4/9] net: devmem: Implement TX path
Date: Fri, 2 May 2025 12:20:59 -0700	[thread overview]
Message-ID: <CAHS8izMefrkHf9WXerrOY4Wo8U2KmxSVkgY+4JB+6iDuoCZ3WQ@mail.gmail.com> (raw)
In-Reply-To: <53433089-7beb-46cf-ae8a-6c58cd909e31@redhat.com>

On Fri, May 2, 2025 at 4:47 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hi,
>
> On 4/29/25 5:26 AM, Mina Almasry wrote:
> > Augment dmabuf binding to be able to handle TX. Additional to all the RX
> > binding, we also create tx_vec needed for the TX path.
> >
> > Provide API for sendmsg to be able to send dmabufs bound to this device:
> >
> > - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
> > - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.
> >
> > Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
> > implementation, while disabling instances where MSG_ZEROCOPY falls back
> > to copying.
> >
> > We additionally pipe the binding down to the new
> > zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems
> > instead of the traditional page netmems.
> >
> > We also special case skb_frag_dma_map to return the dma-address of these
> > dmabuf net_iovs instead of attempting to map pages.
> >
> > The TX path may release the dmabuf in a context where we cannot wait.
> > This happens when the user unbinds a TX dmabuf while there are still
> > references to its netmems in the TX path. In that case, the netmems will
> > be put_netmem'd from a context where we can't unmap the dmabuf, Resolve
> > this by making __net_devmem_dmabuf_binding_free schedule_work'd.
> >
> > Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat
> > of the implementation came from devmem TCP RFC v1[1], which included the
> > TX path, but Stan did all the rebasing on top of netmem/net_iov.
> >
> > Cc: Stanislav Fomichev <sdf@fomichev.me>
> > Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > Acked-by: Stanislav Fomichev <sdf@fomichev.me>
>
> I'm sorry for the late feedback. A bunch of things I did not notice
> before...
>
> > @@ -701,6 +743,8 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
> >
> >       if (msg && msg->msg_ubuf && msg->sg_from_iter)
> >               ret = msg->sg_from_iter(skb, from, length);
> > +     else if (unlikely(binding))
>
> I'm unsure if the unlikely() here (and in similar tests below) it's
> worthy: depending on the actual workload this condition could be very
> likely.
>

Right, for now I'm guessing the MSG_ZEROCOPY use case in the else is
more common workload, and putting the devmem use case in the unlikely
path so I don't regress other use cases. We could revisit this in the
future. In my tests, the devmem workload doesn't seem affected by this
unlikely.

> [...]
> > @@ -1066,11 +1067,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> >       int flags, err, copied = 0;
> >       int mss_now = 0, size_goal, copied_syn = 0;
> >       int process_backlog = 0;
> > +     bool sockc_valid = true;
> >       int zc = 0;
> >       long timeo;
> >
> >       flags = msg->msg_flags;
> >
> > +     sockc = (struct sockcm_cookie){ .tsflags = READ_ONCE(sk->sk_tsflags),
> > +                                     .dmabuf_id = 0 };
>
> the '.dmabuf_id = 0' part is not needed, and possibly the code is
> clearer without it.
>
> > +     if (msg->msg_controllen) {
> > +             err = sock_cmsg_send(sk, msg, &sockc);
> > +             if (unlikely(err))
> > +                     /* Don't return error until MSG_FASTOPEN has been
> > +                      * processed; that may succeed even if the cmsg is
> > +                      * invalid.
> > +                      */
> > +                     sockc_valid = false;
> > +     }
> > +
> >       if ((flags & MSG_ZEROCOPY) && size) {
> >               if (msg->msg_ubuf) {
> >                       uarg = msg->msg_ubuf;
> > @@ -1078,7 +1092,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> >                               zc = MSG_ZEROCOPY;
> >               } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
> >                       skb = tcp_write_queue_tail(sk);
> > -                     uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
> > +                     uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb),
> > +                                                 sockc_valid && !!sockc.dmabuf_id);
>
> If sock_cmsg_send() failed and the user did not provide a dmabuf_id,
> memory accounting will be incorrect.
>

Forgive me but I don't see it. sockc_valid will be false, so
msg_zerocopy_realloc will do the normal MSG_ZEROCOPY accounting. Then
below whech check sockc_valid in place of where we did the
sock_cmsg_send before, and goto err. I assume the goto err should undo
the memory accounting in the new code as in the old code. Can you
elaborate on the bug you see?

> >                       if (!uarg) {
> >                               err = -ENOBUFS;
> >                               goto out_err;
> > @@ -1087,12 +1102,27 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> >                               zc = MSG_ZEROCOPY;
> >                       else
> >                               uarg_to_msgzc(uarg)->zerocopy = 0;
> > +
> > +                     if (sockc_valid && sockc.dmabuf_id) {
> > +                             binding = net_devmem_get_binding(sk, sockc.dmabuf_id);
> > +                             if (IS_ERR(binding)) {
> > +                                     err = PTR_ERR(binding);
> > +                                     binding = NULL;
> > +                                     goto out_err;
> > +                             }
> > +                     }
> >               }
> >       } else if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES) && size) {
> >               if (sk->sk_route_caps & NETIF_F_SG)
> >                       zc = MSG_SPLICE_PAGES;
> >       }
> >
> > +     if (sockc_valid && sockc.dmabuf_id &&
> > +         (!(flags & MSG_ZEROCOPY) || !sock_flag(sk, SOCK_ZEROCOPY))) {
> > +             err = -EINVAL;
> > +             goto out_err;
> > +     }
> > +
> >       if (unlikely(flags & MSG_FASTOPEN ||
> >                    inet_test_bit(DEFER_CONNECT, sk)) &&
> >           !tp->repair) {
> > @@ -1131,14 +1161,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> >               /* 'common' sending to sendq */
> >       }
> >
> > -     sockc = (struct sockcm_cookie) { .tsflags = READ_ONCE(sk->sk_tsflags)};
> > -     if (msg->msg_controllen) {
> > -             err = sock_cmsg_send(sk, msg, &sockc);
> > -             if (unlikely(err)) {
> > -                     err = -EINVAL;
> > -                     goto out_err;
> > -             }
> > -     }
> > +     if (!sockc_valid)
> > +             goto out_err;
>
> Here 'err' could have been zeroed by tcp_sendmsg_fastopen(), and out_err
> could emit a wrong return value.
>

Good point indeed.

> Possibly it's better to keep the 'dmabuf_id' initialization out of
> sock_cmsg_send() in a separate helper could simplify the handling here?
>

This should be possible as well.

-- 
Thanks,
Mina

  parent reply	other threads:[~2025-05-02 19:21 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-29  3:26 [PATCH net-next v13 0/9] Device memory TCP TX Mina Almasry
2025-04-29  3:26 ` [PATCH net-next v13 1/9] netmem: add niov->type attribute to distinguish different net_iov types Mina Almasry
2025-04-29  3:26 ` [PATCH net-next v13 2/9] net: add get_netmem/put_netmem support Mina Almasry
2025-04-29  3:26 ` [PATCH net-next v13 3/9] net: devmem: TCP tx netlink api Mina Almasry
2025-04-29  3:26 ` [PATCH net-next v13 4/9] net: devmem: Implement TX path Mina Almasry
2025-05-02 11:47   ` Paolo Abeni
2025-05-02 11:51     ` Paolo Abeni
2025-05-02 19:25       ` Mina Almasry
2025-05-05  7:37         ` Paolo Abeni
2025-05-02 19:20     ` Mina Almasry [this message]
2025-05-05  7:36       ` Paolo Abeni
2025-05-05 19:21         ` Mina Almasry
2025-05-06  1:37   ` Jakub Kicinski
2025-04-29  3:26 ` [PATCH net-next v13 5/9] net: add devmem TCP TX documentation Mina Almasry
2025-04-29  3:26 ` [PATCH net-next v13 6/9] net: enable driver support for netmem TX Mina Almasry
2025-04-29  3:26 ` [PATCH net-next v13 7/9] gve: add netmem TX support to GVE DQO-RDA mode Mina Almasry
2025-04-29  3:26 ` [PATCH net-next v13 8/9] net: check for driver support in netmem TX Mina Almasry
2025-04-29  3:26 ` [PATCH net-next v13 9/9] selftests: ncdevmem: Implement devmem TCP TX Mina Almasry

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAHS8izMefrkHf9WXerrOY4Wo8U2KmxSVkgY+4JB+6iDuoCZ3WQ@mail.gmail.com \
    --to=almasrymina@google.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=corbet@lwn.net \
    --cc=davem@davemloft.net \
    --cc=donald.hunter@gmail.com \
    --cc=dsahern@kernel.org \
    --cc=dw@davidwei.uk \
    --cc=edumazet@google.com \
    --cc=eperezma@redhat.com \
    --cc=horms@kernel.org \
    --cc=hramamurthy@google.com \
    --cc=io-uring@vger.kernel.org \
    --cc=jasowang@redhat.com \
    --cc=jeroendb@google.com \
    --cc=jhs@mojatatu.com \
    --cc=kaiyuanz@google.com \
    --cc=kuba@kernel.org \
    --cc=kuniyu@amazon.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=mst@redhat.com \
    --cc=ncardwell@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=pctammela@mojatatu.com \
    --cc=sdf@fomichev.me \
    --cc=sgarzare@redhat.com \
    --cc=shuah@kernel.org \
    --cc=skhawaja@google.com \
    --cc=stefanha@redhat.com \
    --cc=victor@mojatatu.com \
    --cc=virtualization@lists.linux.dev \
    --cc=willemb@google.com \
    --cc=xuanzhuo@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox