From: Pavel Begunkov <[email protected]>
To: Mina Almasry <[email protected]>, David Wei <[email protected]>
Cc: [email protected], [email protected],
Jens Axboe <[email protected]>, Jakub Kicinski <[email protected]>,
Paolo Abeni <[email protected]>,
"David S. Miller" <[email protected]>,
Eric Dumazet <[email protected]>,
Jesper Dangaard Brouer <[email protected]>,
David Ahern <[email protected]>,
Stanislav Fomichev <[email protected]>,
Joe Damato <[email protected]>,
Pedro Tammela <[email protected]>
Subject: Re: [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
Date: Fri, 1 Nov 2024 21:09:14 +0000 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <CAHS8izNbNCAmecRDCL_rRjMU0Spnqo_BY5pyG1EhF2rZFx+y0A@mail.gmail.com>
On 11/1/24 20:06, Mina Almasry wrote:
...
>> +__maybe_unused
>> +static const struct memory_provider_ops io_uring_pp_zc_ops;
>> +
>> +static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
>> +{
>> + struct net_iov_area *owner = net_iov_owner(niov);
>> +
>> + return container_of(owner, struct io_zcrx_area, nia);
>> +}
>> +
>
> We discussed this before I disappeared on vacation but I'm not too
> convinced to be honest, sorry.
>
> It's invalid to call io_zcrx_iov_to_area on a devmem niov and vice
> versa, right? So current and future code has to be very careful to
Yes
> call the right helpers on the right niovs.
>
> At the very least there needs to be a comment above all these
> container_of helpers:
>
> /* caller must have verified that this niov is devmem/io_zcrx */.
>
> However I feel like even a comment is extremely error prone. These
> container_of's are inside of the call stack of some helpers. I would
> say we need a check. If we're concerned about performance, the check
> can be behind DEBUG_NET_WARN_ON(), although even that is a bit iffy,
> but could be fine. Doing this without a check seems too risky to me.
No, it doesn't need a check, nor does it need a comment. The very
essence of virtual function tables is that they're coupled
together with the objects for which those functions make sense and
are called only for those objects. The only way to get here with
invalid net_iovs is to take one page pool and feed it with
net_iovs from another page pool, which wouldn't be sane in
the first place.
That would be the equivalent of:
struct file *f1 = ...;
struct file *f2 = ...;
f1->f_op->read(f2, ...);
Maybe it looks strange to you in C, but it's the same as putting
a comment on a virtual function saying it should only be called
for objects of that class:

struct A {
	virtual void foo() = 0;
};

struct B: public A {
	void foo() override {
		// we should only be called with objects of type
		// struct B (or anything inheriting it), check that
		if (!reinterpret_cast<struct B*>(this))
			throw;
		...
	}
};
>> static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
>> struct io_uring_zcrx_ifq_reg *reg)
>> {
>> @@ -99,6 +114,9 @@ static int io_zcrx_create_area(struct io_ring_ctx *ctx,
>> goto err;
>>
>> for (i = 0; i < nr_pages; i++) {
>> + struct net_iov *niov = &area->nia.niovs[i];
>> +
>> + niov->owner = &area->nia;
>> area->freelist[i] = i;
>> }
>>
>> @@ -230,3 +248,200 @@ void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
>> {
>> lockdep_assert_held(&ctx->uring_lock);
>> }
>> +
>> +static bool io_zcrx_niov_put(struct net_iov *niov, int nr)
>> +{
>> + return atomic_long_sub_and_test(nr, &niov->pp_ref_count);
>> +}
>> +
>> +static bool io_zcrx_put_niov_uref(struct net_iov *niov)
>> +{
>> + if (atomic_long_read(&niov->pp_ref_count) < IO_ZC_RX_UREF)
>> + return false;
>> +
>> + return io_zcrx_niov_put(niov, IO_ZC_RX_UREF);
>> +}
>> +
>
> Sorry, I have to push back a bit against this. The refcounting of
> netmem is already complicated. the paged netmem has 2 refcounts and
> care needs to be taken when acquiring and dropping refcounts. net_iov
> inherited the pp_ref_count but not the paged refcount, and again need
> some special handling. skb_frag_unref takes very special care checking
Which is why it's using net_iovs.
> pp->recycle, is_pp_netmem, and others to figure out the correct
pp->recycle has nothing to do with the series. We don't add
it in any special way, and if it's broken, it's broken even
for non-provider buffers.
> refcount to put based on the type of the netmem and skb flag.
Just the same as with the ->[un]readable flag, which is not
functionally needed, and if it's screwed up many things can
go very wrong.
> This code ignores all these generic code
> skb_frag_unref/napi_pp_put_page/etc paths and uses raw access to
I don't see the point; they are not used because they're not
needed. Instead of checking whether a buffer came from a page pool
and whether it's a net_iov or not, on the path where io_uring
returns it we already know a priori that it's from a specific
page pool, is a net_iov, and belongs to the current provider.
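To illustrate, with the provenance known the return path can be as
direct as below (a rough sketch only, the helper names are made up
and it's not the patch's code):

/* Hypothetical sketch: every buffer on this path is known to be a
 * net_iov from this provider's page pool, so the user reference can
 * be dropped and the buffer put straight back, without the
 * pp->recycle / is_pp_netmem checks that napi_pp_put_page() and
 * friends have to do.
 */
static void io_zcrx_return_niov(struct io_zcrx_area *area,
				struct net_iov *niov)
{
	if (!io_zcrx_put_niov_uref(niov))
		return;		/* references are still outstanding */

	/* made-up helper: put the buffer back onto the area's freelist */
	io_zcrx_area_free(area, niov);
}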
Same for the optimisations provided by those helpers: they are
useful when you're transferring buffers from one context to
another, e.g. task receive path -> napi / page_pool. In this
case they're already fetched in the right context without any
need to additionally jump through hoops. If anything, it'd be
odd to jump out of a window to climb a rope on the other side
of the building when you could've just walked 5 meters to the
other room.
> niv->pp_ref_count. If this is merged as-is, for posterity any changes
Ok, let's add a helper then
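Something along these lines, presumably, just giving the raw
access a name (the helper name below is a placeholder):

/* Placeholder name: a netmem-level wrapper so a provider doesn't
 * poke niov->pp_ref_count directly.
 */
static inline bool net_mp_niov_put(struct net_iov *niov, int nr)
{
	return atomic_long_sub_and_test(nr, &niov->pp_ref_count);
}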
> in netmem refcounting need to also account for this use case opting
> out of these generic code paths that handle all other skb reffing
> including devmem.
>
> Additionally since you're opting out of the generic unreffing paths
> you're also (as mentioned before) bypassing the pp recycling. AFAICT
> that may be hurting your performance. IIUC you refill
> PP_ALLOC_CACHE_REFILL (64) entries everytime _alloc_netmems is
> entered, and you don't recycle netmem any other way, so your slow path
> is entered 1/64 of the page_pool_alloc calls? That should be much
One virtual call per 64 buffers provides enough amortisation. The
cache size can be extended if necessary.
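Roughly the shape of it (a sketch only: the callback name is
illustrative, io_zcrx_get_refilled_niov() is a made-up stand-in for
pulling an entry off the refill ring, and the
page_pool_mp_return_in_cache() signature is assumed from patch 07):

static netmem_ref io_pp_zc_alloc_netmems(struct page_pool *pp, gfp_t gfp)
{
	int i;

	/* one provider call batches up to 64 buffers into the cache */
	for (i = 0; i < PP_ALLOC_CACHE_REFILL; i++) {
		struct net_iov *niov = io_zcrx_get_refilled_niov(pp);

		if (!niov)
			break;
		page_pool_mp_return_in_cache(pp, net_iov_to_netmem(niov));
	}
	if (!pp->alloc.count)
		return 0;
	/* hand one buffer to the caller, the rest stay cached */
	return pp->alloc.cache[--pp->alloc.count];
}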
> worse than what the normal pp recycling does, which returns all freed
> netmem into its alloc.cache or the ptr_ring and hits *_alloc_netmems
You return it from a syscall (a special sockopt); I'm pretty sure
the overhead of just that syscall, without any handling, would be
more expensive than one virtual function call. Then you need to hit
the fast cache, and that's not unconditional: it has to be lucky
enough that napi is not running or scheduled, and even then it has
to be very careful to avoid races. That's the best case for <64
entries being recycled, otherwise it's the ptr_ring and spinlocks.
Note, the normal (non-zc) recycling happens in the receive
syscall, but that's not the usual path here, and just like devmem
we have to give the buffer to the user and wait until it's
returned.
> much more rarely. There are also regular perf improvements and testing
> to the generic pool recycling paths you're also opting out of.
For performance, see above. As for testing, tests come after the
functionality, not the other way around. Why are we even adding
any zero copy interface at all when it could be good old, well
tested non-zerocopy recv(2)?
> I see a lot of downsides to opting out of the generic use cases. Is
> there any reason the normal freeing paths are not applicable to your
> use case?
>
>> +static inline void io_zc_add_pp_cache(struct page_pool *pp,
>> + struct net_iov *niov)
>> +{
>> +}
>> +
>
> Looks unused/empty.
Indeed, slipped through.
...
--
Pavel Begunkov