Hi again!

More interesting data for you. :-)

We tried iommu=pt.
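
For anyone reproducing these runs, one quick way to confirm that a machine was really booted with iommu=pt is to look for that token on the kernel command line. A minimal Python sketch, assuming the option was passed as a boot parameter and is therefore visible in /proc/cmdline; the helper name has_iommu_pt is made up for illustration:

```python
from pathlib import Path


def has_iommu_pt(cmdline: str) -> bool:
    """Return True if the kernel command line contains the exact token iommu=pt."""
    # Parameters are whitespace-separated; match the whole token so that
    # e.g. intel_iommu=on is not mistaken for iommu=pt.
    return "iommu=pt" in cmdline.split()


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with.
    print(has_iommu_pt(Path("/proc/cmdline").read_text()))
```

This only checks the boot parameter; it says nothing about whether the platform firmware has the IOMMU enabled at all.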

1)

Xeon E5-2650 v2 + Intel XL710 40Gbit/s

           4096  8192  10000  12000  16384  65435
zc MB/s    721   1191  1281   1665   1752   2255
zc CPU     95%   99%   99%    99%    99%    98%
send MB/s  2229  2555  2704   2642   2756   2993
send CPU   97%   99%   98%    99%    98%    98%

Xeon E5-2650 v2 + Intel XL710 40Gbit/s + iommu=pt

           4096  8192  10000  12000  16384  32768  65435
zc MB/s    1130  1893  2222   2503   2994   3855   3717
zc CPU     99%   99%   99%    89%    94%    71%    48%
send MB/s  2903  3620  3602   3346   3658   3855   3514
send CPU   98%   89%   96%    99%    89%    82%    74%

Much better, and it makes zero-copy worthwhile for buffers >= 32 KB. With iommu=pt, iommu-related entries disappear from the perf profile completely.
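
One rough way to see where zero-copy starts to pay off is to divide throughput by CPU usage for each buffer size. The sketch below does that for the iommu=pt table above. Note that "MB/s per CPU percentage point" is an ad-hoc metric chosen here for illustration, not something measured in the tests, and the function names are hypothetical:

```python
# Numbers copied from the "Xeon E5-2650 v2 + Intel XL710" iommu=pt table.
SIZES = [4096, 8192, 10000, 12000, 16384, 32768, 65435]
ZC_MBS = [1130, 1893, 2222, 2503, 2994, 3855, 3717]
ZC_CPU = [99, 99, 99, 89, 94, 71, 48]
SEND_MBS = [2903, 3620, 3602, 3346, 3658, 3855, 3514]
SEND_CPU = [98, 89, 96, 99, 89, 82, 74]


def mb_per_cpu_point(mbs, cpu):
    """Throughput divided by CPU percentage (an ad-hoc efficiency metric)."""
    return [m / c for m, c in zip(mbs, cpu)]


def sizes_where_zc_wins():
    """Buffer sizes where zc moves more MB/s per CPU point than plain send."""
    zc = mb_per_cpu_point(ZC_MBS, ZC_CPU)
    send = mb_per_cpu_point(SEND_MBS, SEND_CPU)
    return [size for size, z, s in zip(SIZES, zc, send) if z > s]


if __name__ == "__main__":
    print(sizes_where_zc_wins())
```

On these numbers the list contains only the two largest sizes, 32768 and 65435, which matches the >= 32 KB observation.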

2)

Xeon Gold 6342 + Mellanox ConnectX-6 Dx

           4096  8192   10000  12000  16384  32768  65435
zc MB/s    2060  2950   2927   2934   2945   2945   2947
zc CPU     99%   62%    59%    29%    22%    23%    11%
send MB/s  2950  2949   2950   2950   2949   2949   2949
send CPU   64%   44%    50%    46%    51%    49%    45%

Xeon Gold 6342 + Mellanox ConnectX-6 Dx + iommu=pt

           4096  8192   10000  12000  16384  32768  65435
zc MB/s    2165  2277   2790   2802   2871   2945   2944
zc CPU     99%   89%    75%    65%    53%    34%    36%
send MB/s  2902  2912   2945   2943   2927   2935   2941
send CPU   80%   63%    55%    64%    78%    68%    65%

Here, switching to iommu=pt actually makes things worse - CPU usage goes up in every test except zc at 4096, which was already at 99%. The default mode is optimal.
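
The CPU comparison between the two Xeon Gold 6342 tables can be checked mechanically. A small sketch that computes the percentage-point change per buffer size and lists the cases where usage did not rise; the dictionary layout and function names are only illustrative:

```python
# CPU % copied from the two "Xeon Gold 6342 + Mellanox ConnectX-6 Dx" tables.
SIZES = [4096, 8192, 10000, 12000, 16384, 32768, 65435]
CPU_DEFAULT = {
    "zc": [99, 62, 59, 29, 22, 23, 11],
    "send": [64, 44, 50, 46, 51, 49, 45],
}
CPU_PT = {
    "zc": [99, 89, 75, 65, 53, 34, 36],
    "send": [80, 63, 55, 64, 78, 68, 65],
}


def cpu_delta(mode):
    """Percentage-point change in CPU usage going from default to iommu=pt."""
    return [pt - base for base, pt in zip(CPU_DEFAULT[mode], CPU_PT[mode])]


def not_increased():
    """(mode, size) pairs where CPU usage did not go up with iommu=pt."""
    return [
        (mode, size)
        for mode in ("zc", "send")
        for size, delta in zip(SIZES, cpu_delta(mode))
        if delta <= 0
    ]


if __name__ == "__main__":
    for mode in ("zc", "send"):
        print(mode, cpu_delta(mode))
    print(not_increased())
```

The only entry that does not increase is zc at 4096, where the CPU was already at 99% in both runs.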

3)

AMD EPYC Genoa 9554 + Mellanox CX-5

           4096  8192  10000  12000  16384  65435
zc MB/s    864   1495  1646   1714   1790   2266
zc CPU     99%   93%   81%    86%    75%    57%
send MB/s  1799  2167  2265   2285   2248   2286
send CPU   90%   58%   54%    54%    52%    42%

AMD EPYC Genoa 9554 + Mellanox CX-5 + iommu=pt

           4096  8192  10000  12000  16384  65435
zc MB/s    794   1191  1361   1762   1850   2125
zc CPU     99%   84%   84%    99%    82%    60%
send MB/s  2007  2238  2255   2291   2229   2218
send CPU   86%   65%   55%    55%    50%    40%

AMD EPYC Genoa 9554 + Mellanox CX-5 + iommu=pt + hugepages (-l1)

           4096  8192  10000  12000  16384  65435
zc MB/s    804   1539  1718   1749   1666   2310
zc CPU     99%   95%   89%    87%    65%    33%
send MB/s  1763  2262  2323   2296   2235   2285
send CPU   91%   63%   61%    55%    50%    41%

So here zerocopy comes out slightly ahead in only one test: huge pages with the maximum buffer size.

The flamegraph is attached; it really doesn't include anything iommu-related.

-- 
Vitaliy Filippov