public inbox for io-uring@vger.kernel.org
* io_uring zero-copy send test results
@ 2025-04-05 16:58 vitalif
  2025-04-05 18:11 ` Pavel Begunkov
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: vitalif @ 2025-04-05 16:58 UTC (permalink / raw)
  To: io-uring

Hi!

We ran some io_uring send-zerocopy tests with our colleagues, using the `send-zerocopy` utility from the liburing examples (https://github.com/axboe/liburing/blob/master/examples/send-zerocopy.c).

And the results, especially with EPYC, were rather disappointing. :-(

The tests were run with `./send-zerocopy tcp -4 -R` on the server side and `time ./send-zerocopy tcp (-z 0)? (-b 0)? -4 -s 65435 -D 10.252.4.81` on the client side, where `-z 0` (disable zero-copy) and `-b 0` (disable fixed buffers) are optional and 65435 was replaced with different buffer sizes.
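
Spelled out, the client-side variants look roughly like this (a sketch; 65435 stands in for the varying -s buffer size):

  # -z1 -b1 (the defaults): zero-copy send using registered (fixed) buffers
  time ./send-zerocopy tcp -4 -s 65435 -D 10.252.4.81
  # -z1 -b0: zero-copy send using regular (unregistered) buffers
  time ./send-zerocopy tcp -b 0 -4 -s 65435 -D 10.252.4.81
  # -z0: plain copying send, again with and without -b 0
  time ./send-zerocopy tcp -z 0 -4 -s 65435 -D 10.252.4.81
  time ./send-zerocopy tcp -z 0 -b 0 -4 -s 65435 -D 10.252.4.81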

Conclusion:
- on the Xeon, zero-copy send becomes beneficial with registered buffers of at least 12 KB and with normal buffers of at least 16 KB
- the worst part is that on the EPYCs, zero-copy send is slower than non-zero-copy send in every single test... :-(

Profiling with perf shows that most of the time is spent in IOMMU-related functions.

So I have a question: are these results expected? Or do I have to tune something to get better results?

1) Xeon Gold 6330 + Mellanox ConnectX-6 DC

-b 1 (fixed buffers, default):

           4096  8192  10000  12000  16384  65435
zc MB/s    1673  2939  2926   2948   2946   2944
zc CPU     100%  80%   58%    43%    31%    14%
send MB/s  2946  2945  2949   2948   2948   2947
send CPU   80%   57%   52%    46%    44%    42%

-b 0:

           4096  8192  10000  12000  16384  65435
zc MB/s    1682  2940  2925   2934   2945   2923
zc CPU     99%   85%   71%    54%    38%    17%
send MB/s  2949  2947  2950   2945   2946   2949
send CPU   74%   55%   48%    47%    45%    39%

2) AMD EPYC GENOA 9554 + Mellanox ConnectX-5

-b 1:

           4096  8192  10000  12000  16384  65435
zc MB/s    864   1495  1646   1714   1790   2266
zc CPU     99%   93%   81%    86%    75%    57%
send MB/s  1799  2167  2265   2285   2248   2286
send CPU   90%   58%   54%    54%    52%    42%

-b 0:

           4096  8192  10000  12000  16384  65435
zc MB/s    778   1274  1476   1732   1798   2246
zc CPU     99%   89%   92%    81%    80%    54%
send MB/s  1791  2069  2239   2233   2194   2253
send CPU   88%   73%   55%    52%    59%    38%

3) AMD EPYC MILAN 7313 + Mellanox ConnectX-5

           4096  8192  10000  12000  16384  65435
zc MB/s    732   1130  1284   1247   1425   1713
zc CPU     99%   81%   82%    77%    62%    33%
send MB/s  1157  1522  1779   1720   1710   1684
send CPU   95%   58%   43%    39%    36%    27%

-- 
Vitaliy Filippov

* Re: io_uring zero-copy send test results
  2025-04-05 16:58 io_uring zero-copy send test results vitalif
@ 2025-04-05 18:11 ` Pavel Begunkov
  2025-04-05 21:46 ` vitalif
  2025-04-06 21:08 ` vitalif
  2 siblings, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2025-04-05 18:11 UTC (permalink / raw)
  To: vitalif, io-uring

On 4/5/25 17:58, vitalif@yourcmc.ru wrote:
> Hi!
> 
> We ran some io_uring send-zerocopy tests with our colleagues, using the `send-zerocopy` utility from the liburing examples (https://github.com/axboe/liburing/blob/master/examples/send-zerocopy.c).
> 
> And the results, especially with EPYC, were rather disappointing. :-(
> 
> The tests were run with `./send-zerocopy tcp -4 -R` on the server side and `time ./send-zerocopy tcp (-z 0)? (-b 0)? -4 -s 65435 -D 10.252.4.81` on the client side, where `-z 0` (disable zero-copy) and `-b 0` (disable fixed buffers) are optional and 65435 was replaced with different buffer sizes.

fwiw, -z1 -b1 is the default, i.e. zc and fixed buffers

> Conclusion:
> - on the Xeon, zero-copy send becomes beneficial with registered buffers of at least 12 KB and with normal buffers of at least 16 KB
> - the worst part is that on the EPYCs, zero-copy send is slower than non-zero-copy send in every single test... :-(
> 
> Profiling with perf shows that most of the time is spent in IOMMU-related functions.
> 
> So I have a question: are these results expected? Or do I have to tune something to get better results?

Sounds like another case of the IOMMU being painfully slow. The difference
is that normal sends, while copying, coalesce data into nice big contiguous
buffers, whereas zerocopy has to deal with whatever pages it's given. That's
32KB vs 4KB, and in the worst case you get 8x more frags (and skbs)
and 8x more IOMMU mappings for zerocopy.

Try huge pages and see if it helps, it's -l1 in the benchmark. I can
also take a look at adding pre-mapped buffers again.
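
Something like this should be enough to reserve huge pages for a -l1 run
(a sketch; assumes 2MB huge pages and the client command from your mail):

  # reserve a few 2MB huge pages and verify they were allocated
  echo 64 | sudo tee /proc/sys/vm/nr_hugepages
  grep HugePages_ /proc/meminfo
  # same client as before, but with huge-page-backed buffers
  time ./send-zerocopy tcp -l1 -4 -s 65435 -D 10.252.4.81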

Perf profiles would also be useful to have if you can grab and post
them.

-- 
Pavel Begunkov


* Re: io_uring zero-copy send test results
  2025-04-05 16:58 io_uring zero-copy send test results vitalif
  2025-04-05 18:11 ` Pavel Begunkov
@ 2025-04-05 21:46 ` vitalif
  2025-04-06 21:54   ` Pavel Begunkov
  2025-04-06 21:08 ` vitalif
  2 siblings, 1 reply; 9+ messages in thread
From: vitalif @ 2025-04-05 21:46 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

> fwiw, -z1 -b1 is the default, i.e. zc and fixed buffers

Yes, I know. :-) That's why I re-ran the tests with -b 0 the second time.

> Sounds like another case of the IOMMU being painfully slow. The difference
> is that normal sends, while copying, coalesce data into nice big contiguous
> buffers, whereas zerocopy has to deal with whatever pages it's given. That's
> 32KB vs 4KB, and in the worst case you get 8x more frags (and skbs)
> and 8x more IOMMU mappings for zerocopy.

The problem is that on EPYC it's slow even with 64k buffers. Being slow is rather expected with 4k buffers, but 64k...

> Try huge pages and see if it helps, it's -l1 in the benchmark. I can
> also take a look at adding pre-mapped buffers again.
> 
> Perf profiles would also be useful to have if you can grab and post
> them.

I.e. flamegraphs?

* Re: io_uring zero-copy send test results
  2025-04-05 16:58 io_uring zero-copy send test results vitalif
  2025-04-05 18:11 ` Pavel Begunkov
  2025-04-05 21:46 ` vitalif
@ 2025-04-06 21:08 ` vitalif
  2025-04-06 22:01   ` Pavel Begunkov
  2025-04-08 12:43   ` vitalif
  2 siblings, 2 replies; 9+ messages in thread
From: vitalif @ 2025-04-06 21:08 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

Hi again!

More interesting data for you. :-)

We tried iommu=pt.

1)

Xeon E5-2650 v2 + Intel XL710 40Gbit/s

           4096  8192  10000  12000  16384  65435
zc MB/s    721   1191  1281   1665   1752   2255
zc CPU     95%   99%   99%    99%    99%    98%
send MB/s  2229  2555  2704   2642   2756   2993
send CPU   97%   99%   98%    99%    98%    98%

Xeon E5-2650 v2 + Intel XL710 40Gbit/s, iommu=pt

           4096  8192  10000  12000  16384  32768  65435
zc MB/s    1130  1893  2222   2503   2994   3855   3717
zc CPU     99%   99%   99%    89%    94%    71%    48%
send MB/s  2903  3620  3602   3346   3658   3855   3514
send CPU   98%   89%   96%    99%    89%    82%    74%

Much, much better, and it makes zero-copy beneficial for >= 32 KB buffers. IOMMU-related functions completely disappear from the perf profile with iommu=pt.

2)

Xeon Gold 6342 + Mellanox ConnectX-6 Dx

           4096  8192   10000  12000  16384  32768  65435
zc MB/s    2060  2950   2927   2934   2945   2945   2947
zc CPU     99%   62%    59%    29%    22%    23%    11%
send MB/s  2950  2949   2950   2950   2949   2949   2949
send CPU   64%   44%    50%    46%    51%    49%    45%

Xeon Gold 6342 + Mellanox ConnectX-6 Dx + iommu=pt

           4096  8192   10000  12000  16384  32768  65435
zc MB/s    2165  2277   2790   2802   2871   2945   2944
zc CPU     99%   89%    75%    65%    53%    34%    36%
send MB/s  2902  2912   2945   2943   2927   2935   2941
send CPU   80%   63%    55%    64%    78%    68%    65%

Here, switching the IOMMU to passthrough actually makes things worse: CPU usage increases in all tests. The default mode is optimal.

3)

AMD EPYC Genoa 9554 + Mellanox CX-5

           4096  8192  10000  12000  16384  65435
zc MB/s    864   1495  1646   1714   1790   2266
zc CPU     99%   93%   81%    86%    75%    57%
send MB/s  1799  2167  2265   2285   2248   2286
send CPU   90%   58%   54%    54%    52%    42%

AMD EPYC Genoa 9554 + Mellanox CX-5 + iommu=pt

           4096  8192  10000  12000  16384  65435
zc MB/s    794   1191  1361   1762   1850   2125
zc CPU     99%   84%   84%    99%    82%    60%
send MB/s  2007  2238  2255   2291   2229   2218
send CPU   86%   65%   55%    55%    50%    40%

AMD EPYC Genoa 9554 + Mellanox CX-5 + iommu=pt + hugepages (-l1)

           4096  8192  10000  12000  16384  65435
zc MB/s    804   1539  1718   1749   1666   2310
zc CPU     99%   95%   89%    87%    65%    33%
send MB/s  1763  2262  2323   2296   2235   2285
send CPU   91%   63%   61%    55%    50%    41%

So here zero-copy is only slightly better, and only in one test: with huge pages and the maximum buffer size.

The flamegraph is attached (zc_amd_iommu_pt.png); it really doesn't include anything IOMMU-related.

-- 
Vitaliy Filippov

[-- Attachment #2: zc_amd_iommu_pt.png --]
[-- Type: image/png, Size: 387534 bytes --]

* Re: io_uring zero-copy send test results
  2025-04-05 21:46 ` vitalif
@ 2025-04-06 21:54   ` Pavel Begunkov
  0 siblings, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2025-04-06 21:54 UTC (permalink / raw)
  To: vitalif, io-uring

On 4/5/25 22:46, vitalif@yourcmc.ru wrote:
...
>> Perf profiles would also be useful to have if you can grab and post
>> them.
> 
> I.e. flamegraphs?

Doesn't matter; flamegraphs, perf report, or raw perf script output are all
fine, but if you do a visualisation, I'd appreciate it if it's
interactive, i.e. svg rather than png, as the latter loses details.
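
For example, something along these lines would do (a sketch; assumes the
FlameGraph scripts from https://github.com/brendangregg/FlameGraph):

  # system-wide profile with call graphs while the benchmark is running
  perf record -a -g -- sleep 10
  perf report --stdio > perf-report.txt   # or just post the raw 'perf script' output
  perf script > out.perf
  ./stackcollapse-perf.pl out.perf | ./flamegraph.pl > profile.svg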

-- 
Pavel Begunkov


* Re: io_uring zero-copy send test results
  2025-04-06 21:08 ` vitalif
@ 2025-04-06 22:01   ` Pavel Begunkov
  2025-04-08 12:43   ` vitalif
  1 sibling, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2025-04-06 22:01 UTC (permalink / raw)
  To: vitalif, io-uring

On 4/6/25 22:08, vitalif@yourcmc.ru wrote:
> Hi again!
> 
> More interesting data for you. :-)
> 
> We tried iommu=pt.

What kernel version do you use? I'm specifically interested in whether it has:

6fe4220912d19 ("io_uring/notif: implement notification stacking")

That would explain why it's slow even with huge pages.
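
One quick way to check (a sketch, assuming a mainline kernel git checkout):

  uname -r                                  # kernel running on the test machine
  # in the kernel source tree: which release first contains the commit
  git describe --contains 6fe4220912d19
  # or check whether it is an ancestor of the tree you build from
  git merge-base --is-ancestor 6fe4220912d19 HEAD && echo present || echo missing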

...
> Xeon Gold 6342 + Mellanox ConnectX-6 Dx
> 
>             4096  8192   10000  12000  16384  32768  65435
> zc MB/s    2060  2950   2927   2934   2945   2945   2947
> zc CPU     99%   62%    59%    29%    22%    23%    11%
> send MB/s  2950  2949   2950   2950   2949   2949   2949
> send CPU   64%   44%    50%    46%    51%    49%    45%
> 
> Xeon Gold 6342 + Mellanox ConnectX-6 Dx + iommu=pt
> 
>             4096  8192   10000  12000  16384  32768  65435
> zc MB/s    2165  2277   2790   2802   2871   2945   2944
> zc CPU     99%   89%    75%    65%    53%    34%    36%
> send MB/s  2902  2912   2945   2943   2927   2935   2941
> send CPU   80%   63%    55%    64%    78%    68%    65%
> 
> Here, switching the IOMMU to passthrough actually makes things worse: CPU usage increases in all tests. The default mode is optimal.

That doesn't make sense. Do you see anything odd in the profile?

-- 
Pavel Begunkov


* Re: io_uring zero-copy send test results
  2025-04-06 21:08 ` vitalif
  2025-04-06 22:01   ` Pavel Begunkov
@ 2025-04-08 12:43   ` vitalif
  2025-04-09  9:24     ` Pavel Begunkov
  2025-04-18  8:50     ` vitalif
  1 sibling, 2 replies; 9+ messages in thread
From: vitalif @ 2025-04-08 12:43 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

> What kernel version do you use? I'm specifically interested in whether it has:
> 
> 6fe4220912d19 ("io_uring/notif: implement notification stacking")
> 
> That would explain why it's slow even with huge pages.

It was Linux 6.8.12-4-pve (proxmox), so yeah, it didn't include that commit.

We repeated tests with Linux 6.11 also from proxmox:

AMD EPYC Genoa 9554 + Mellanox CX-5, iommu=pt, Linux 6.11

           4096  8192  10000  12000  16384  65435
zc MB/s    2288  2422  2149   2396   2506   2476
zc CPU     90%   67%   56%    56%    57%    44%
send MB/s  1685  2033  2389   2343   2281   2415
send CPU   95%   87%   49%    48%    62%    38%

AMD EPYC Genoa 9554 + Mellanox CX-5, iommu=pt, -l1, Linux 6.11

           4096  8192  10000  12000  16384  65435
zc MB/s    2359  2509  2351   2508   2384   2424
zc CPU     85%   58%   52%    45%    37%    18%
send MB/s  1503  1892  2325   2447   2434   2440
send CPU   99%   96%   50%    49%    57%    37%

Now it's nice and quick even without huge pages and even with 4k buffers!

> That doesn't make sense. Do you see anything odd in the profile?

Didn't have time to repeat tests with perf on those servers yet, but I can check dmesg logs. In the default iommu mode, /sys/class/iommu is empty and dmesg includes the following lines:

DMAR-IR: IOAPIC id 8 under DRHD base  0x9b7fc000 IOMMU 9
iommu: Default domain type: Translated 
iommu: DMA domain TLB invalidation policy: lazy mode 

With iommu=pt, dmesg has:

DMAR-IR: IOAPIC id 8 under DRHD base  0x9b7fc000 IOMMU 9
iommu: Default domain type: Passthrough (set via kernel command line)
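
For reference, the checks boil down to roughly this (a sketch; the per-group type attribute may not exist on older kernels):

  cat /proc/cmdline                         # confirm whether iommu=pt is set
  dmesg | grep -i -e DMAR -e iommu          # the lines quoted above
  ls /sys/class/iommu/                      # empty here in the default mode
  cat /sys/kernel/iommu_groups/*/type 2>/dev/null | sort | uniq -c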

* Re: io_uring zero-copy send test results
  2025-04-08 12:43   ` vitalif
@ 2025-04-09  9:24     ` Pavel Begunkov
  2025-04-18  8:50     ` vitalif
  1 sibling, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2025-04-09  9:24 UTC (permalink / raw)
  To: vitalif, io-uring

On 4/8/25 13:43, vitalif@yourcmc.ru wrote:
>> What kernel version do you use? I'm specifically interested in whether it has:
>>
>> 6fe4220912d19 ("io_uring/notif: implement notification stacking")
>>
>> That would explain why it's slow even with huge pages.
> 
> It was Linux 6.8.12-4-pve (proxmox), so yeah, it didn't include that commit.
> 
> We repeated tests with Linux 6.11 also from proxmox:
> 
> AMD EPYC Genoa 9554 + Mellanox CX-5, iommu=pt, Linux 6.11
> 
>            4096  8192  10000  12000  16384  65435
> zc MB/s    2288  2422  2149   2396   2506   2476
> zc CPU     90%   67%   56%    56%    57%    44%
> send MB/s  1685  2033  2389   2343   2281   2415
> send CPU   95%   87%   49%    48%    62%    38%
> 
> AMD EPYC Genoa 9554 + Mellanox CX-5, iommu=pt, -l1, Linux 6.11
> 
>            4096  8192  10000  12000  16384  65435
> zc MB/s    2359  2509  2351   2508   2384   2424
> zc CPU     85%   58%   52%    45%    37%    18%
> send MB/s  1503  1892  2325   2447   2434   2440
> send CPU   99%   96%   50%    49%    57%    37%
> 
> Now it's nice and quick even without huge pages and even with 4k buffers!

Nice! Is ~2400 MB/s a hardware bottleneck? Seems like the t-put
converges to that, while I'd expect the gap to widen as we increase
the size to 64K.

>> That doesn't make sense. Do you see anything odd in the profile?
> 
> Didn't have time to repeat tests with perf on those servers yet, but I can check dmesg logs. In the default iommu mode, /sys/class/iommu is empty and dmesg includes the following lines:
> 
> DMAR-IR: IOAPIC id 8 under DRHD base  0x9b7fc000 IOMMU 9
> iommu: Default domain type: Translated
> iommu: DMA domain TLB invalidation policy: lazy mode
> 
> With iommu=pt, dmesg has:
> 
> DMAR-IR: IOAPIC id 8 under DRHD base  0x9b7fc000 IOMMU 9
> iommu: Default domain type: Passthrough (set via kernel command line)

-- 
Pavel Begunkov


* Re: io_uring zero-copy send test results
  2025-04-08 12:43   ` vitalif
  2025-04-09  9:24     ` Pavel Begunkov
@ 2025-04-18  8:50     ` vitalif
  1 sibling, 0 replies; 9+ messages in thread
From: vitalif @ 2025-04-18  8:50 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

Hi,

> Nice! Is ~2400 MB/s a hardware bottleneck? Seems like the t-put
> converges to that, while I'd expect the gap to widen as we increase
> the size to 64K.

I'm not sure; it doesn't seem so. These servers are connected with a 2x100G bonded link, and iperf -P8 (8 threads) shows 100 Gbit/s...
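
(The link check was essentially this, assuming iperf3; classic iperf takes the same -P option:)

  iperf3 -s                                 # on the receiving server
  iperf3 -c 10.252.4.81 -P 8 -t 30          # 8 parallel streams from the sender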

>> Didn't have time to repeat tests with perf on those servers yet, but I can check dmesg logs. In the
>> default iommu mode, /sys/class/iommu is empty and dmesg includes the following lines:
>> DMAR-IR: IOAPIC id 8 under DRHD base 0x9b7fc000 IOMMU 9
>> iommu: Default domain type: Translated
>> iommu: DMA domain TLB invalidation policy: lazy mode
>> With iommu=pt, dmesg has:
>> DMAR-IR: IOAPIC id 8 under DRHD base 0x9b7fc000 IOMMU 9
>> iommu: Default domain type: Passthrough (set via kernel command line)

You're probably right, it was just a random fluke. I repeated the test again with iommu=pt and without it, and the new results are very close to each other; the iommu=pt result is even slightly better. So never mind :)

Xeon Gold 6342 + Mellanox ConnectX-6 Dx + iommu=pt, run 2

           4096  8192  10000  12000  16384  32768  65435
zc MB/s    2681  2947  2927   2935   2949   2947   2945
zc CPU     99%   66%   41%    48%    22%    13%    11%
send MB/s  2950  2951  2950   2950   2949   2950   2949
send CPU   48%   35%   31%    34%    28%    29%    25%

Xeon Gold 6342 + Mellanox ConnectX-6 Dx, run 2

           4096  8192  10000  12000  16384  32768  65435
zc MB/s    2262  2948  2925   2935   2946   2947   2947
zc CPU     99%   52%   60%    44%    24%    17%    17%
send MB/s  2950  2949  2950   2950   2950   2950   2950
send CPU   48%   38%   36%    31%    33%    26%    29%
