* io_uring zero-copy send test results
@ 2025-04-05 16:58 vitalif
From: vitalif @ 2025-04-05 16:58 UTC (permalink / raw)
To: io-uring
Hi!
My colleagues and I ran some io_uring send-zerocopy tests using the `send-zerocopy` utility from the liburing examples (https://github.com/axboe/liburing/blob/master/examples/send-zerocopy.c).
And the results, especially on EPYC, were rather disappointing. :-(
The tests were run with `./send-zerocopy tcp -4 -R` on the server side and `time ./send-zerocopy tcp (-z 0)? (-b 0)? -4 -s 65435 -D 10.252.4.81` on the client side, where 65435 was replaced with the buffer sizes listed below.
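For reference, the full run matrix looked roughly like this (a sketch; the shell loop is only for illustration, the flags themselves are the ones from the actual runs):

  # server side: receive mode
  ./send-zerocopy tcp -4 -R

  # client side: every buffer size, with/without zerocopy (-z) and fixed buffers (-b)
  for size in 4096 8192 10000 12000 16384 65435; do
      for z in 1 0; do
          for b in 1 0; do
              time ./send-zerocopy tcp -z $z -b $b -4 -s $size -D 10.252.4.81
          done
      done
  done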
Conclusion:
- zerocopy send is beneficial on Xeon with at least 12 KB registered buffers and at least 16 KB normal buffers
- the worst thing is that on EPYCs, zerocopy send is slower than non-zerocopy in every single test... :-(
Profiling with perf shows that most of the time is spent in iommu-related functions.
So I have a question: are these results expected? Or do I have to tune something to get better results?
1) Xeon Gold 6330 + Mellanox ConnectX-6 DC
-b 1 (fixed buffers, default):

buf size      4096    8192   10000   12000   16384   65435
zc MB/s       1673    2939    2926    2948    2946    2944
zc CPU        100%     80%     58%     43%     31%     14%
send MB/s     2946    2945    2949    2948    2948    2947
send CPU       80%     57%     52%     46%     44%     42%

-b 0:

buf size      4096    8192   10000   12000   16384   65435
zc MB/s       1682    2940    2925    2934    2945    2923
zc CPU         99%     85%     71%     54%     38%     17%
send MB/s     2949    2947    2950    2945    2946    2949
send CPU       74%     55%     48%     47%     45%     39%
2) AMD EPYC GENOA 9554 + Mellanox ConnectX-5
-b 1:

buf size      4096    8192   10000   12000   16384   65435
zc MB/s        864    1495    1646    1714    1790    2266
zc CPU         99%     93%     81%     86%     75%     57%
send MB/s     1799    2167    2265    2285    2248    2286
send CPU       90%     58%     54%     54%     52%     42%

-b 0:

buf size      4096    8192   10000   12000   16384   65435
zc MB/s        778    1274    1476    1732    1798    2246
zc CPU         99%     89%     92%     81%     80%     54%
send MB/s     1791    2069    2239    2233    2194    2253
send CPU       88%     73%     55%     52%     59%     38%
3) AMD EPYC MILAN 7313 + Mellanox ConnectX-5
buf size      4096    8192   10000   12000   16384   65435
zc MB/s        732    1130    1284    1247    1425    1713
zc CPU         99%     81%     82%     77%     62%     33%
send MB/s     1157    1522    1779    1720    1710    1684
send CPU       95%     58%     43%     39%     36%     27%
--
Vitaliy Filippov
* Re: io_uring zero-copy send test results
From: Pavel Begunkov @ 2025-04-05 18:11 UTC (permalink / raw)
To: vitalif, io-uring
On 4/5/25 17:58, vitalif@yourcmc.ru wrote:
> Hi!
>
> We ran some io_uring send-zerocopy tests with our colleagues by using the `send-zerocopy` utility from liburing examples (https://github.com/axboe/liburing/blob/master/examples/send-zerocopy.c).
>
> And the results, especially with EPYC, were rather disappointing. :-(
>
> The tests were run using `./send-zerocopy tcp -4 -R` at the server side and `time ./send-zerocopy tcp (-z 0)? (-b 0)? -4 -s 65435 -D 10.252.4.81` at the client side. 65435 was replaced by different buffer sizes.
fwiw, -z1 -b1 is the default, i.e. zc and fixed buffers
> Conclusion:
> - zerocopy send is beneficial for Xeon with at least 12 kb registered buffers and at least 16 kb normal buffers
> - worst thing is that with EPYCs, zerocopy send is slower than non-zerocopy in every single test... :-(
>
> Profiling with perf shows that it spends most time in iommu related functions.
>
> So I have a question: are these results expected? Or do I have to tune something to get better results?
Sounds like another case of iommu being painfully slow. The difference
is that normal sends, while copying, coalesce data into nice big contiguous
buffers, whereas zerocopy has to deal with whatever pages it's given. That's
32KB vs 4KB, and in the worst-case scenario you get 8x more frags (and skbs)
and 8x more iommu mappings for zerocopy.
Try huge pages and see if it helps; it's -l1 in the benchmark. I can
also take a look at adding pre-mapped buffers again.
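E.g. something like this, i.e. your client command with -l1 added on top of the flags you already use (just an example):

  time ./send-zerocopy tcp -4 -l1 -s 65435 -D 10.252.4.81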
Perf profiles would also be useful to have if you can grab and post
them.
--
Pavel Begunkov
* Re: io_uring zero-copy send test results
From: vitalif @ 2025-04-05 21:46 UTC (permalink / raw)
To: Pavel Begunkov, io-uring
> fwiw, -z1 -b1 is the default, i.e. zc and fixed buffers
Yes, I know. :-) That's why I re-ran the tests with -b 0 the second time.
> Sounds like another case of iommu being painfully slow. The difference
> is that normal sends, while copying, coalesce data into nice big contiguous
> buffers, whereas zerocopy has to deal with whatever pages it's given. That's
> 32KB vs 4KB, and in the worst-case scenario you get 8x more frags (and skbs)
> and 8x more iommu mappings for zerocopy.
The problem is that on EPYC it's slow even with 64k buffers. Being slow is rather expected with 4k buffers, but 64k...
> Try huge pages and see if it helps, it's -l1 in the benchmark. I can
> also take a look at adding pre-mapped buffers again.
>
> Perf profiles would also be useful to have if you can grab and post
> them.
I.e. flamegraphs?
* Re: io_uring zero-copy send test results
From: Pavel Begunkov @ 2025-04-06 21:54 UTC (permalink / raw)
To: vitalif, io-uring
On 4/5/25 22:46, vitalif@yourcmc.ru wrote:
...
>> Perf profiles would also be useful to have if you can grab and post
>> them.
>
> I.e. flamegraphs?
Doesn't matter: flamegraphs, perf report, or a raw perf script dump are
all fine, but if you do a visualisation, I'd appreciate it if it's
interactive, i.e. svg rather than png, as the latter loses details.
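A rough sketch of what would work (assumes the FlameGraph scripts from https://github.com/brendangregg/FlameGraph; the sampling rate and duration are just examples):

  # sample the sending process while the benchmark is running
  perf record -g -F 99 -p "$(pgrep -n -f send-zerocopy)" -- sleep 10
  # the raw dump is fine as-is
  perf script > send-zerocopy.perf
  # or turn it into an interactive svg flamegraph
  ./stackcollapse-perf.pl send-zerocopy.perf | ./flamegraph.pl > send-zerocopy.svg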
--
Pavel Begunkov
* Re: io_uring zero-copy send test results
From: vitalif @ 2025-04-06 21:08 UTC (permalink / raw)
To: Pavel Begunkov, io-uring
Hi again!
More interesting data for you. :-)
We tried iommu=pt.
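(For anyone reproducing this: iommu=pt is just a kernel command-line option; on a GRUB-based system it's roughly the following, the exact file and update command may differ:)

  # /etc/default/grub
  GRUB_CMDLINE_LINUX_DEFAULT="... iommu=pt"
  # then: update-grub && reboot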
1)
Xeon E5-2650 v2 + Intel XL710 40Gbit/s
buf size      4096    8192   10000   12000   16384   65435
zc MB/s        721    1191    1281    1665    1752    2255
zc CPU         95%     99%     99%     99%     99%     98%
send MB/s     2229    2555    2704    2642    2756    2993
send CPU       97%     99%     98%     99%     98%     98%

Xeon E5-2650 v2 + Intel XL710 40Gbit/s, iommu=pt

buf size      4096    8192   10000   12000   16384   32768   65435
zc MB/s       1130    1893    2222    2503    2994    3855    3717
zc CPU         99%     99%     99%     89%     94%     71%     48%
send MB/s     2903    3620    3602    3346    3658    3855    3514
send CPU       98%     89%     96%     99%     89%     82%     74%
Much, much better, and it makes zero-copy beneficial for >= 32 KB buffers. With iommu=pt, iommu-related functions completely disappear from the perf profile.
2)
Xeon Gold 6342 + Mellanox ConnectX-6 Dx
buf size      4096    8192   10000   12000   16384   32768   65435
zc MB/s       2060    2950    2927    2934    2945    2945    2947
zc CPU         99%     62%     59%     29%     22%     23%     11%
send MB/s     2950    2949    2950    2950    2949    2949    2949
send CPU       64%     44%     50%     46%     51%     49%     45%

Xeon Gold 6342 + Mellanox ConnectX-6 Dx + iommu=pt

buf size      4096    8192   10000   12000   16384   32768   65435
zc MB/s       2165    2277    2790    2802    2871    2945    2944
zc CPU         99%     89%     75%     65%     53%     34%     36%
send MB/s     2902    2912    2945    2943    2927    2935    2941
send CPU       80%     63%     55%     64%     78%     68%     65%
Here, iommu=pt actually makes things worse: CPU usage increases in all tests. The default mode is optimal.
3)
AMD EPYC Genoa 9554 + Mellanox CX-5
buf size      4096    8192   10000   12000   16384   65435
zc MB/s        864    1495    1646    1714    1790    2266
zc CPU         99%     93%     81%     86%     75%     57%
send MB/s     1799    2167    2265    2285    2248    2286
send CPU       90%     58%     54%     54%     52%     42%

AMD EPYC Genoa 9554 + Mellanox CX-5 + iommu=pt

buf size      4096    8192   10000   12000   16384   65435
zc MB/s        794    1191    1361    1762    1850    2125
zc CPU         99%     84%     84%     99%     82%     60%
send MB/s     2007    2238    2255    2291    2229    2218
send CPU       86%     65%     55%     55%     50%     40%

AMD EPYC Genoa 9554 + Mellanox CX-5 + iommu=pt + hugepages (-l1)

buf size      4096    8192   10000   12000   16384   65435
zc MB/s        804    1539    1718    1749    1666    2310
zc CPU         99%     95%     89%     87%     65%     33%
send MB/s     1763    2262    2323    2296    2235    2285
send CPU       91%     63%     61%     55%     50%     41%
So here zerocopy is only slightly better, and in just one test: with huge pages and the maximum buffer size.
The flamegraph is in the attachment; it really doesn't include any iommu-related functions.
--
Vitaliy Filippov
[-- Attachment #2: zc_amd_iommu_pt.png --]
[-- Type: image/png, Size: 387534 bytes --]
* Re: io_uring zero-copy send test results
From: Pavel Begunkov @ 2025-04-06 22:01 UTC (permalink / raw)
To: vitalif, io-uring
On 4/6/25 22:08, vitalif@yourcmc.ru wrote:
> Hi again!
>
> More interesting data for you. :-)
>
> We tried iommu=pt.
What kernel version do you use? I'm specifically interested in whether it has:
6fe4220912d19 ("io_uring/notif: implement notification stacking")
That would explain why it's slow even with huge pages.
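(If you have a kernel tree handy, something like this shows whether your kernel is new enough; where you keep the clone is up to you:)

  uname -r
  # in a mainline kernel clone, prints the first tag containing the commit:
  git describe --contains 6fe4220912d19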
...
> Xeon Gold 6342 + Mellanox ConnectX-6 Dx
>
> buf size      4096    8192   10000   12000   16384   32768   65435
> zc MB/s       2060    2950    2927    2934    2945    2945    2947
> zc CPU         99%     62%     59%     29%     22%     23%     11%
> send MB/s     2950    2949    2950    2950    2949    2949    2949
> send CPU       64%     44%     50%     46%     51%     49%     45%
>
> Xeon Gold 6342 + Mellanox ConnectX-6 Dx + iommu=pt
>
> buf size      4096    8192   10000   12000   16384   32768   65435
> zc MB/s       2165    2277    2790    2802    2871    2945    2944
> zc CPU         99%     89%     75%     65%     53%     34%     36%
> send MB/s     2902    2912    2945    2943    2927    2935    2941
> send CPU       80%     63%     55%     64%     78%     68%     65%
>
> Here, disabling iommu actually makes things worse - CPU usage increases in all tests. The default mode is optimal.
That doesn't make sense. Do you see anything odd in the profile?
--
Pavel Begunkov
* Re: io_uring zero-copy send test results
From: vitalif @ 2025-04-08 12:43 UTC (permalink / raw)
To: Pavel Begunkov, io-uring
> What kernel version do you use? I'm specifically interested in whether it has:
>
> 6fe4220912d19 ("io_uring/notif: implement notification stacking")
>
> That would explain why it's slow even with huge pages.
It was Linux 6.8.12-4-pve (Proxmox), so yeah, it didn't include that commit.
We repeated the tests with Linux 6.11, also from Proxmox:
AMD EPYC GENOA 9554 MELLANOX CX-5, iommu=pt, Linux 6.11
buf size      4096    8192   10000   12000   16384   65435
zc MB/s       2288    2422    2149    2396    2506    2476
zc CPU         90%     67%     56%     56%     57%     44%
send MB/s     1685    2033    2389    2343    2281    2415
send CPU       95%     87%     49%     48%     62%     38%

AMD EPYC GENOA 9554 MELLANOX CX-5, iommu=pt, -l1, Linux 6.11

buf size      4096    8192   10000   12000   16384   65435
zc MB/s       2359    2509    2351    2508    2384    2424
zc CPU         85%     58%     52%     45%     37%     18%
send MB/s     1503    1892    2325    2447    2434    2440
send CPU       99%     96%     50%     49%     57%     37%
Now it's nice and quick even without huge pages and even with 4k buffers!
> That doesn't make sense. Do you see anything odd in the profile?
I didn't have time to repeat the tests with perf on those servers yet, but I can check the dmesg logs. In the default iommu mode, /sys/class/iommu is empty and dmesg includes the following lines:
DMAR-IR: IOAPIC id 8 under DRHD base 0x9b7fc000 IOMMU 9
iommu: Default domain type: Translated
iommu: DMA domain TLB invalidation policy: lazy mode
With iommu=pt, dmesg has:
DMAR-IR: IOAPIC id 8 under DRHD base 0x9b7fc000 IOMMU 9
iommu: Default domain type: Passthrough (set via kernel command line)
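(What I checked, roughly; the per-group sysfs path may not exist on every kernel version:)

  dmesg | grep -i -e DMAR -e iommu
  ls /sys/class/iommu
  # per-group default domain type, on kernels that expose it:
  cat /sys/kernel/iommu_groups/*/type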
* Re: io_uring zero-copy send test results
From: Pavel Begunkov @ 2025-04-09 9:24 UTC (permalink / raw)
To: vitalif, io-uring
On 4/8/25 13:43, vitalif@yourcmc.ru wrote:
>> What kernel version do you use? I'm specifically interested in whether it has:
>>
>> 6fe4220912d19 ("io_uring/notif: implement notification stacking")
>>
>> That would explain why it's slow even with huge pages.
>
> It was Linux 6.8.12-4-pve (proxmox), so yeah, it didn't include that commit.
>
> We repeated tests with Linux 6.11 also from proxmox:
>
> AMD EPYC GENOA 9554 MELLANOX CX-5, iommu=pt, Linux 6.11
>
> buf size      4096    8192   10000   12000   16384   65435
> zc MB/s       2288    2422    2149    2396    2506    2476
> zc CPU         90%     67%     56%     56%     57%     44%
> send MB/s     1685    2033    2389    2343    2281    2415
> send CPU       95%     87%     49%     48%     62%     38%
>
> AMD EPYC GENOA 9554 MELLANOX CX-5, iommu=pt, -l1, Linux 6.11
>
> buf size      4096    8192   10000   12000   16384   65435
> zc MB/s       2359    2509    2351    2508    2384    2424
> zc CPU         85%     58%     52%     45%     37%     18%
> send MB/s     1503    1892    2325    2447    2434    2440
> send CPU       99%     96%     50%     49%     57%     37%
>
> Now it's nice and quick even without huge pages and even with 4k buffers!
Nice! Is ~2400 MB/s a hardware bottleneck? It seems like the throughput
converges to that, while I'd expect the gap to widen as we increase
the size to 64K.
>> That doesn't make sense. Do you see anything odd in the profile?
>
> Didn't have time to repeat tests with perf on those servers yet, but I can check dmesg logs. In the default iommu mode, /sys/class/iommu is empty and dmesg includes the following lines:
>
> DMAR-IR: IOAPIC id 8 under DRHD base 0x9b7fc000 IOMMU 9
> iommu: Default domain type: Translated
> iommu: DMA domain TLB invalidation policy: lazy mode
>
> With iommu=pt, dmesg has:
>
> DMAR-IR: IOAPIC id 8 under DRHD base 0x9b7fc000 IOMMU 9
> iommu: Default domain type: Passthrough (set via kernel command line)
--
Pavel Begunkov
* Re: io_uring zero-copy send test results
From: vitalif @ 2025-04-18 8:50 UTC (permalink / raw)
To: Pavel Begunkov, io-uring
Hi,
> Nice! Is ~2400 MB/s a hardware bottleneck? Seems like the t-put
> converges to that, while I'd expect the gap to widen as we increase
> the size to 64K.
I'm not sure; it doesn't seem so. These servers are connected with a 2x100G bonded link, and iperf -P8 (8 threads) shows 100 Gbit/s...
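(The iperf check was essentially this, assuming iperf3 and the same pair of hosts:)

  # server
  iperf3 -s
  # client, 8 parallel streams
  iperf3 -c 10.252.4.81 -P 8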
>> Didn't have time to repeat tests with perf on those servers yet, but I can check dmesg logs. In the
>> default iommu mode, /sys/class/iommu is empty and dmesg includes the following lines:
>> DMAR-IR: IOAPIC id 8 under DRHD base 0x9b7fc000 IOMMU 9
>> iommu: Default domain type: Translated
>> iommu: DMA domain TLB invalidation policy: lazy mode
>> With iommu=pt, dmesg has:
>> DMAR-IR: IOAPIC id 8 under DRHD base 0x9b7fc000 IOMMU 9
>> iommu: Default domain type: Passthrough (set via kernel command line)
You're probably right; it seems it was just a random fluctuation. I repeated the test again with and without iommu=pt, and the new results are very close to each other; the iommu=pt result is even slightly better. So never mind. :)
Xeon Gold 6342 + Mellanox ConnectX-6 Dx + iommu=pt, run 2
buf size      4096    8192   10000   12000   16384   32768   65435
zc MB/s       2681    2947    2927    2935    2949    2947    2945
zc CPU         99%     66%     41%     48%     22%     13%     11%
send MB/s     2950    2951    2950    2950    2949    2950    2949
send CPU       48%     35%     31%     34%     28%     29%     25%

Xeon Gold 6342 + Mellanox ConnectX-6 Dx, run 2

buf size      4096    8192   10000   12000   16384   32768   65435
zc MB/s       2262    2948    2925    2935    2946    2947    2947
zc CPU         99%     52%     60%     44%     24%     17%     17%
send MB/s     2950    2949    2950    2950    2950    2950    2950
send CPU       48%     38%     36%     31%     33%     26%     29%