* [PATCH liburing v1 0/2] liburing micro-optimization
@ 2023-01-06 15:42 Ammar Faizi
  2023-01-06 15:42 ` [PATCH liburing v1 1/2] nolibc: Fix bloated memset due to unexpected vectorization Ammar Faizi
  ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread

From: Ammar Faizi @ 2023-01-06 15:42 UTC
To: Jens Axboe
Cc: Ammar Faizi, Pavel Begunkov, Gilang Fachrezy, VNLX Kernel Department,
    Alviro Iskandar Setiawan, GNU/Weeb Mailing List, io-uring Mailing List

From: Ammar Faizi <[email protected]>

Hi Jens,

This series contains liburing micro-optimizations. There are two patches
in this series:

## Patch 1
- Fix bloated memset due to unexpected vectorization.
  Clang and GCC generate an insane vectorized memset() in nolibc.c.
  liburing doesn't need such a powerful memset(). Add an empty inline ASM
  to prevent the compilers from over-optimizing the memset().

## Patch 2
- Simplify `io_uring_register_file_alloc_range()` function.
  Use a struct initializer instead of memset(). It simplifies the C code
  plus effectively reduces the code size.

Signed-off-by: Ammar Faizi <[email protected]>
---
Ammar Faizi (2):
  nolibc: Fix bloated memset due to unexpected vectorization
  register: Simplify `io_uring_register_file_alloc_range()` function

 src/nolibc.c   | 9 ++++++++-
 src/register.c | 9 ++++-----
 2 files changed, 12 insertions(+), 6 deletions(-)

base-commit: c76d392035fd271980faa297334268f2cd77d774
--
Ammar Faizi
* [PATCH liburing v1 1/2] nolibc: Fix bloated memset due to unexpected vectorization
  2023-01-06 15:42 [PATCH liburing v1 0/2] liburing micro-optimization Ammar Faizi
@ 2023-01-06 15:42 ` Ammar Faizi
  2023-01-06 15:56 ` Alviro Iskandar Setiawan
  2023-01-06 15:42 ` [PATCH liburing v1 2/2] register: Simplify `io_uring_register_file_alloc_range()` function Ammar Faizi
  2023-01-06 17:08 ` [PATCH liburing v1 0/2] liburing micro-optimization Jens Axboe
  2 siblings, 1 reply; 6+ messages in thread

From: Ammar Faizi @ 2023-01-06 15:42 UTC
To: Jens Axboe
Cc: Ammar Faizi, Pavel Begunkov, Gilang Fachrezy, VNLX Kernel Department,
    Alviro Iskandar Setiawan, GNU/Weeb Mailing List, io-uring Mailing List

From: Ammar Faizi <[email protected]>

Clang and GCC generate an insane vectorized memset() in nolibc.c.
liburing doesn't need such a powerful memset(). Add an empty inline ASM
to prevent the compilers from over-optimizing the memset().

Just for comparison, see the following Assembly code (generated by
Clang).

Before this patch:

```
0000000000003a00 <__uring_memset>:
3a00: mov %rdi,%rax
3a03: test %rdx,%rdx
3a06: je 3b2c <__uring_memset+0x12c>
3a0c: cmp $0x8,%rdx
3a10: jae 3a19 <__uring_memset+0x19>
3a12: xor %ecx,%ecx
3a14: jmp 3b20 <__uring_memset+0x120>
3a19: movzbl %sil,%r8d
3a1d: cmp $0x20,%rdx
3a21: jae 3a2a <__uring_memset+0x2a>
3a23: xor %ecx,%ecx
3a25: jmp 3ae0 <__uring_memset+0xe0>
3a2a: mov %rdx,%rcx
3a2d: and $0xffffffffffffffe0,%rcx
3a31: movd %r8d,%xmm0
3a36: punpcklbw %xmm0,%xmm0
3a3a: pshuflw $0x0,%xmm0,%xmm0
3a3f: pshufd $0x0,%xmm0,%xmm0
3a44: lea -0x20(%rcx),%rdi
3a48: mov %rdi,%r10
3a4b: shr $0x5,%r10
3a4f: inc %r10
3a52: mov %r10d,%r9d
3a55: and $0x3,%r9d
3a59: cmp $0x60,%rdi
3a5d: jae 3a63 <__uring_memset+0x63>
3a5f: xor %edi,%edi
3a61: jmp 3aa9 <__uring_memset+0xa9>
3a63: and $0xfffffffffffffffc,%r10
3a67: xor %edi,%edi
3a69: nopl 0x0(%rax)
3a70: movdqu %xmm0,(%rax,%rdi,1)
3a75: movdqu %xmm0,0x10(%rax,%rdi,1)
3a7b: movdqu %xmm0,0x20(%rax,%rdi,1)
3a81: movdqu %xmm0,0x30(%rax,%rdi,1)
3a87: movdqu %xmm0,0x40(%rax,%rdi,1)
3a8d: movdqu %xmm0,0x50(%rax,%rdi,1)
3a93: movdqu %xmm0,0x60(%rax,%rdi,1)
3a99: movdqu %xmm0,0x70(%rax,%rdi,1)
3a9f: sub $0xffffffffffffff80,%rdi
3aa3: add $0xfffffffffffffffc,%r10
3aa7: jne 3a70 <__uring_memset+0x70>
3aa9: test %r9,%r9
3aac: je 3ad6 <__uring_memset+0xd6>
3aae: lea (%rdi,%rax,1),%r10
3ab2: add $0x10,%r10
3ab6: shl $0x5,%r9
3aba: xor %edi,%edi
3abc: nopl 0x0(%rax)
3ac0: movdqu %xmm0,-0x10(%r10,%rdi,1)
3ac7: movdqu %xmm0,(%r10,%rdi,1)
3acd: add $0x20,%rdi
3ad1: cmp %rdi,%r9
3ad4: jne 3ac0 <__uring_memset+0xc0>
3ad6: cmp %rdx,%rcx
3ad9: je 3b2c <__uring_memset+0x12c>
3adb: test $0x18,%dl
3ade: je 3b20 <__uring_memset+0x120>
3ae0: mov %rcx,%rdi
3ae3: mov %rdx,%rcx
3ae6: and $0xfffffffffffffff8,%rcx
3aea: movd %r8d,%xmm0
3aef: punpcklbw %xmm0,%xmm0
3af3: pshuflw $0x0,%xmm0,%xmm0
3af8: nopl 0x0(%rax,%rax,1)
3b00: movq %xmm0,(%rax,%rdi,1)
3b05: add $0x8,%rdi
3b09: cmp %rdi,%rcx
3b0c: jne 3b00 <__uring_memset+0x100>
3b0e: cmp %rdx,%rcx
3b11: je 3b2c <__uring_memset+0x12c>
3b13: data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
3b20: mov %sil,(%rax,%rcx,1)
3b24: inc %rcx
3b27: cmp %rcx,%rdx
3b2a: jne 3b20 <__uring_memset+0x120>
3b2c: ret
3b2d: nopl (%rax)
```

After this patch:

```
0000000000003424 <__uring_memset>:
3424: mov %rdi,%rax
3427: test %rdx,%rdx
342a: je 343a <__uring_memset+0x16>
342c: xor %ecx,%ecx
342e: mov %sil,(%rax,%rcx,1)
3432: inc %rcx
3435: cmp %rcx,%rdx
3438: jne 342e <__uring_memset+0xa>
343a: ret
```

Signed-off-by: Ammar Faizi <[email protected]>
---
 src/nolibc.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/src/nolibc.c b/src/nolibc.c
index 3207e33..ac81575 100644
--- a/src/nolibc.c
+++ b/src/nolibc.c
@@ -12,9 +12,16 @@ void *__uring_memset(void *s, int c, size_t n)
 	size_t i;
 	unsigned char *p = s;
 
-	for (i = 0; i < n; i++)
+	for (i = 0; i < n; i++) {
 		p[i] = (unsigned char) c;
+		/*
+		 * An empty inline ASM to avoid auto-vectorization
+		 * because it's too bloated for liburing.
+		 */
+		__asm__ volatile ("");
+	}
+
 	return s;
 }
--
Ammar Faizi
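For readers who want to try the trick outside of liburing, the technique is
simply a per-iteration optimization barrier. Below is a minimal standalone
sketch with a hypothetical function name (not the liburing symbol); compiled
with `-O2` it is expected to stay a short scalar loop like the "after"
listing above, though the exact output depends on the compiler version.

```c
#include <stddef.h>

/*
 * Byte-at-a-time fill with an optimization barrier in the loop body.
 * The empty asm statement is opaque to the compiler, so GCC and Clang
 * will not rewrite the loop with SIMD stores the way they would for a
 * plain byte loop.
 */
void *tiny_memset(void *s, int c, size_t n)
{
	unsigned char *p = s;
	size_t i;

	for (i = 0; i < n; i++) {
		p[i] = (unsigned char) c;
		__asm__ volatile ("");	/* barrier: keep this a scalar loop */
	}
	return s;
}
```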
* Re: [PATCH liburing v1 1/2] nolibc: Fix bloated memset due to unexpected vectorization
  2023-01-06 15:42 ` [PATCH liburing v1 1/2] nolibc: Fix bloated memset due to unexpected vectorization Ammar Faizi
@ 2023-01-06 15:56 ` Alviro Iskandar Setiawan
  0 siblings, 0 replies; 6+ messages in thread

From: Alviro Iskandar Setiawan @ 2023-01-06 15:56 UTC
To: Ammar Faizi
Cc: Jens Axboe, Pavel Begunkov, Gilang Fachrezy, VNLX Kernel Department,
    GNU/Weeb Mailing List, io-uring Mailing List

On Fri, Jan 6, 2023 at 10:43 PM Ammar Faizi wrote:
> Clang and GCC generate an insane vectorized memset() in nolibc.c.
> liburing doesn't need such a powerful memset(). Add an empty inline ASM
> to prevent the compilers from over-optimizing the memset().
>
> Just for comparison, see the following Assembly code (generated by
> Clang).
>
> Before this patch:
>
> ```
> 0000000000003a00 <__uring_memset>:
> 3a00: mov %rdi,%rax
> 3a03: test %rdx,%rdx
> 3a06: je 3b2c <__uring_memset+0x12c>
> 3a0c: cmp $0x8,%rdx
> 3a10: jae 3a19 <__uring_memset+0x19>
> 3a12: xor %ecx,%ecx
> 3a14: jmp 3b20 <__uring_memset+0x120>
> 3a19: movzbl %sil,%r8d
> 3a1d: cmp $0x20,%rdx
> 3a21: jae 3a2a <__uring_memset+0x2a>
> 3a23: xor %ecx,%ecx
> 3a25: jmp 3ae0 <__uring_memset+0xe0>
> 3a2a: mov %rdx,%rcx
> 3a2d: and $0xffffffffffffffe0,%rcx
> 3a31: movd %r8d,%xmm0
> 3a36: punpcklbw %xmm0,%xmm0
> 3a3a: pshuflw $0x0,%xmm0,%xmm0
> 3a3f: pshufd $0x0,%xmm0,%xmm0
> 3a44: lea -0x20(%rcx),%rdi
> 3a48: mov %rdi,%r10
> 3a4b: shr $0x5,%r10
> 3a4f: inc %r10
> 3a52: mov %r10d,%r9d
> 3a55: and $0x3,%r9d
> 3a59: cmp $0x60,%rdi
> 3a5d: jae 3a63 <__uring_memset+0x63>
> 3a5f: xor %edi,%edi
> 3a61: jmp 3aa9 <__uring_memset+0xa9>
> 3a63: and $0xfffffffffffffffc,%r10
> 3a67: xor %edi,%edi
> 3a69: nopl 0x0(%rax)
> 3a70: movdqu %xmm0,(%rax,%rdi,1)
> 3a75: movdqu %xmm0,0x10(%rax,%rdi,1)
> 3a7b: movdqu %xmm0,0x20(%rax,%rdi,1)
> 3a81: movdqu %xmm0,0x30(%rax,%rdi,1)
> 3a87: movdqu %xmm0,0x40(%rax,%rdi,1)
> 3a8d: movdqu %xmm0,0x50(%rax,%rdi,1)
> 3a93: movdqu %xmm0,0x60(%rax,%rdi,1)
> 3a99: movdqu %xmm0,0x70(%rax,%rdi,1)
> 3a9f: sub $0xffffffffffffff80,%rdi
> 3aa3: add $0xfffffffffffffffc,%r10
> 3aa7: jne 3a70 <__uring_memset+0x70>
> 3aa9: test %r9,%r9
> 3aac: je 3ad6 <__uring_memset+0xd6>
> 3aae: lea (%rdi,%rax,1),%r10
> 3ab2: add $0x10,%r10
> 3ab6: shl $0x5,%r9
> 3aba: xor %edi,%edi
> 3abc: nopl 0x0(%rax)
> 3ac0: movdqu %xmm0,-0x10(%r10,%rdi,1)
> 3ac7: movdqu %xmm0,(%r10,%rdi,1)
> 3acd: add $0x20,%rdi
> 3ad1: cmp %rdi,%r9
> 3ad4: jne 3ac0 <__uring_memset+0xc0>
> 3ad6: cmp %rdx,%rcx
> 3ad9: je 3b2c <__uring_memset+0x12c>
> 3adb: test $0x18,%dl
> 3ade: je 3b20 <__uring_memset+0x120>
> 3ae0: mov %rcx,%rdi
> 3ae3: mov %rdx,%rcx
> 3ae6: and $0xfffffffffffffff8,%rcx
> 3aea: movd %r8d,%xmm0
> 3aef: punpcklbw %xmm0,%xmm0
> 3af3: pshuflw $0x0,%xmm0,%xmm0
> 3af8: nopl 0x0(%rax,%rax,1)
> 3b00: movq %xmm0,(%rax,%rdi,1)
> 3b05: add $0x8,%rdi
> 3b09: cmp %rdi,%rcx
> 3b0c: jne 3b00 <__uring_memset+0x100>
> 3b0e: cmp %rdx,%rcx
> 3b11: je 3b2c <__uring_memset+0x12c>
> 3b13: data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
> 3b20: mov %sil,(%rax,%rcx,1)
> 3b24: inc %rcx
> 3b27: cmp %rcx,%rdx
> 3b2a: jne 3b20 <__uring_memset+0x120>
> 3b2c: ret
> 3b2d: nopl (%rax)
> ```
>
> After this patch:
>
> ```
> 0000000000003424 <__uring_memset>:
> 3424: mov %rdi,%rax
> 3427: test %rdx,%rdx
> 342a: je 343a <__uring_memset+0x16>
> 342c: xor %ecx,%ecx
> 342e: mov %sil,(%rax,%rcx,1)
> 3432: inc %rcx
> 3435: cmp %rcx,%rdx
> 3438: jne 342e <__uring_memset+0xa>
> 343a: ret
> ```
>
> Signed-off-by: Ammar Faizi <[email protected]>

Reviewed-by: Alviro Iskandar Setiawan <[email protected]>
* [PATCH liburing v1 2/2] register: Simplify `io_uring_register_file_alloc_range()` function
  2023-01-06 15:42 [PATCH liburing v1 0/2] liburing micro-optimization Ammar Faizi
  2023-01-06 15:42 ` [PATCH liburing v1 1/2] nolibc: Fix bloated memset due to unexpected vectorization Ammar Faizi
@ 2023-01-06 15:42 ` Ammar Faizi
  2023-01-06 15:59 ` Alviro Iskandar Setiawan
  2023-01-06 17:08 ` [PATCH liburing v1 0/2] liburing micro-optimization Jens Axboe
  2 siblings, 1 reply; 6+ messages in thread

From: Ammar Faizi @ 2023-01-06 15:42 UTC
To: Jens Axboe
Cc: Ammar Faizi, Pavel Begunkov, Gilang Fachrezy, VNLX Kernel Department,
    Alviro Iskandar Setiawan, GNU/Weeb Mailing List, io-uring Mailing List

From: Ammar Faizi <[email protected]>

Use a struct initializer instead of memset(). It simplifies the C code
plus effectively reduces the code size.

Extra bonus on x86-64: it reduces the stack allocation because it
doesn't need to allocate stack for the local variable @range. It can
just use the 128 bytes of redzone below `%rsp` (the redzone is only
available in a leaf function).

Before this patch:

```
0000000000003910 <io_uring_register_file_alloc_range>:
3910: push %rbp
3911: push %r15
3913: push %r14
3915: push %rbx
3916: sub $0x18,%rsp
391a: mov %edx,%r14d
391d: mov %esi,%ebp
391f: mov %rdi,%rbx
3922: lea 0x8(%rsp),%r15
3927: mov $0x10,%edx
392c: mov %r15,%rdi
392f: xor %esi,%esi
3931: call 3a00 <__uring_memset>
3936: mov %ebp,0x8(%rsp)
393a: mov %r14d,0xc(%rsp)
393f: mov 0xc4(%rbx),%edi
3945: mov $0x1ab,%eax
394a: mov $0x19,%esi
394f: mov %r15,%rdx
3952: xor %r10d,%r10d
3955: syscall
3957: add $0x18,%rsp
395b: pop %rbx
395c: pop %r14
395e: pop %r15
3960: pop %rbp
3961: ret
3962: cs nopw 0x0(%rax,%rax,1)
396c: nopl 0x0(%rax)
```

After this patch:

```
0000000000003910 <io_uring_register_file_alloc_range>:
3910: mov %esi,-0x10(%rsp) # set range.off
3914: mov %edx,-0xc(%rsp) # set range.len
3918: movq $0x0,-0x8(%rsp) # zero the resv
3921: mov 0xc4(%rdi),%edi
3927: lea -0x10(%rsp),%rdx
392c: mov $0x1ab,%eax
3931: mov $0x19,%esi
3936: xor %r10d,%r10d
3939: syscall
393b: ret
```

Signed-off-by: Ammar Faizi <[email protected]>
---
 src/register.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/src/register.c b/src/register.c
index 5fdc6e5..ac4c9e3 100644
--- a/src/register.c
+++ b/src/register.c
@@ -333,11 +333,10 @@ int io_uring_register_sync_cancel(struct io_uring *ring,
 int io_uring_register_file_alloc_range(struct io_uring *ring,
 				       unsigned off, unsigned len)
 {
-	struct io_uring_file_index_range range;
-
-	memset(&range, 0, sizeof(range));
-	range.off = off;
-	range.len = len;
+	struct io_uring_file_index_range range = {
+		.off = off,
+		.len = len
+	};
 
 	return __sys_io_uring_register(ring->ring_fd,
 				       IORING_REGISTER_FILE_ALLOC_RANGE, &range,
--
Ammar Faizi
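The correctness of this change rests on a standard C guarantee: members not
named in a designated initializer are zero-initialized, so the reserved field
still ends up as 0 without the memset(). A small standalone sketch follows,
using a hypothetical struct that mirrors the off/len/resv layout referenced
in the listing above (not the real uapi definition):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the uapi struct; only the shape matters here. */
struct file_index_range_example {
	uint32_t off;
	uint32_t len;
	uint64_t resv;
};

int main(void)
{
	/* Members not named in the initializer (here, resv) are zeroed. */
	struct file_index_range_example range = {
		.off = 16,
		.len = 32,
	};

	printf("off=%u len=%u resv=%llu\n",
	       range.off, range.len, (unsigned long long) range.resv);
	return 0;
}
```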
* Re: [PATCH liburing v1 2/2] register: Simplify `io_uring_register_file_alloc_range()` function
  2023-01-06 15:42 ` [PATCH liburing v1 2/2] register: Simplify `io_uring_register_file_alloc_range()` function Ammar Faizi
@ 2023-01-06 15:59 ` Alviro Iskandar Setiawan
  0 siblings, 0 replies; 6+ messages in thread

From: Alviro Iskandar Setiawan @ 2023-01-06 15:59 UTC
To: Ammar Faizi
Cc: Jens Axboe, Pavel Begunkov, Gilang Fachrezy, VNLX Kernel Department,
    GNU/Weeb Mailing List, io-uring Mailing List

On Fri, Jan 6, 2023 at 10:43 PM Ammar Faizi wrote:
> Use a struct initializer instead of memset(). It simplifies the C code
> plus effectively reduces the code size.
>
> Extra bonus on x86-64: it reduces the stack allocation because it
> doesn't need to allocate stack for the local variable @range. It can
> just use the 128 bytes of redzone below `%rsp` (the redzone is only
> available in a leaf function).
>
> Before this patch:
>
> ```
> 0000000000003910 <io_uring_register_file_alloc_range>:
> 3910: push %rbp
> 3911: push %r15
> 3913: push %r14
> 3915: push %rbx
> 3916: sub $0x18,%rsp
> 391a: mov %edx,%r14d
> 391d: mov %esi,%ebp
> 391f: mov %rdi,%rbx
> 3922: lea 0x8(%rsp),%r15
> 3927: mov $0x10,%edx
> 392c: mov %r15,%rdi
> 392f: xor %esi,%esi
> 3931: call 3a00 <__uring_memset>
> 3936: mov %ebp,0x8(%rsp)
> 393a: mov %r14d,0xc(%rsp)
> 393f: mov 0xc4(%rbx),%edi
> 3945: mov $0x1ab,%eax
> 394a: mov $0x19,%esi
> 394f: mov %r15,%rdx
> 3952: xor %r10d,%r10d
> 3955: syscall
> 3957: add $0x18,%rsp
> 395b: pop %rbx
> 395c: pop %r14
> 395e: pop %r15
> 3960: pop %rbp
> 3961: ret
> 3962: cs nopw 0x0(%rax,%rax,1)
> 396c: nopl 0x0(%rax)
> ```
>
> After this patch:
>
> ```
> 0000000000003910 <io_uring_register_file_alloc_range>:
> 3910: mov %esi,-0x10(%rsp) # set range.off
> 3914: mov %edx,-0xc(%rsp) # set range.len
> 3918: movq $0x0,-0x8(%rsp) # zero the resv
> 3921: mov 0xc4(%rdi),%edi
> 3927: lea -0x10(%rsp),%rdx
> 392c: mov $0x1ab,%eax
> 3931: mov $0x19,%esi
> 3936: xor %r10d,%r10d
> 3939: syscall
> 393b: ret
> ```
>
> Signed-off-by: Ammar Faizi <[email protected]>

Reviewed-by: Alviro Iskandar Setiawan <[email protected]>
* Re: [PATCH liburing v1 0/2] liburing micro-optimization
  2023-01-06 15:42 [PATCH liburing v1 0/2] liburing micro-optimization Ammar Faizi
  2023-01-06 15:42 ` [PATCH liburing v1 1/2] nolibc: Fix bloated memset due to unexpected vectorization Ammar Faizi
  2023-01-06 15:42 ` [PATCH liburing v1 2/2] register: Simplify `io_uring_register_file_alloc_range()` function Ammar Faizi
@ 2023-01-06 17:08 ` Jens Axboe
  2 siblings, 0 replies; 6+ messages in thread

From: Jens Axboe @ 2023-01-06 17:08 UTC
To: Ammar Faizi
Cc: Pavel Begunkov, Gilang Fachrezy, VNLX Kernel Department,
    Alviro Iskandar Setiawan, GNU/Weeb Mailing List, io-uring Mailing List

On Fri, 06 Jan 2023 22:42:57 +0700, Ammar Faizi wrote:
> This series contains liburing micro-optimizations. There are two patches
> in this series:
>
> ## Patch 1
> - Fix bloated memset due to unexpected vectorization.
>   Clang and GCC generate an insane vectorized memset() in nolibc.c.
>   liburing doesn't need such a powerful memset(). Add an empty inline ASM
>   to prevent the compilers from over-optimizing the memset().
>
> [...]

Applied, thanks!

[1/2] nolibc: Fix bloated memset due to unexpected vectorization
      commit: 913ca9a93fd67a5e5a911d71a33a6de7a1a41101
[2/2] register: Simplify `io_uring_register_file_alloc_range()` function
      commit: 8ab80b483518d51903c9eed24cf0e1ba826010fc

Best regards,
--
Jens Axboe
end of thread, other threads: [~2023-01-06 17:08 UTC | newest]

Thread overview: 6+ messages
2023-01-06 15:42 [PATCH liburing v1 0/2] liburing micro-optimization Ammar Faizi
2023-01-06 15:42 ` [PATCH liburing v1 1/2] nolibc: Fix bloated memset due to unexpected vectorization Ammar Faizi
2023-01-06 15:56   ` Alviro Iskandar Setiawan
2023-01-06 15:42 ` [PATCH liburing v1 2/2] register: Simplify `io_uring_register_file_alloc_range()` function Ammar Faizi
2023-01-06 15:59   ` Alviro Iskandar Setiawan
2023-01-06 17:08 ` [PATCH liburing v1 0/2] liburing micro-optimization Jens Axboe