* [PATCH liburing v1 0/2] liburing micro-optimization
From: Ammar Faizi @ 2023-01-06 15:42 UTC (permalink / raw)
To: Jens Axboe
Cc: Ammar Faizi, Pavel Begunkov, Gilang Fachrezy,
VNLX Kernel Department, Alviro Iskandar Setiawan,
GNU/Weeb Mailing List, io-uring Mailing List
From: Ammar Faizi <[email protected]>
Hi Jens,
This series contains two liburing micro-optimizations:
## Patch 1
- Fix bloated memset due to unexpected vectorization.
Clang and GCC generate an insane vectorized memset() in nolibc.c.
liburing doesn't need such a powerful memset(). Add an empty inline ASM
to prevent the compilers from over-optimizing the memset().
## Patch 2
- Simplify `io_uring_register_file_alloc_range()` function.
Use a struct initializer instead of memset(). This simplifies the C code
and reduces the code size.
Signed-off-by: Ammar Faizi <[email protected]>
---
Ammar Faizi (2):
nolibc: Fix bloated memset due to unexpected vectorization
register: Simplify `io_uring_register_file_alloc_range()` function
src/nolibc.c | 9 ++++++++-
src/register.c | 9 ++++-----
2 files changed, 12 insertions(+), 6 deletions(-)
base-commit: c76d392035fd271980faa297334268f2cd77d774
--
Ammar Faizi
* [PATCH liburing v1 1/2] nolibc: Fix bloated memset due to unexpected vectorization
From: Ammar Faizi @ 2023-01-06 15:42 UTC (permalink / raw)
To: Jens Axboe
Cc: Ammar Faizi, Pavel Begunkov, Gilang Fachrezy,
VNLX Kernel Department, Alviro Iskandar Setiawan,
GNU/Weeb Mailing List, io-uring Mailing List
From: Ammar Faizi <[email protected]>
Clang and GCC generate an insane vectorized memset() in nolibc.c.
liburing doesn't need such a powerful memset(). Add an empty inline ASM
to prevent the compilers from over-optimizing the memset().
For comparison, here is the assembly code generated by Clang.
Before this patch:
```
0000000000003a00 <__uring_memset>:
3a00: mov %rdi,%rax
3a03: test %rdx,%rdx
3a06: je 3b2c <__uring_memset+0x12c>
3a0c: cmp $0x8,%rdx
3a10: jae 3a19 <__uring_memset+0x19>
3a12: xor %ecx,%ecx
3a14: jmp 3b20 <__uring_memset+0x120>
3a19: movzbl %sil,%r8d
3a1d: cmp $0x20,%rdx
3a21: jae 3a2a <__uring_memset+0x2a>
3a23: xor %ecx,%ecx
3a25: jmp 3ae0 <__uring_memset+0xe0>
3a2a: mov %rdx,%rcx
3a2d: and $0xffffffffffffffe0,%rcx
3a31: movd %r8d,%xmm0
3a36: punpcklbw %xmm0,%xmm0
3a3a: pshuflw $0x0,%xmm0,%xmm0
3a3f: pshufd $0x0,%xmm0,%xmm0
3a44: lea -0x20(%rcx),%rdi
3a48: mov %rdi,%r10
3a4b: shr $0x5,%r10
3a4f: inc %r10
3a52: mov %r10d,%r9d
3a55: and $0x3,%r9d
3a59: cmp $0x60,%rdi
3a5d: jae 3a63 <__uring_memset+0x63>
3a5f: xor %edi,%edi
3a61: jmp 3aa9 <__uring_memset+0xa9>
3a63: and $0xfffffffffffffffc,%r10
3a67: xor %edi,%edi
3a69: nopl 0x0(%rax)
3a70: movdqu %xmm0,(%rax,%rdi,1)
3a75: movdqu %xmm0,0x10(%rax,%rdi,1)
3a7b: movdqu %xmm0,0x20(%rax,%rdi,1)
3a81: movdqu %xmm0,0x30(%rax,%rdi,1)
3a87: movdqu %xmm0,0x40(%rax,%rdi,1)
3a8d: movdqu %xmm0,0x50(%rax,%rdi,1)
3a93: movdqu %xmm0,0x60(%rax,%rdi,1)
3a99: movdqu %xmm0,0x70(%rax,%rdi,1)
3a9f: sub $0xffffffffffffff80,%rdi
3aa3: add $0xfffffffffffffffc,%r10
3aa7: jne 3a70 <__uring_memset+0x70>
3aa9: test %r9,%r9
3aac: je 3ad6 <__uring_memset+0xd6>
3aae: lea (%rdi,%rax,1),%r10
3ab2: add $0x10,%r10
3ab6: shl $0x5,%r9
3aba: xor %edi,%edi
3abc: nopl 0x0(%rax)
3ac0: movdqu %xmm0,-0x10(%r10,%rdi,1)
3ac7: movdqu %xmm0,(%r10,%rdi,1)
3acd: add $0x20,%rdi
3ad1: cmp %rdi,%r9
3ad4: jne 3ac0 <__uring_memset+0xc0>
3ad6: cmp %rdx,%rcx
3ad9: je 3b2c <__uring_memset+0x12c>
3adb: test $0x18,%dl
3ade: je 3b20 <__uring_memset+0x120>
3ae0: mov %rcx,%rdi
3ae3: mov %rdx,%rcx
3ae6: and $0xfffffffffffffff8,%rcx
3aea: movd %r8d,%xmm0
3aef: punpcklbw %xmm0,%xmm0
3af3: pshuflw $0x0,%xmm0,%xmm0
3af8: nopl 0x0(%rax,%rax,1)
3b00: movq %xmm0,(%rax,%rdi,1)
3b05: add $0x8,%rdi
3b09: cmp %rdi,%rcx
3b0c: jne 3b00 <__uring_memset+0x100>
3b0e: cmp %rdx,%rcx
3b11: je 3b2c <__uring_memset+0x12c>
3b13: data16 data16 data16 cs nopw 0x0(%rax,%rax,1)
3b20: mov %sil,(%rax,%rcx,1)
3b24: inc %rcx
3b27: cmp %rcx,%rdx
3b2a: jne 3b20 <__uring_memset+0x120>
3b2c: ret
3b2d: nopl (%rax)
```
After this patch:
```
0000000000003424 <__uring_memset>:
3424: mov %rdi,%rax
3427: test %rdx,%rdx
342a: je 343a <__uring_memset+0x16>
342c: xor %ecx,%ecx
342e: mov %sil,(%rax,%rcx,1)
3432: inc %rcx
3435: cmp %rcx,%rdx
3438: jne 342e <__uring_memset+0xa>
343a: ret
```
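For anyone who wants to try the technique in isolation, here is a minimal sketch of the same idea outside liburing. The name `bytewise_memset` is a hypothetical stand-in, not the liburing symbol, and the effect on code generation is an assumption that depends on the compiler version and flags: it matches the Clang output above, but other toolchains may behave differently.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for __uring_memset(), for illustration only.
 * Fills n bytes at s with the byte value c and returns s.
 */
void *bytewise_memset(void *s, int c, size_t n)
{
	unsigned char *p = s;
	size_t i;

	for (i = 0; i < n; i++) {
		p[i] = (unsigned char) c;

		/*
		 * Empty volatile asm: it emits no instructions, but the
		 * compiler has to keep one occurrence per iteration, which
		 * in practice stops it from merging iterations into wide
		 * vector stores.
		 */
		__asm__ volatile ("");
	}

	return s;
}
```

Building this at `-O2` or `-O3` and disassembling the object should show a plain byte loop, but that is worth checking on the toolchain at hand rather than taking as given.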
Signed-off-by: Ammar Faizi <[email protected]>
---
src/nolibc.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/src/nolibc.c b/src/nolibc.c
index 3207e33..ac81575 100644
--- a/src/nolibc.c
+++ b/src/nolibc.c
@@ -12,9 +12,16 @@ void *__uring_memset(void *s, int c, size_t n)
 	size_t i;
 	unsigned char *p = s;
-	for (i = 0; i < n; i++)
+	for (i = 0; i < n; i++) {
 		p[i] = (unsigned char) c;
+		/*
+		 * An empty inline ASM to avoid auto-vectorization
+		 * because it's too bloated for liburing.
+		 */
+		__asm__ volatile ("");
+	}
+
 	return s;
 }
--
Ammar Faizi
* [PATCH liburing v1 2/2] register: Simplify `io_uring_register_file_alloc_range()` function
From: Ammar Faizi @ 2023-01-06 15:42 UTC (permalink / raw)
To: Jens Axboe
Cc: Ammar Faizi, Pavel Begunkov, Gilang Fachrezy,
VNLX Kernel Department, Alviro Iskandar Setiawan,
GNU/Weeb Mailing List, io-uring Mailing List
From: Ammar Faizi <[email protected]>
Use a struct initializer instead of memset(). This simplifies the C code
and reduces the code size.
As an extra bonus on x86-64, the function no longer needs to adjust
`%rsp` to make room for the local variable @range. Without the memset()
call it becomes a leaf function, so the compiler can place @range in the
128-byte red zone below `%rsp` (compilers only use the red zone in leaf
functions).
Before this patch:
```
0000000000003910 <io_uring_register_file_alloc_range>:
3910: push %rbp
3911: push %r15
3913: push %r14
3915: push %rbx
3916: sub $0x18,%rsp
391a: mov %edx,%r14d
391d: mov %esi,%ebp
391f: mov %rdi,%rbx
3922: lea 0x8(%rsp),%r15
3927: mov $0x10,%edx
392c: mov %r15,%rdi
392f: xor %esi,%esi
3931: call 3a00 <__uring_memset>
3936: mov %ebp,0x8(%rsp)
393a: mov %r14d,0xc(%rsp)
393f: mov 0xc4(%rbx),%edi
3945: mov $0x1ab,%eax
394a: mov $0x19,%esi
394f: mov %r15,%rdx
3952: xor %r10d,%r10d
3955: syscall
3957: add $0x18,%rsp
395b: pop %rbx
395c: pop %r14
395e: pop %r15
3960: pop %rbp
3961: ret
3962: cs nopw 0x0(%rax,%rax,1)
396c: nopl 0x0(%rax)
```
After this patch:
```
0000000000003910 <io_uring_register_file_alloc_range>:
3910: mov %esi,-0x10(%rsp) # set range.off
3914: mov %edx,-0xc(%rsp) # set range.len
3918: movq $0x0,-0x8(%rsp) # zero the resv
3921: mov 0xc4(%rdi),%edi
3927: lea -0x10(%rsp),%rdx
392c: mov $0x1ab,%eax
3931: mov $0x19,%esi
3936: xor %r10d,%r10d
3939: syscall
393b: ret
```
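The equivalence this patch relies on can be checked with a small standalone sketch. The struct below is a hypothetical mirror of `struct io_uring_file_index_range`, with its layout inferred from the stores in the listing above (two 32-bit fields followed by an 8-byte reserved field), and the helper names are made up for illustration. In C, members omitted from a designated initializer are initialized to zero, so both helpers should produce the same bytes.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical mirror of struct io_uring_file_index_range.
 * The layout is an assumption inferred from the disassembly.
 */
struct file_index_range {
	uint32_t off;
	uint32_t len;
	uint64_t resv;
};

/* Old style: zero the whole struct, then assign the fields. */
struct file_index_range make_range_memset(unsigned off, unsigned len)
{
	struct file_index_range range;

	memset(&range, 0, sizeof(range));
	range.off = off;
	range.len = len;
	return range;
}

/* New style: designated initializer, resv is implicitly zeroed. */
struct file_index_range make_range_init(unsigned off, unsigned len)
{
	struct file_index_range range = {
		.off = off,
		.len = len
	};

	return range;
}
```

Because this layout has no padding, comparing the two results with memcmp() is a fair check. For a struct that does contain padding, the initializer form does not guarantee zeroed padding bytes, which matters when the struct is passed to the kernel.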
Signed-off-by: Ammar Faizi <[email protected]>
---
src/register.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/src/register.c b/src/register.c
index 5fdc6e5..ac4c9e3 100644
--- a/src/register.c
+++ b/src/register.c
@@ -333,11 +333,10 @@ int io_uring_register_sync_cancel(struct io_uring *ring,
 int io_uring_register_file_alloc_range(struct io_uring *ring,
 					unsigned off, unsigned len)
 {
-	struct io_uring_file_index_range range;
-
-	memset(&range, 0, sizeof(range));
-	range.off = off;
-	range.len = len;
+	struct io_uring_file_index_range range = {
+		.off = off,
+		.len = len
+	};
 	return __sys_io_uring_register(ring->ring_fd,
 				       IORING_REGISTER_FILE_ALLOC_RANGE, &range,
--
Ammar Faizi
* Re: [PATCH liburing v1 1/2] nolibc: Fix bloated memset due to unexpected vectorization
From: Alviro Iskandar Setiawan @ 2023-01-06 15:56 UTC (permalink / raw)
To: Ammar Faizi
Cc: Jens Axboe, Pavel Begunkov, Gilang Fachrezy,
VNLX Kernel Department, GNU/Weeb Mailing List,
io-uring Mailing List
On Fri, Jan 6, 2023 at 10:43 PM Ammar Faizi wrote:
> Clang and GCC generate an insane vectorized memset() in nolibc.c.
> liburing doesn't need such a powerful memset(). Add an empty inline ASM
> to prevent the compilers from over-optimizing the memset().
>
> [...]
>
> Signed-off-by: Ammar Faizi <[email protected]>
Reviewed-by: Alviro Iskandar Setiawan <[email protected]>
* Re: [PATCH liburing v1 2/2] register: Simplify `io_uring_register_file_alloc_range()` function
From: Alviro Iskandar Setiawan @ 2023-01-06 15:59 UTC (permalink / raw)
To: Ammar Faizi
Cc: Jens Axboe, Pavel Begunkov, Gilang Fachrezy,
VNLX Kernel Department, GNU/Weeb Mailing List,
io-uring Mailing List
On Fri, Jan 6, 2023 at 10:43 PM Ammar Faizi wrote:
> Use a struct initializer instead of memset(). It simplifies the C code
> plus effectively reduces the code size.
>
> [...]
>
> Signed-off-by: Ammar Faizi <[email protected]>
Reviewed-by: Alviro Iskandar Setiawan <[email protected]>
* Re: [PATCH liburing v1 0/2] liburing micro-optimization
From: Jens Axboe @ 2023-01-06 17:08 UTC (permalink / raw)
To: Ammar Faizi
Cc: Pavel Begunkov, Gilang Fachrezy, VNLX Kernel Department,
Alviro Iskandar Setiawan, GNU/Weeb Mailing List,
io-uring Mailing List
On Fri, 06 Jan 2023 22:42:57 +0700, Ammar Faizi wrote:
> This series contains two liburing micro-optimizations:
>
> ## Patch 1
> - Fix bloated memset due to unexpected vectorization.
> Clang and GCC generate an insane vectorized memset() in nolibc.c.
> liburing doesn't need such a powerful memset(). Add an empty inline ASM
> to prevent the compilers from over-optimizing the memset().
>
> [...]
Applied, thanks!
[1/2] nolibc: Fix bloated memset due to unexpected vectorization
commit: 913ca9a93fd67a5e5a911d71a33a6de7a1a41101
[2/2] register: Simplify `io_uring_register_file_alloc_range()` function
commit: 8ab80b483518d51903c9eed24cf0e1ba826010fc
Best regards,
--
Jens Axboe