* [RFC PATCH v1 1/5] tools/nolibc: x86-64: Use `rep movsb` for `memcpy()` and `memmove()`
2023-08-30 13:57 [RFC PATCH v1 0/5] nolibc x86-64 string functions Ammar Faizi
@ 2023-08-30 13:57 ` Ammar Faizi
2023-08-30 13:57 ` [RFC PATCH v1 2/5] tools/nolibc: x86-64: Use `rep stosb` for `memset()` Ammar Faizi
` (5 subsequent siblings)
6 siblings, 0 replies; 29+ messages in thread
From: Ammar Faizi @ 2023-08-30 13:57 UTC (permalink / raw)
To: Willy Tarreau, Thomas Weißschuh
Cc: Ammar Faizi, Nicholas Rosenberg, Alviro Iskandar Setiawan,
Michael William Jonathan, GNU/Weeb Mailing List,
Linux Kernel Mailing List
Simplify memcpy() and memmove() on the x86-64 arch.
The x86-64 arch has a 'rep movsb' instruction, which can perform
memcpy() using only a single instruction, given:
%rdi = destination
%rsi = source
%rcx = length
It can also handle the overlapping copy case by setting DF=1
(backward copy), which allows it to serve as the memmove() implementation.
Before this patch:
```
00000000000010ab <memmove>:
10ab: 48 89 f8 mov %rdi,%rax
10ae: 31 c9 xor %ecx,%ecx
10b0: 48 39 f7 cmp %rsi,%rdi
10b3: 48 83 d1 ff adc $0xffffffffffffffff,%rcx
10b7: 48 85 d2 test %rdx,%rdx
10ba: 74 25 je 10e1 <memmove+0x36>
10bc: 48 83 c9 01 or $0x1,%rcx
10c0: 48 39 f0 cmp %rsi,%rax
10c3: 48 c7 c7 ff ff ff ff mov $0xffffffffffffffff,%rdi
10ca: 48 0f 43 fa cmovae %rdx,%rdi
10ce: 48 01 cf add %rcx,%rdi
10d1: 44 8a 04 3e mov (%rsi,%rdi,1),%r8b
10d5: 44 88 04 38 mov %r8b,(%rax,%rdi,1)
10d9: 48 01 cf add %rcx,%rdi
10dc: 48 ff ca dec %rdx
10df: 75 f0 jne 10d1 <memmove+0x26>
10e1: c3 ret
00000000000010e2 <memcpy>:
10e2: 48 89 f8 mov %rdi,%rax
10e5: 48 85 d2 test %rdx,%rdx
10e8: 74 12 je 10fc <memcpy+0x1a>
10ea: 31 c9 xor %ecx,%ecx
10ec: 40 8a 3c 0e mov (%rsi,%rcx,1),%dil
10f0: 40 88 3c 08 mov %dil,(%rax,%rcx,1)
10f4: 48 ff c1 inc %rcx
10f7: 48 39 ca cmp %rcx,%rdx
10fa: 75 f0 jne 10ec <memcpy+0xa>
10fc: c3 ret
```
After this patch:
```
0000000000001093 <memmove>:
1093: 48 89 f8 mov %rdi,%rax
1096: 48 89 d1 mov %rdx,%rcx
1099: 48 8d 7c 0f ff lea -0x1(%rdi,%rcx,1),%rdi
109e: 48 8d 74 0e ff lea -0x1(%rsi,%rcx,1),%rsi
10a3: fd std
10a4: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
10a6: fc cld
10a7: c3 ret
00000000000010a8 <memcpy>:
10a8: 48 89 f8 mov %rdi,%rax
10ab: 48 89 d1 mov %rdx,%rcx
10ae: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
10b0: c3 ret
```
Signed-off-by: Ammar Faizi <[email protected]>
---
tools/include/nolibc/arch-x86_64.h | 28 ++++++++++++++++++++++++++++
tools/include/nolibc/string.h | 4 ++++
2 files changed, 32 insertions(+)
diff --git a/tools/include/nolibc/arch-x86_64.h b/tools/include/nolibc/arch-x86_64.h
index e5ccb926c90306b6..c5162170a2ccdff1 100644
--- a/tools/include/nolibc/arch-x86_64.h
+++ b/tools/include/nolibc/arch-x86_64.h
@@ -173,4 +173,32 @@ void __attribute__((weak, noreturn, optimize("Os", "omit-frame-pointer"))) __no_
__builtin_unreachable();
}
+#define NOLIBC_ARCH_HAS_MEMMOVE
+void *memmove(void *dst, const void *src, size_t len);
+
+#define NOLIBC_ARCH_HAS_MEMCPY
+void *memcpy(void *dst, const void *src, size_t len);
+
+__asm__ (
+".section .text.nolibc_memmove\n"
+".weak memmove\n"
+"memmove:\n"
+ "movq %rdi, %rax\n"
+ "movq %rdx, %rcx\n"
+ "leaq -1(%rdi, %rcx), %rdi\n"
+ "leaq -1(%rsi, %rcx), %rsi\n"
+ "std\n"
+ "rep movsb\n"
+ "cld\n"
+ "retq\n"
+
+".section .text.nolibc_memcpy\n"
+".weak memcpy\n"
+"memcpy:\n"
+ "movq %rdi, %rax\n"
+ "movq %rdx, %rcx\n"
+ "rep movsb\n"
+ "retq\n"
+);
+
#endif /* _NOLIBC_ARCH_X86_64_H */
diff --git a/tools/include/nolibc/string.h b/tools/include/nolibc/string.h
index 0c2e06c7c4772bc6..6eca267ec6fa7177 100644
--- a/tools/include/nolibc/string.h
+++ b/tools/include/nolibc/string.h
@@ -49,6 +49,7 @@ void *_nolibc_memcpy_down(void *dst, const void *src, size_t len)
return dst;
}
+#ifndef NOLIBC_ARCH_HAS_MEMMOVE
/* might be ignored by the compiler without -ffreestanding, then found as
* missing.
*/
@@ -72,13 +73,16 @@ void *memmove(void *dst, const void *src, size_t len)
}
return dst;
}
+#endif /* #ifndef NOLIBC_ARCH_HAS_MEMMOVE */
+#ifndef NOLIBC_ARCH_HAS_MEMCPY
/* must be exported, as it's used by libgcc on ARM */
__attribute__((weak,unused,section(".text.nolibc_memcpy")))
void *memcpy(void *dst, const void *src, size_t len)
{
return _nolibc_memcpy_up(dst, src, len);
}
+#endif /* #ifndef NOLIBC_ARCH_HAS_MEMCPY */
/* might be ignored by the compiler without -ffreestanding, then found as
* missing.
--
Ammar Faizi
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH v1 2/5] tools/nolibc: x86-64: Use `rep stosb` for `memset()`
2023-08-30 13:57 [RFC PATCH v1 0/5] nolibc x86-64 string functions Ammar Faizi
2023-08-30 13:57 ` [RFC PATCH v1 1/5] tools/nolibc: x86-64: Use `rep movsb` for `memcpy()` and `memmove()` Ammar Faizi
@ 2023-08-30 13:57 ` Ammar Faizi
2023-08-30 14:08 ` Alviro Iskandar Setiawan
2023-08-30 14:24 ` Alviro Iskandar Setiawan
2023-08-30 13:57 ` [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()` Ammar Faizi
` (4 subsequent siblings)
6 siblings, 2 replies; 29+ messages in thread
From: Ammar Faizi @ 2023-08-30 13:57 UTC (permalink / raw)
To: Willy Tarreau, Thomas Weißschuh
Cc: Ammar Faizi, Nicholas Rosenberg, Alviro Iskandar Setiawan,
Michael William Jonathan, GNU/Weeb Mailing List,
Linux Kernel Mailing List
Simplify memset() on the x86-64 arch.
The x86-64 arch has a 'rep stosb' instruction, which can perform
memset() using only a single instruction, given:
%al = value (just like the second argument of memset())
%rdi = destination
%rcx = length
Before this patch:
```
00000000000010c9 <memset>:
10c9: 48 89 f8 mov %rdi,%rax
10cc: 48 85 d2 test %rdx,%rdx
10cf: 74 0e je 10df <memset+0x16>
10d1: 31 c9 xor %ecx,%ecx
10d3: 40 88 34 08 mov %sil,(%rax,%rcx,1)
10d7: 48 ff c1 inc %rcx
10da: 48 39 ca cmp %rcx,%rdx
10dd: 75 f4 jne 10d3 <memset+0xa>
10df: c3 ret
```
After this patch:
```
00000000000010b1 <memset>:
10b1: 48 89 f0 mov %rsi,%rax
10b4: 48 89 d1 mov %rdx,%rcx
10b7: 48 89 fa mov %rdi,%rdx
10ba: f3 aa rep stos %al,%es:(%rdi)
10bc: 48 89 d0 mov %rdx,%rax
10bf: c3 ret
```
Signed-off-by: Ammar Faizi <[email protected]>
---
tools/include/nolibc/arch-x86_64.h | 13 +++++++++++++
tools/include/nolibc/string.h | 2 ++
2 files changed, 15 insertions(+)
diff --git a/tools/include/nolibc/arch-x86_64.h b/tools/include/nolibc/arch-x86_64.h
index c5162170a2ccdff1..42f2674ad1ecdd64 100644
--- a/tools/include/nolibc/arch-x86_64.h
+++ b/tools/include/nolibc/arch-x86_64.h
@@ -179,6 +179,9 @@ void *memmove(void *dst, const void *src, size_t len);
#define NOLIBC_ARCH_HAS_MEMCPY
void *memcpy(void *dst, const void *src, size_t len);
+#define NOLIBC_ARCH_HAS_MEMSET
+void *memset(void *dst, int c, size_t len);
+
__asm__ (
".section .text.nolibc_memmove\n"
".weak memmove\n"
@@ -199,6 +202,16 @@ __asm__ (
"movq %rdx, %rcx\n"
"rep movsb\n"
"retq\n"
+
+".section .text.nolibc_memset\n"
+".weak memset\n"
+"memset:\n"
+ "movq %rsi, %rax\n"
+ "movq %rdx, %rcx\n"
+ "movq %rdi, %rdx\n"
+ "rep stosb\n"
+ "movq %rdx, %rax\n"
+ "retq\n"
);
#endif /* _NOLIBC_ARCH_X86_64_H */
diff --git a/tools/include/nolibc/string.h b/tools/include/nolibc/string.h
index 6eca267ec6fa7177..1bad6121ef8c4ab5 100644
--- a/tools/include/nolibc/string.h
+++ b/tools/include/nolibc/string.h
@@ -84,6 +84,7 @@ void *memcpy(void *dst, const void *src, size_t len)
}
#endif /* #ifndef NOLIBC_ARCH_HAS_MEMCPY */
+#ifndef NOLIBC_ARCH_HAS_MEMSET
/* might be ignored by the compiler without -ffreestanding, then found as
* missing.
*/
@@ -99,6 +100,7 @@ void *memset(void *dst, int b, size_t len)
}
return dst;
}
+#endif /* #ifndef NOLIBC_ARCH_HAS_MEMSET */
static __attribute__((unused))
char *strchr(const char *s, int c)
--
Ammar Faizi
* Re: [RFC PATCH v1 2/5] tools/nolibc: x86-64: Use `rep stosb` for `memset()`
2023-08-30 13:57 ` [RFC PATCH v1 2/5] tools/nolibc: x86-64: Use `rep stosb` for `memset()` Ammar Faizi
@ 2023-08-30 14:08 ` Alviro Iskandar Setiawan
2023-08-30 14:13 ` Ammar Faizi
2023-08-30 14:24 ` Alviro Iskandar Setiawan
1 sibling, 1 reply; 29+ messages in thread
From: Alviro Iskandar Setiawan @ 2023-08-30 14:08 UTC (permalink / raw)
To: Ammar Faizi
Cc: Willy Tarreau, Thomas Weißschuh, Nicholas Rosenberg,
Michael William Jonathan, GNU/Weeb Mailing List,
Linux Kernel Mailing List
On Wed, Aug 30, 2023 at 8:57 PM Ammar Faizi wrote:
> +".section .text.nolibc_memset\n"
> +".weak memset\n"
> +"memset:\n"
> + "movq %rsi, %rax\n"
> + "movq %rdx, %rcx\n"
> + "movq %rdi, %rdx\n"
> + "rep stosb\n"
> + "movq %rdx, %rax\n"
> + "retq\n"
The first instruction could be:
movl %esi, %eax
That's smaller. Also, the second argument of memset() is an int
anyway, so there is no need to have a full 64-bit copy of %rsi in
%rax.
-- Viro
* Re: [RFC PATCH v1 2/5] tools/nolibc: x86-64: Use `rep stosb` for `memset()`
2023-08-30 14:08 ` Alviro Iskandar Setiawan
@ 2023-08-30 14:13 ` Ammar Faizi
0 siblings, 0 replies; 29+ messages in thread
From: Ammar Faizi @ 2023-08-30 14:13 UTC (permalink / raw)
To: Alviro Iskandar Setiawan
Cc: Willy Tarreau, Thomas Weißschuh, Nicholas Rosenberg,
Michael William Jonathan, GNU/Weeb Mailing List,
Linux Kernel Mailing List
On Wed, Aug 30, 2023 at 09:08:05PM +0700, Alviro Iskandar Setiawan wrote:
> The first instruction could be:
>
> movl %esi, %eax
>
> That's smaller. Also, the second argument of memset() is an int
> anyway, so there is no need to have a full 64-bit copy of %rsi in
> %rax.
Agree, noted.
--
Ammar Faizi
* Re: [RFC PATCH v1 2/5] tools/nolibc: x86-64: Use `rep stosb` for `memset()`
2023-08-30 13:57 ` [RFC PATCH v1 2/5] tools/nolibc: x86-64: Use `rep stosb` for `memset()` Ammar Faizi
2023-08-30 14:08 ` Alviro Iskandar Setiawan
@ 2023-08-30 14:24 ` Alviro Iskandar Setiawan
2023-08-30 15:09 ` Ammar Faizi
1 sibling, 1 reply; 29+ messages in thread
From: Alviro Iskandar Setiawan @ 2023-08-30 14:24 UTC (permalink / raw)
To: Ammar Faizi
Cc: Willy Tarreau, Thomas Weißschuh, Nicholas Rosenberg,
Michael William Jonathan, GNU/Weeb Mailing List,
Linux Kernel Mailing List
On Wed, Aug 30, 2023 at 8:57 PM Ammar Faizi wrote:
> 00000000000010b1 <memset>:
> 10b1: 48 89 f0 mov %rsi,%rax
> 10b4: 48 89 d1 mov %rdx,%rcx
> 10b7: 48 89 fa mov %rdi,%rdx
> 10ba: f3 aa rep stos %al,%es:(%rdi)
> 10bc: 48 89 d0 mov %rdx,%rax
> 10bf: c3 ret
Just a small idea to shrink this more, "mov %rdi, %rdx" and "mov %rdx,
%rax" can be replaced with "push %rdi" and "pop %rax" (they are just a
byte). So we can save 4 bytes more.
0000000000001500 <memset>:
1500: 48 89 f0 mov %rsi,%rax
1503: 48 89 d1 mov %rdx,%rcx
1506: 57 push %rdi
1507: f3 aa rep stos %al,%es:(%rdi)
1509: 58 pop %rax
150a: c3 ret
But I know you don't like it because it costs extra memory access.
-- Viro
* Re: [RFC PATCH v1 2/5] tools/nolibc: x86-64: Use `rep stosb` for `memset()`
2023-08-30 14:24 ` Alviro Iskandar Setiawan
@ 2023-08-30 15:09 ` Ammar Faizi
2023-08-30 15:23 ` Willy Tarreau
0 siblings, 1 reply; 29+ messages in thread
From: Ammar Faizi @ 2023-08-30 15:09 UTC (permalink / raw)
To: Alviro Iskandar Setiawan
Cc: Willy Tarreau, Thomas Weißschuh, Nicholas Rosenberg,
Michael William Jonathan, GNU/Weeb Mailing List,
Linux Kernel Mailing List
On Wed, Aug 30, 2023 at 09:24:45PM +0700, Alviro Iskandar Setiawan wrote:
> Just a small idea to shrink this more, "mov %rdi, %rdx" and "mov %rdx,
> %rax" can be replaced with "push %rdi" and "pop %rax" (they are just a
> byte). So we can save 4 bytes more.
>
> 0000000000001500 <memset>:
> 1500: 48 89 f0 mov %rsi,%rax
> 1503: 48 89 d1 mov %rdx,%rcx
> 1506: 57 push %rdi
> 1507: f3 aa rep stos %al,%es:(%rdi)
> 1509: 58 pop %rax
> 150a: c3 ret
>
> But I know you don't like it because it costs extra memory access.
Yes, that's an extra memory access. But I believe it doesn't hurt
someone targeting -Os. In many cases, the compilers use push/pop to
align the stack before a 'call' instruction. If they want to avoid extra
memory access, they could have used "subq $8, %rsp" and "addq $8, %rsp".
For example: https://godbolt.org/z/Tzc1xWGEn
C code:
```
int fx(int b);
int fy(int a)
{
return 1 + fx(a);
}
```
Targeting -Os, both clang and gcc compile it to:
```
fy:
pushq %rax
call fx
popq %rdx
incl %eax
ret
```
Targeting -O2:
```
fy:
subq $8, %rsp
call fx
addq $8, %rsp
addl $1, %eax
ret
```
That pushq/popq pair doesn't actually preserve anything; it's just to
align the %rsp at 16 bytes on 'call'. IOW, sometimes having extra memory
access to get a smaller code size is acceptable.
--
Ammar Faizi
* Re: [RFC PATCH v1 2/5] tools/nolibc: x86-64: Use `rep stosb` for `memset()`
2023-08-30 15:09 ` Ammar Faizi
@ 2023-08-30 15:23 ` Willy Tarreau
2023-08-30 15:44 ` Ammar Faizi
0 siblings, 1 reply; 29+ messages in thread
From: Willy Tarreau @ 2023-08-30 15:23 UTC (permalink / raw)
To: Ammar Faizi
Cc: Alviro Iskandar Setiawan, Thomas Weißschuh,
Nicholas Rosenberg, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
On Wed, Aug 30, 2023 at 10:09:51PM +0700, Ammar Faizi wrote:
> On Wed, Aug 30, 2023 at 09:24:45PM +0700, Alviro Iskandar Setiawan wrote:
> > Just a small idea to shrink this more, "mov %rdi, %rdx" and "mov %rdx,
> > %rax" can be replaced with "push %rdi" and "pop %rax" (they are just a
> > byte). So we can save 4 bytes more.
> >
> > 0000000000001500 <memset>:
> > 1500: 48 89 f0 mov %rsi,%rax
> > 1503: 48 89 d1 mov %rdx,%rcx
> > 1506: 57 push %rdi
> > 1507: f3 aa rep stos %al,%es:(%rdi)
> > 1509: 58 pop %rax
> > 150a: c3 ret
> >
> > But I know you don't like it because it costs extra memory access.
>
> Yes, that's an extra memory access. But I believe it doesn't hurt
> someone targeting -Os. In many cases, the compilers use push/pop to
> align the stack before a 'call' instruction. If they want to avoid extra
> memory access, they could have used "subq $8, %rsp" and "addq $8, %rsp".
Then "xchg %esi, %eax" is just one byte with no memory access ;-)
Willy
* Re: [RFC PATCH v1 2/5] tools/nolibc: x86-64: Use `rep stosb` for `memset()`
2023-08-30 15:23 ` Willy Tarreau
@ 2023-08-30 15:44 ` Ammar Faizi
2023-08-30 15:51 ` Willy Tarreau
0 siblings, 1 reply; 29+ messages in thread
From: Ammar Faizi @ 2023-08-30 15:44 UTC (permalink / raw)
To: Willy Tarreau
Cc: Alviro Iskandar Setiawan, Thomas Weißschuh,
Nicholas Rosenberg, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
On Wed, Aug 30, 2023 at 05:23:22PM +0200, Willy Tarreau wrote:
> Then "xchg %esi, %eax" is just one byte with no memory access ;-)
Perfect!
Now I got this, shorter than "movl %esi, %eax":
```
0000000000001500 <memset>:
1500: 96 xchg %eax,%esi
1501: 48 89 d1 mov %rdx,%rcx
1504: 57 push %rdi
1505: f3 aa rep stos %al,%es:(%rdi)
1507: 58 pop %rax
1508: c3 ret
```
Unfortunately, the xchg trick doesn't yield smaller machine code for
%rdx, %rcx. Lol.
--
Ammar Faizi
* Re: [RFC PATCH v1 2/5] tools/nolibc: x86-64: Use `rep stosb` for `memset()`
2023-08-30 15:44 ` Ammar Faizi
@ 2023-08-30 15:51 ` Willy Tarreau
2023-08-30 16:08 ` Ammar Faizi
0 siblings, 1 reply; 29+ messages in thread
From: Willy Tarreau @ 2023-08-30 15:51 UTC (permalink / raw)
To: Ammar Faizi
Cc: Alviro Iskandar Setiawan, Thomas Weißschuh,
Nicholas Rosenberg, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
On Wed, Aug 30, 2023 at 10:44:53PM +0700, Ammar Faizi wrote:
> On Wed, Aug 30, 2023 at 05:23:22PM +0200, Willy Tarreau wrote:
> > Then "xchg %esi, %eax" is just one byte with no memory access ;-)
>
> Perfect!
>
> Now I got this, shorter than "movl %esi, %eax":
> ```
> 0000000000001500 <memset>:
> 1500: 96 xchg %eax,%esi
> 1501: 48 89 d1 mov %rdx,%rcx
> 1504: 57 push %rdi
> 1505: f3 aa rep stos %al,%es:(%rdi)
> 1507: 58 pop %rax
> 1508: c3 ret
> ```
>
> Unfortunately, the xchg trick doesn't yield smaller machine code for
> %rdx, %rcx. Lol.
Normal, that's because historically "xchg ax, regX" was a single-byte 0x9X
on 8086, then it turned to 32-bit keeping the same encoding, like many
instructions (note that NOP is encoded as xchg ax,ax). It remains short
when you can sacrifice the other register, or restore it later using yet
another xchg. For rcx/rdx a push/pop could do it as they should also be
a single-byte 0x5X even in long mode unless I'm mistaken. Thus if you
absolutely want to squeeze that 9th byte to end up with an 8-byte function
you could probably do:
xchg %eax, %esi 1
push %rdx 1
pop %rcx 1
push %rdi 1
rep movsb 2
pop %rax 1
ret 1
------------- Total: 8 bytes :-)
Willy
* Re: [RFC PATCH v1 2/5] tools/nolibc: x86-64: Use `rep stosb` for `memset()`
2023-08-30 15:51 ` Willy Tarreau
@ 2023-08-30 16:08 ` Ammar Faizi
2023-08-30 16:11 ` Alviro Iskandar Setiawan
0 siblings, 1 reply; 29+ messages in thread
From: Ammar Faizi @ 2023-08-30 16:08 UTC (permalink / raw)
To: Willy Tarreau
Cc: Alviro Iskandar Setiawan, Thomas Weißschuh,
Nicholas Rosenberg, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
On Wed, Aug 30, 2023 at 05:51:52PM +0200, Willy Tarreau wrote:
> Normal, that's because historically "xchg ax, regX" was a single-byte 0x9X
> on 8086, then it turned to 32-bit keeping the same encoding, like many
> instructions (note that NOP is encoded as xchg ax,ax). It remains short
> when you can sacrifice the other register, or restore it later using yet
> another xchg. For rcx/rdx a push/pop could do it as they should also be
> a single-byte 0x5X even in long mode unless I'm mistaken. Thus if you
> absolutely want to squeeze that 9th byte to end up with an 8-byte function
> you could probably do:
>
> xchg %eax, %esi 1
> push %rdx 1
> pop %rcx 1
> push %rdi 1
> rep movsb 2 [sic]
> pop %rax 1
> ret 1
> ------------- Total: 8 bytes :-)
Fun!
We're not doing a code golf game, though. So, I think I will leave the
"mov %rdx, %rcx" as is. Otherwise, I would be tempted to do that all
over the place.
--
Ammar Faizi
* Re: [RFC PATCH v1 2/5] tools/nolibc: x86-64: Use `rep stosb` for `memset()`
2023-08-30 16:08 ` Ammar Faizi
@ 2023-08-30 16:11 ` Alviro Iskandar Setiawan
0 siblings, 0 replies; 29+ messages in thread
From: Alviro Iskandar Setiawan @ 2023-08-30 16:11 UTC (permalink / raw)
To: Ammar Faizi
Cc: Willy Tarreau, Thomas Weißschuh, Nicholas Rosenberg,
Michael William Jonathan, GNU/Weeb Mailing List,
Linux Kernel Mailing List
On Wed, Aug 30, 2023 at 11:08 PM Ammar Faizi wrote:
> On Wed, Aug 30, 2023 at 05:51:52PM +0200, Willy Tarreau wrote:
> > xchg %eax, %esi 1
> > push %rdx 1
> > pop %rcx 1
> > push %rdi 1
> > rep movsb 2 [sic]
> > pop %rax 1
> > ret 1
> > ------------- Total: 8 bytes :-)
That's beautiful!
-- Viro
* [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`
2023-08-30 13:57 [RFC PATCH v1 0/5] nolibc x86-64 string functions Ammar Faizi
2023-08-30 13:57 ` [RFC PATCH v1 1/5] tools/nolibc: x86-64: Use `rep movsb` for `memcpy()` and `memmove()` Ammar Faizi
2023-08-30 13:57 ` [RFC PATCH v1 2/5] tools/nolibc: x86-64: Use `rep stosb` for `memset()` Ammar Faizi
@ 2023-08-30 13:57 ` Ammar Faizi
2023-08-30 21:26 ` Willy Tarreau
2023-08-30 13:57 ` [RFC PATCH v1 4/5] tools/nolibc: string: Remove the `_nolibc_memcpy_down()` function Ammar Faizi
` (3 subsequent siblings)
6 siblings, 1 reply; 29+ messages in thread
From: Ammar Faizi @ 2023-08-30 13:57 UTC (permalink / raw)
To: Willy Tarreau, Thomas Weißschuh
Cc: Ammar Faizi, Nicholas Rosenberg, Alviro Iskandar Setiawan,
Michael William Jonathan, GNU/Weeb Mailing List,
Linux Kernel Mailing List
Simplify memcmp() on the x86-64 arch.
The x86-64 arch has a 'rep cmpsb' instruction, which can be used to
implement the memcmp() function.
%rdi = source 1
%rsi = source 2
%rcx = length
Signed-off-by: Ammar Faizi <[email protected]>
---
tools/include/nolibc/arch-x86_64.h | 19 +++++++++++++++++++
tools/include/nolibc/string.h | 2 ++
2 files changed, 21 insertions(+)
diff --git a/tools/include/nolibc/arch-x86_64.h b/tools/include/nolibc/arch-x86_64.h
index 42f2674ad1ecdd64..6c1b54ba9f774e7b 100644
--- a/tools/include/nolibc/arch-x86_64.h
+++ b/tools/include/nolibc/arch-x86_64.h
@@ -214,4 +214,23 @@ __asm__ (
"retq\n"
);
+#define NOLIBC_ARCH_HAS_MEMCMP
+static int memcmp(const void *s1, const void *s2, size_t n)
+{
+ const unsigned char *p1 = s1;
+ const unsigned char *p2 = s2;
+
+ if (!n)
+ return 0;
+
+ __asm__ volatile (
+ "rep cmpsb"
+ : "+D"(p2), "+S"(p1), "+c"(n)
+ : "m"(*(const unsigned char (*)[n])s1),
+ "m"(*(const unsigned char (*)[n])s2)
+ );
+
+ return p1[-1] - p2[-1];
+}
+
#endif /* _NOLIBC_ARCH_X86_64_H */
diff --git a/tools/include/nolibc/string.h b/tools/include/nolibc/string.h
index 1bad6121ef8c4ab5..3c941289d5dd0985 100644
--- a/tools/include/nolibc/string.h
+++ b/tools/include/nolibc/string.h
@@ -15,6 +15,7 @@ static void *malloc(size_t len);
* As much as possible, please keep functions alphabetically sorted.
*/
+#ifndef NOLIBC_ARCH_HAS_MEMCMP
static __attribute__((unused))
int memcmp(const void *s1, const void *s2, size_t n)
{
@@ -26,6 +27,7 @@ int memcmp(const void *s1, const void *s2, size_t n)
}
return c1;
}
+#endif /* #ifndef NOLIBC_ARCH_HAS_MEMCMP */
static __attribute__((unused))
void *_nolibc_memcpy_up(void *dst, const void *src, size_t len)
--
Ammar Faizi
* Re: [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`
2023-08-30 13:57 ` [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()` Ammar Faizi
@ 2023-08-30 21:26 ` Willy Tarreau
2023-09-01 3:24 ` Ammar Faizi
2023-09-04 8:26 ` David Laight
0 siblings, 2 replies; 29+ messages in thread
From: Willy Tarreau @ 2023-08-30 21:26 UTC (permalink / raw)
To: Ammar Faizi
Cc: Thomas Weißschuh, Nicholas Rosenberg,
Alviro Iskandar Setiawan, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
On Wed, Aug 30, 2023 at 08:57:24PM +0700, Ammar Faizi wrote:
> Simplify memcmp() on the x86-64 arch.
>
> The x86-64 arch has a 'rep cmpsb' instruction, which can be used to
> implement the memcmp() function.
>
> %rdi = source 1
> %rsi = source 2
> %rcx = length
>
> Signed-off-by: Ammar Faizi <[email protected]>
> ---
> tools/include/nolibc/arch-x86_64.h | 19 +++++++++++++++++++
> tools/include/nolibc/string.h | 2 ++
> 2 files changed, 21 insertions(+)
>
> diff --git a/tools/include/nolibc/arch-x86_64.h b/tools/include/nolibc/arch-x86_64.h
> index 42f2674ad1ecdd64..6c1b54ba9f774e7b 100644
> --- a/tools/include/nolibc/arch-x86_64.h
> +++ b/tools/include/nolibc/arch-x86_64.h
> @@ -214,4 +214,23 @@ __asm__ (
> "retq\n"
> );
>
> +#define NOLIBC_ARCH_HAS_MEMCMP
> +static int memcmp(const void *s1, const void *s2, size_t n)
> +{
> + const unsigned char *p1 = s1;
> + const unsigned char *p2 = s2;
> +
> + if (!n)
> + return 0;
> +
> + __asm__ volatile (
> + "rep cmpsb"
> + : "+D"(p2), "+S"(p1), "+c"(n)
> + : "m"(*(const unsigned char (*)[n])s1),
> + "m"(*(const unsigned char (*)[n])s2)
> + );
> +
> + return p1[-1] - p2[-1];
> +}
Out of curiosity, given that you implemented the 3 other ones directly
in an asm statement, is there a particular reason this one mixes a bit
of C and asm ? It would probably be something around this, in the same
vein:
memcmp:
xchg %esi,%eax // source1
mov %rdx,%rcx // count
rep cmpsb // source2 in rdi; sets ZF on equal, CF if src1<src2
seta %al // 0 if src2 <= src1, 1 if src2 > src1
sbb $0, %al // 0 if src2 == src1, -1 if src2 < src1, 1 if src2 > src1
movsx %al, %eax // sign extend to %eax
ret
Note that the output logic could have to be revisited, I'm not certain but
at first glance it looks valid.
Regards,
Willy
* Re: [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`
2023-08-30 21:26 ` Willy Tarreau
@ 2023-09-01 3:24 ` Ammar Faizi
2023-09-01 3:35 ` Willy Tarreau
2023-09-04 8:26 ` David Laight
1 sibling, 1 reply; 29+ messages in thread
From: Ammar Faizi @ 2023-09-01 3:24 UTC (permalink / raw)
To: Willy Tarreau
Cc: Thomas Weißschuh, Nicholas Rosenberg,
Alviro Iskandar Setiawan, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
On Wed, Aug 30, 2023 at 11:26:57PM +0200, Willy Tarreau wrote:
> Out of curiosity, given that you implemented the 3 other ones directly
> in an asm statement, is there a particular reason this one mixes a bit
> of C and asm ?
Because this one may be unused. The others are explicitly exported.
> It would probably be something around this, in the same vein:
>
> memcmp:
> xchg %esi,%eax // source1
> mov %rdx,%rcx // count
> rep cmpsb // source2 in rdi; sets ZF on equal, CF if src1<src2
> seta %al // 0 if src2 <= src1, 1 if src2 > src1
> sbb $0, %al // 0 if src2 == src1, -1 if src2 < src1, 1 if src2 > src1
> movsx %al, %eax // sign extend to %eax
> ret
>
> Note that the output logic could have to be revisited, I'm not certain but
> at first glance it looks valid.
After thinking about this more, I think I'll drop the memcmp() patch
because it will prevent optimization when comparing a small value.
For example, without __asm__:
memcmp(var, "abcd", 4);
may compile to:
cmpl $0x64636261, %reg
...something...
But with __asm__, the compiler can't do that. Thus, it's not worth
optimizing the memcmp() in this case.
--
Ammar Faizi
* Re: [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`
2023-09-01 3:24 ` Ammar Faizi
@ 2023-09-01 3:35 ` Willy Tarreau
2023-09-01 7:27 ` Ammar Faizi
0 siblings, 1 reply; 29+ messages in thread
From: Willy Tarreau @ 2023-09-01 3:35 UTC (permalink / raw)
To: Ammar Faizi
Cc: Thomas Weißschuh, Nicholas Rosenberg,
Alviro Iskandar Setiawan, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
On Fri, Sep 01, 2023 at 10:24:42AM +0700, Ammar Faizi wrote:
> On Wed, Aug 30, 2023 at 11:26:57PM +0200, Willy Tarreau wrote:
> > Out of curiosity, given that you implemented the 3 other ones directly
> > in an asm statement, is there a particular reason this one mixes a bit
> > of C and asm ?
>
> Because this one may be unused. The others are explicitly exported.
Makes sense, indeed.
> > It would probably be something around this, in the same vein:
> >
> > memcmp:
> > xchg %esi,%eax // source1
> > mov %rdx,%rcx // count
> > rep cmpsb // source2 in rdi; sets ZF on equal, CF if src1<src2
> > seta %al // 0 if src2 <= src1, 1 if src2 > src1
> > sbb $0, %al // 0 if src2 == src1, -1 if src2 < src1, 1 if src2 > src1
> > movsx %al, %eax // sign extend to %eax
> > ret
> >
> > Note that the output logic could have to be revisited, I'm not certain but
> > at first glance it looks valid.
>
> After thinking about this more, I think I'll drop the memcmp() patch
> because it will prevent optimization when comparing a small value.
>
> For example, without __asm__:
>
> memcmp(var, "abcd", 4);
>
> may compile to:
>
> cmpl $0x64636261, %reg
> ...something...
>
> But with __asm__, the compiler can't do that. Thus, it's not worth
> optimizing the memcmp() in this case.
Ah you're totally right!
Willy
* Re: [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`
2023-09-01 3:35 ` Willy Tarreau
@ 2023-09-01 7:27 ` Ammar Faizi
0 siblings, 0 replies; 29+ messages in thread
From: Ammar Faizi @ 2023-09-01 7:27 UTC (permalink / raw)
To: Willy Tarreau
Cc: Thomas Weißschuh, Nicholas Rosenberg,
Alviro Iskandar Setiawan, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
On Fri, Sep 01, 2023 at 05:35:08AM +0200, Willy Tarreau wrote:
> On Fri, Sep 01, 2023 at 10:24:42AM +0700, Ammar Faizi wrote:
> > After thinking about this more, I think I'll drop the memcmp() patch
> > because it will prevent optimization when comparing a small value.
> >
> > For example, without __asm__:
> >
> > memcmp(var, "abcd", 4);
> >
> > may compile to:
> >
> > cmpl $0x64636261, %reg
> > ...something...
> >
> > But with __asm__, the compiler can't do that. Thus, it's not worth
> > optimizing the memcmp() in this case.
>
> Ah you're totally right!
So, it turns out that assumption is wrong. The compiler cannot
optimize the current memcmp() into that. I just posted a question on SO:
https://stackoverflow.com/questions/77020562/what-prevents-the-compiler-from-optimizing-a-hand-written-memcmp
Given:
```
bool test_data(void *data)
{
return memcmp(data, "abcd", 4) == 0;
}
```
The result when using the default <string.h> memcmp() (good):
```
test_data:
cmpl $1684234849, (%rdi)
sete %al
ret
```
The result when using nolibc memcmp() (bad):
```
test_data:
cmpb $97, (%rdi)
jne .L5
cmpb $98, 1(%rdi)
jne .L5
cmpb $99, 2(%rdi)
jne .L5
cmpb $100, 3(%rdi)
sete %al
ret
.L5:
xorl %eax, %eax
ret
```
Link: https://godbolt.org/z/TT94r3bvf
This is because, apart from honoring the input length, the current nolibc
`memcmp()` must stop comparing as soon as it finds a non-matching byte.
Imagine what happens if we call:
```
char xstr[] = {'a', 'b', 'x'};
test_data(xstr);
```
In that case, a dword cmp may read past the end of xstr, and in
particular circumstances it can even lead to a segfault.
What the current nolibc memcmp() does from the C language view:
1) Compare one byte at a time.
2) Must stop comparing as soon as it finds a non-matching byte.
Because of point (2), the compiler is not allowed to optimize
nolibc memcmp() into wider loads; otherwise, it may hit a segfault.
That also means it cannot vectorize the memcmp() loop.
On the other hand, memcpy() and memset() don't have such a restriction,
so they can vectorize.
The real memcmp() assumes that both sources are at least `n` bytes in
size, allowing for wider loads. The current nolibc memcmp()
implementation doesn't reflect that assumption in the C code.
IOW, the real built-in memcmp() invokes undefined behavior for this code:
```
char x = 'q';
return memcmp(&x, "abcd", 4);
```
but the current nolibc memcmp() has well-defined behavior (well, it must
be, as that is what the C code expresses).
We can improve nolibc memcmp() by casting the sources to a wider type
like (ulong, uint, ushort). But that's another story for another RFC
patchset.
--
Ammar Faizi
* RE: [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`
2023-08-30 21:26 ` Willy Tarreau
2023-09-01 3:24 ` Ammar Faizi
@ 2023-09-04 8:26 ` David Laight
1 sibling, 0 replies; 29+ messages in thread
From: David Laight @ 2023-09-04 8:26 UTC (permalink / raw)
To: 'Willy Tarreau', Ammar Faizi
Cc: Thomas Weißschuh, Nicholas Rosenberg,
Alviro Iskandar Setiawan, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
From: Willy Tarreau
> Sent: 30 August 2023 22:27
>
> On Wed, Aug 30, 2023 at 08:57:24PM +0700, Ammar Faizi wrote:
> > Simplify memcmp() on the x86-64 arch.
> >
> > The x86-64 arch has a 'rep cmpsb' instruction, which can be used to
> > implement the memcmp() function.
> >
> > %rdi = source 1
> > %rsi = source 2
> > %rcx = length
> >
> > Signed-off-by: Ammar Faizi <[email protected]>
> > ---
> > tools/include/nolibc/arch-x86_64.h | 19 +++++++++++++++++++
> > tools/include/nolibc/string.h | 2 ++
> > 2 files changed, 21 insertions(+)
> >
> > diff --git a/tools/include/nolibc/arch-x86_64.h b/tools/include/nolibc/arch-x86_64.h
> > index 42f2674ad1ecdd64..6c1b54ba9f774e7b 100644
> > --- a/tools/include/nolibc/arch-x86_64.h
> > +++ b/tools/include/nolibc/arch-x86_64.h
> > @@ -214,4 +214,23 @@ __asm__ (
> > "retq\n"
> > );
> >
> > +#define NOLIBC_ARCH_HAS_MEMCMP
> > +static int memcmp(const void *s1, const void *s2, size_t n)
> > +{
> > + const unsigned char *p1 = s1;
> > + const unsigned char *p2 = s2;
> > +
> > + if (!n)
> > + return 0;
> > +
> > + __asm__ volatile (
> > + "rep cmpsb"
> > + : "+D"(p2), "+S"(p1), "+c"(n)
> > + : "m"(*(const unsigned char (*)[n])s1),
> > + "m"(*(const unsigned char (*)[n])s2)
> > + );
> > +
> > + return p1[-1] - p2[-1];
> > +}
>
> Out of curiosity, given that you implemented the 3 other ones directly
> in an asm statement, is there a particular reason this one mixes a bit
> of C and asm ? It would probably be something around this, in the same
> vein:
>
> memcmp:
> xchg %esi,%eax // source1
Aren't the arguments in %rdi, %rsi and %rdx, so you only
need to move the count (below)?
(Looks like you copied memchr().)
David
> mov %rdx,%rcx // count
> rep cmpsb // source2 in rdi; sets ZF on equal, CF if src1<src2
> seta %al // 0 if src2 <= src1, 1 if src2 > src1
> sbb $0, %al // 0 if src2 == src1, -1 if src2 < src1, 1 if src2 > src1
> movsx %al, %eax // sign extend to %eax
> ret
>
> Note that the output logic could have to be revisited, I'm not certain but
> at first glance it looks valid.
>
> Regards,
> Willy
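For what it's worth, the `seta`/`sbb` pair above is a branchless way to
turn the flags from the final byte comparison into -1/0/1: `seta` yields
1 when the first operand compares above, and `sbb $0` then subtracts the
carry flag, which is set when it compares below. A hypothetical C helper
showing the same mapping (the name is illustrative, not from the patch):

```c
/* Illustrative only: after "cmp a, b", seta gives (a > b) and
 * "sbb $0" subtracts the carry flag (a < b), leaving
 * (a > b) - (a < b), i.e. -1, 0 or 1, matching memcmp()'s sign. */
static int cmp_bytes(unsigned char a, unsigned char b)
{
	return (a > b) - (a < b);
}
```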
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
* [RFC PATCH v1 4/5] tools/nolibc: string: Remove the `_nolibc_memcpy_down()` function
2023-08-30 13:57 [RFC PATCH v1 0/5] nolibc x86-64 string functions Ammar Faizi
` (2 preceding siblings ...)
2023-08-30 13:57 ` [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()` Ammar Faizi
@ 2023-08-30 13:57 ` Ammar Faizi
2023-08-30 21:27 ` Willy Tarreau
2023-08-30 13:57 ` [RFC PATCH v1 5/5] tools/nolibc: string: Remove the `_nolibc_memcpy_up()` function Ammar Faizi
` (2 subsequent siblings)
6 siblings, 1 reply; 29+ messages in thread
From: Ammar Faizi @ 2023-08-30 13:57 UTC (permalink / raw)
To: Willy Tarreau, Thomas Weißschuh
Cc: Ammar Faizi, Nicholas Rosenberg, Alviro Iskandar Setiawan,
Michael William Jonathan, GNU/Weeb Mailing List,
Linux Kernel Mailing List
This nolibc internal function is not used. Delete it. It was probably
supposed to handle memmove(), but nowadays memmove() has its own
implementation.
Signed-off-by: Ammar Faizi <[email protected]>
---
tools/include/nolibc/string.h | 10 ----------
1 file changed, 10 deletions(-)
diff --git a/tools/include/nolibc/string.h b/tools/include/nolibc/string.h
index 3c941289d5dd0985..b0d7d8e143f61741 100644
--- a/tools/include/nolibc/string.h
+++ b/tools/include/nolibc/string.h
@@ -41,16 +41,6 @@ void *_nolibc_memcpy_up(void *dst, const void *src, size_t len)
return dst;
}
-static __attribute__((unused))
-void *_nolibc_memcpy_down(void *dst, const void *src, size_t len)
-{
- while (len) {
- len--;
- ((char *)dst)[len] = ((const char *)src)[len];
- }
- return dst;
-}
-
#ifndef NOLIBC_ARCH_HAS_MEMMOVE
/* might be ignored by the compiler without -ffreestanding, then found as
* missing.
--
Ammar Faizi
* Re: [RFC PATCH v1 4/5] tools/nolibc: string: Remove the `_nolibc_memcpy_down()` function
2023-08-30 13:57 ` [RFC PATCH v1 4/5] tools/nolibc: string: Remove the `_nolibc_memcpy_down()` function Ammar Faizi
@ 2023-08-30 21:27 ` Willy Tarreau
0 siblings, 0 replies; 29+ messages in thread
From: Willy Tarreau @ 2023-08-30 21:27 UTC (permalink / raw)
To: Ammar Faizi
Cc: Thomas Weißschuh, Nicholas Rosenberg,
Alviro Iskandar Setiawan, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
On Wed, Aug 30, 2023 at 08:57:25PM +0700, Ammar Faizi wrote:
> This nolibc internal function is not used. Delete it. It was probably
> supposed to handle memmove(), but nowadays memmove() has its own
> implementation.
Yeah, good catch for these two; I'm pretty sure as well that they were
used for memmove().
Thanks!
Willy
* [RFC PATCH v1 5/5] tools/nolibc: string: Remove the `_nolibc_memcpy_up()` function
2023-08-30 13:57 [RFC PATCH v1 0/5] nolibc x86-64 string functions Ammar Faizi
` (3 preceding siblings ...)
2023-08-30 13:57 ` [RFC PATCH v1 4/5] tools/nolibc: string: Remove the `_nolibc_memcpy_down()` function Ammar Faizi
@ 2023-08-30 13:57 ` Ammar Faizi
2023-08-30 21:32 ` [RFC PATCH v1 0/5] nolibc x86-64 string functions Willy Tarreau
2023-09-01 11:34 ` David Laight
6 siblings, 0 replies; 29+ messages in thread
From: Ammar Faizi @ 2023-08-30 13:57 UTC (permalink / raw)
To: Willy Tarreau, Thomas Weißschuh
Cc: Ammar Faizi, Nicholas Rosenberg, Alviro Iskandar Setiawan,
Michael William Jonathan, GNU/Weeb Mailing List,
Linux Kernel Mailing List
This function is only called by memcpy(); there is no real reason to
keep this wrapper. Delete it and move the code into memcpy()
directly.
Signed-off-by: Ammar Faizi <[email protected]>
---
tools/include/nolibc/string.h | 20 +++++++-------------
1 file changed, 7 insertions(+), 13 deletions(-)
diff --git a/tools/include/nolibc/string.h b/tools/include/nolibc/string.h
index b0d7d8e143f61741..920a0a6c922dbc42 100644
--- a/tools/include/nolibc/string.h
+++ b/tools/include/nolibc/string.h
@@ -29,18 +29,6 @@ int memcmp(const void *s1, const void *s2, size_t n)
}
#endif /* #ifndef NOLIBC_ARCH_HAS_MEMCMP */
-static __attribute__((unused))
-void *_nolibc_memcpy_up(void *dst, const void *src, size_t len)
-{
- size_t pos = 0;
-
- while (pos < len) {
- ((char *)dst)[pos] = ((const char *)src)[pos];
- pos++;
- }
- return dst;
-}
-
#ifndef NOLIBC_ARCH_HAS_MEMMOVE
/* might be ignored by the compiler without -ffreestanding, then found as
* missing.
@@ -72,7 +60,13 @@ void *memmove(void *dst, const void *src, size_t len)
__attribute__((weak,unused,section(".text.nolibc_memcpy")))
void *memcpy(void *dst, const void *src, size_t len)
{
- return _nolibc_memcpy_up(dst, src, len);
+ size_t pos = 0;
+
+ while (pos < len) {
+ ((char *)dst)[pos] = ((const char *)src)[pos];
+ pos++;
+ }
+ return dst;
}
#endif /* #ifndef NOLIBC_ARCH_HAS_MEMCPY */
--
Ammar Faizi
* Re: [RFC PATCH v1 0/5] nolibc x86-64 string functions
2023-08-30 13:57 [RFC PATCH v1 0/5] nolibc x86-64 string functions Ammar Faizi
` (4 preceding siblings ...)
2023-08-30 13:57 ` [RFC PATCH v1 5/5] tools/nolibc: string: Remove the `_nolibc_memcpy_up()` function Ammar Faizi
@ 2023-08-30 21:32 ` Willy Tarreau
2023-09-01 11:34 ` David Laight
6 siblings, 0 replies; 29+ messages in thread
From: Willy Tarreau @ 2023-08-30 21:32 UTC (permalink / raw)
To: Ammar Faizi
Cc: Thomas Weißschuh, Nicholas Rosenberg,
Alviro Iskandar Setiawan, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
Hi Ammar,
On Wed, Aug 30, 2023 at 08:57:21PM +0700, Ammar Faizi wrote:
> Hi Willy,
>
> This is an RFC patchset for nolibc x86-64 string functions. There are 5
> patches in this series.
>
> ## Patch 1-3: Use `rep movsb`, `rep stosb`, and `rep cmpsb` for:
> - memcpy() and memmove()
> - memset()
> - memcmp()
> respectively. They can simplify the generated ASM code.
>
> Patch 4 and 5 are not related, just a small cleanup.
So overall I'm fine with this, I think it's reasonable. As you said
we're not trying to chase the very last byte, but for such functions
it's also nice if they can remain small. Some of them might even
benefit from being inlined by the way (in this case they'd rather
move to C functions with an asm() statement), because the call
instruction and the register moves or spilling code will generally
be larger than the functions themselves. That might be worth checking.
Ah no, we cannot because some of them are called from libgcc and friends.
Or we may need to mark them inline and weak without static, I'm not sure
how well that works.
Please just let me know if you intend to change a few things based on
previous comments, and also this memcmp() stuff that's both C and asm.
Thanks!
Willy
* RE: [RFC PATCH v1 0/5] nolibc x86-64 string functions
2023-08-30 13:57 [RFC PATCH v1 0/5] nolibc x86-64 string functions Ammar Faizi
` (5 preceding siblings ...)
2023-08-30 21:32 ` [RFC PATCH v1 0/5] nolibc x86-64 string functions Willy Tarreau
@ 2023-09-01 11:34 ` David Laight
2023-09-01 11:46 ` Willy Tarreau
6 siblings, 1 reply; 29+ messages in thread
From: David Laight @ 2023-09-01 11:34 UTC (permalink / raw)
To: 'Ammar Faizi', Willy Tarreau, Thomas Weißschuh
Cc: Nicholas Rosenberg, Alviro Iskandar Setiawan,
Michael William Jonathan, GNU/Weeb Mailing List,
Linux Kernel Mailing List
From: Ammar Faizi <[email protected]>
> Sent: 30 August 2023 14:57
>
> This is an RFC patchset for nolibc x86-64 string functions. There are 5
> patches in this series.
>
> ## Patch 1-3: Use `rep movsb`, `rep stosb`, and `rep cmpsb` for:
> - memcpy() and memmove()
> - memset()
> - memcmp()
> respectively. They can simplify the generated ASM code.
>
...
> After this series:
> ```
> 000000000000140a <memmove>:
> 140a: 48 89 f8 mov %rdi,%rax
> 140d: 48 89 d1 mov %rdx,%rcx
> 1410: 48 8d 7c 0f ff lea -0x1(%rdi,%rcx,1),%rdi
> 1415: 48 8d 74 0e ff lea -0x1(%rsi,%rcx,1),%rsi
> 141a: fd std
> 141b: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
> 141d: fc cld
> 141e: c3 ret
Isn't that completely broken?
You need to select between forwards and backwards moves.
Since forwards moves are preferred, it is best to do
if (dst - src < len)
    backwards_copy()
else
    forwards_copy()
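Spelled out as a C sketch (assumed logic, not the eventual patch): the
unsigned pointer difference wraps around when dst is below src, so a
single compare routes both that case and the non-overlapping case to the
preferred forward copy, and only a destructively-overlapping range takes
the backward path:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the selection logic above (not the eventual patch): the
 * unsigned difference dst - src wraps when dst < src, so one compare
 * sends both "dst below src" and "no overlap" to the preferred
 * forward copy; only src < dst < src + len picks the backward copy. */
static void *memmove_sketch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if ((uintptr_t)d - (uintptr_t)s < (uintptr_t)len) {
		while (len--)			/* backward copy */
			d[len] = s[len];
	} else {
		size_t i;

		for (i = 0; i < len; i++)	/* forward copy */
			d[i] = s[i];
	}
	return dst;
}
```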
David
* Re: [RFC PATCH v1 0/5] nolibc x86-64 string functions
2023-09-01 11:34 ` David Laight
@ 2023-09-01 11:46 ` Willy Tarreau
2023-09-01 13:06 ` Ammar Faizi
0 siblings, 1 reply; 29+ messages in thread
From: Willy Tarreau @ 2023-09-01 11:46 UTC (permalink / raw)
To: David Laight
Cc: 'Ammar Faizi', Thomas Weißschuh, Nicholas Rosenberg,
Alviro Iskandar Setiawan, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
On Fri, Sep 01, 2023 at 11:34:18AM +0000, David Laight wrote:
> From: Ammar Faizi <[email protected]>
> > Sent: 30 August 2023 14:57
> >
> > This is an RFC patchset for nolibc x86-64 string functions. There are 5
> > patches in this series.
> >
> > ## Patch 1-3: Use `rep movsb`, `rep stosb`, and `rep cmpsb` for:
> > - memcpy() and memmove()
> > - memset()
> > - memcmp()
> > respectively. They can simplify the generated ASM code.
> >
> ...
> > After this series:
> > ```
> > 000000000000140a <memmove>:
> > 140a: 48 89 f8 mov %rdi,%rax
> > 140d: 48 89 d1 mov %rdx,%rcx
> > 1410: 48 8d 7c 0f ff lea -0x1(%rdi,%rcx,1),%rdi
> > 1415: 48 8d 74 0e ff lea -0x1(%rsi,%rcx,1),%rsi
> > 141a: fd std
> > 141b: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
> > 141d: fc cld
> > 141e: c3 ret
>
> Isn't that completely broken?
>
> You need to select between forwards and backwards moves.
> Since forwards moves are preferred, it is best to do
> if (dst - src < len)
>     backwards_copy()
> else
>     forwards_copy()
>
> David
You're completely right indeed, reminds me of the copy_up/copy_down
that were not used anymore :-)
Willy
* Re: [RFC PATCH v1 0/5] nolibc x86-64 string functions
2023-09-01 11:46 ` Willy Tarreau
@ 2023-09-01 13:06 ` Ammar Faizi
2023-09-01 14:23 ` David Laight
0 siblings, 1 reply; 29+ messages in thread
From: Ammar Faizi @ 2023-09-01 13:06 UTC (permalink / raw)
To: Willy Tarreau
Cc: David Laight, Thomas Weißschuh, Nicholas Rosenberg,
Alviro Iskandar Setiawan, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
On Fri, Sep 01, 2023 at 01:46:44PM +0200, Willy Tarreau wrote:
> On Fri, Sep 01, 2023 at 11:34:18AM +0000, David Laight wrote:
> > Isn't that completely broken?
> >
> > You need to select between forwards and backwards moves.
> > Since forwards moves are preferred, it is best to do
> > if (dst - src < len)
> >     backwards_copy()
> > else
> >     forwards_copy()
> >
> > David
>
> You're completely right indeed, reminds me about the copy_up/copy_down
> that were not used anymore :-)
I'm an idiot, will fix that. Another attempt as suggested below:
__asm__ (
".section .text.nolibc_memmove\n"
".weak memmove\n"
"memmove:\n"
" movq %rdx, %rcx\n"
" movq %rdi, %rdx\n"
" movq %rdi, %rax\n"
" subq %rsi, %rdx\n"
" cmpq %rcx, %rdx\n"
" jnb .Lforward_copy\n"
" leaq -1(%rdi, %rcx, 1), %rdi\n"
" leaq -1(%rsi, %rcx, 1), %rsi\n"
" std\n"
".Lforward_copy:\n"
" rep movsb\n"
" cld\n"
" ret\n"
);
--
Ammar Faizi
* RE: [RFC PATCH v1 0/5] nolibc x86-64 string functions
2023-09-01 13:06 ` Ammar Faizi
@ 2023-09-01 14:23 ` David Laight
2023-09-01 14:41 ` Ammar Faizi
0 siblings, 1 reply; 29+ messages in thread
From: David Laight @ 2023-09-01 14:23 UTC (permalink / raw)
To: 'Ammar Faizi', Willy Tarreau
Cc: Thomas Weißschuh, Nicholas Rosenberg,
Alviro Iskandar Setiawan, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
From: Ammar Faizi
> Sent: 01 September 2023 14:06
...
> > You're completely right indeed, reminds me about the copy_up/copy_down
> > that were not used anymore :-)
>
> I'm an idiot, will fix that. Another attempt as suggested below:
>
> __asm__ (
> ".section .text.nolibc_memmove\n"
> ".weak memmove\n"
> "memmove:\n"
> " movq %rdx, %rcx\n"
> " movq %rdi, %rdx\n"
> " movq %rdi, %rax\n"
You seem to have confused yourself about whether you are using %eax or %edx.
> " subq %rsi, %rdx\n"
> " cmpq %rcx, %rdx\n"
> " jnb .Lforward_copy\n"
I think I'd fall through to the forwards copy
and not worry about replicating the 'rep movsb' and 'ret'.
IIRC 'cld' can be slow as well.
> " leaq -1(%rdi, %rcx, 1), %rdi\n"
> " leaq -1(%rsi, %rcx, 1), %rsi\n"
> " std\n"
> ".Lforward_copy:\n"
> " rep movsb\n"
> " cld\n"
> " ret\n"
> );
>
> --
> Ammar Faizi
David
* Re: [RFC PATCH v1 0/5] nolibc x86-64 string functions
2023-09-01 14:23 ` David Laight
@ 2023-09-01 14:41 ` Ammar Faizi
2023-09-01 14:54 ` David Laight
0 siblings, 1 reply; 29+ messages in thread
From: Ammar Faizi @ 2023-09-01 14:41 UTC (permalink / raw)
To: David Laight
Cc: Willy Tarreau, Thomas Weißschuh, Nicholas Rosenberg,
Alviro Iskandar Setiawan, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
On Fri, Sep 01, 2023 at 02:23:28PM +0000, David Laight wrote:
> From: Ammar Faizi
> > Sent: 01 September 2023 14:06
...
> > __asm__ (
> > ".section .text.nolibc_memmove\n"
> > ".weak memmove\n"
> > "memmove:\n"
> > " movq %rdx, %rcx\n"
> > " movq %rdi, %rdx\n"
> > " movq %rdi, %rax\n"
>
> You seem to have confused yourself about whether you are using %eax or %edx.
What do you mean? They're all 64-bit pointers.
What I know is that %rdx will be clobbered by the "subq %rsi, %rdx"
below, and %rax should hold the return value. That's why I copy %rdi
twice. memmove() returns the dst pointer. Did I miss something?
> > " subq %rsi, %rdx\n"
> > " cmpq %rcx, %rdx\n"
> > " jnb .Lforward_copy\n"
>
> I think I'd fall through to the forwards copy
> and not worry about replicating the 'reps movsb' and 'ret'.
> IIRC 'cld' can be slow as well.
Alright, I will avoid cld for the forward copy.
> > " leaq -1(%rdi, %rcx, 1), %rdi\n"
> > " leaq -1(%rsi, %rcx, 1), %rsi\n"
> > " std\n"
> > ".Lforward_copy:\n"
> > " rep movsb\n"
> > " cld\n"
> > " ret\n"
> > );
--
Ammar Faizi
* RE: [RFC PATCH v1 0/5] nolibc x86-64 string functions
2023-09-01 14:41 ` Ammar Faizi
@ 2023-09-01 14:54 ` David Laight
2023-09-01 15:20 ` Ammar Faizi
0 siblings, 1 reply; 29+ messages in thread
From: David Laight @ 2023-09-01 14:54 UTC (permalink / raw)
To: 'Ammar Faizi'
Cc: Willy Tarreau, Thomas Weißschuh, Nicholas Rosenberg,
Alviro Iskandar Setiawan, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
From: Ammar Faizi
> Sent: 01 September 2023 15:42
...
> > > " movq %rdx, %rcx\n"
> > > " movq %rdi, %rdx\n"
> > > " movq %rdi, %rax\n"
> >
> > You seem to have confused yourself about whether you are using %eax or %edx.
>
> What do you mean? They're all 64-bit pointers.
%ax, %eax, %rax - what is the difference :-)
> What I know is that the %rdx will be clobbered by "subq %rsi, %rdx"
> below and the %rax should be return value. That's why I copy the %rdi
> twice. memmove() returns the dst pointer. Did I miss something?
I'd forgotten about the (stupid) return value.
I'm pretty sure it is an accident from the original pdp-11
implementation from the days before C had an explicit 'return'
statement.
(The pdp-11 I used ran RSX/11M, so it had a Fortran compiler,
not a C one.)
David
* Re: [RFC PATCH v1 0/5] nolibc x86-64 string functions
2023-09-01 14:54 ` David Laight
@ 2023-09-01 15:20 ` Ammar Faizi
0 siblings, 0 replies; 29+ messages in thread
From: Ammar Faizi @ 2023-09-01 15:20 UTC (permalink / raw)
To: David Laight
Cc: Willy Tarreau, Thomas Weißschuh, Nicholas Rosenberg,
Alviro Iskandar Setiawan, Michael William Jonathan,
GNU/Weeb Mailing List, Linux Kernel Mailing List
On Fri, Sep 01, 2023 at 02:54:56PM +0000, David Laight wrote:
> I'd forgotten about the (stupid) return value.
>
> I'm pretty sure it is an accident from the original pdp-11
> implementation from the days before C had an explicit 'return'
> statement.
> (The pdp-11 I used ran RSX/11M - so had a Fortran compiler
> not a C one.)
You're old. I did not exist in that era. And my parents were still young :-)
--
Ammar Faizi