History log of /haiku/src/system/libroot/posix/string/arch/x86_64/arch_string.cpp
Revision Date Author Comments
# 43005c19 24-May-2019 Augustin Cavalier <waddlesplash@gmail.com>

libroot: Include emmintrin instead of x86intrin.

We only need one builtin here, not 8 files of them.
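For illustration, here is a minimal sketch of SSE2 code that builds with only <emmintrin.h>. The log does not say which builtin the file actually uses, so the helper name copy16 and the choice of the unaligned load/store intrinsics are assumptions, not the contents of arch_string.cpp.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 only: __m128i, _mm_loadu_si128, _mm_storeu_si128
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper (not from the Haiku sources): copies exactly 16 bytes.
// Neither pointer needs to be 16-byte aligned.
void
copy16(void* dest, const void* src)
{
#ifdef SKETCH_HAVE_SSE2
    __m128i value = _mm_loadu_si128(static_cast<const __m128i*>(src));
    _mm_storeu_si128(static_cast<__m128i*>(dest), value);
#else
    // Portable fallback so the sketch still builds on non-x86 targets.
    for (size_t i = 0; i < 16; i++)
        static_cast<uint8_t*>(dest)[i] = static_cast<const uint8_t*>(src)[i];
#endif
}
```

Everything used above is declared by <emmintrin.h> alone, so the wider <x86intrin.h> umbrella header is not required for code of this shape.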


# 8088f452 03-Jun-2018 Jérôme Duval <jerome.duval@gmail.com>

efi: fix loader build.


# 4582b6e3 10-Sep-2014 Paweł Dziepak <pdziepak@quarnos.org>

libroot/x86_64: new memcpy implementation

This patch introduces a new memcpy() implementation that improves
performance when the buffer is small. It was written for processors that
support ERMSB, but performs reasonably well on older CPUs as well.

The following benchmarks were run on a Haswell i7 running Debian Jessie
with Linux 3.16.1. In each iteration a 64MB buffer was copied; the
parameter "size" is the size of the buffer passed in a single call (i.e.
for "size: 2" memcpy() was called ~32 million times to copy the whole
64MB).

f - original implementation, g - new implementation, all buffers 16 byte
aligned

cpy, size: 8, f: 79971 µs, g: 20419 µs, ∆: 74.47%
cpy, size: 32, f: 42068 µs, g: 12159 µs, ∆: 71.10%
cpy, size: 128, f: 13408 µs, g: 10359 µs, ∆: 22.74%
cpy, size: 512, f: 10634 µs, g: 10433 µs, ∆: 1.89%
cpy, size: 1024, f: 10474 µs, g: 10536 µs, ∆: -0.59%
cpy, size: 4096, f: 9419 µs, g: 8630 µs, ∆: 8.38%

f - glibc 2.19 implementation, g - new implementation, all buffers 16 byte
aligned

cpy, size: 8, f: 26299 µs, g: 20919 µs, ∆: 20.46%
cpy, size: 32, f: 11146 µs, g: 12159 µs, ∆: -9.09%
cpy, size: 128, f: 10778 µs, g: 10354 µs, ∆: 3.93%
cpy, size: 512, f: 12291 µs, g: 10426 µs, ∆: 15.17%
cpy, size: 1024, f: 13923 µs, g: 10571 µs, ∆: 24.08%
cpy, size: 4096, f: 11770 µs, g: 8671 µs, ∆: 26.33%

f - glibc 2.19 implementation, g - new implementation, all buffers unaligned

cpy, size: 16, f: 13376 µs, g: 13009 µs, ∆: 2.74%
cpy, size: 32, f: 11130 µs, g: 12171 µs, ∆: -9.35%
cpy, size: 64, f: 11017 µs, g: 11231 µs, ∆: -1.94%
cpy, size: 128, f: 10884 µs, g: 10407 µs, ∆: 4.38%
cpy, size: 256, f: 10826 µs, g: 10106 µs, ∆: 6.65%
cpy, size: 512, f: 12354 µs, g: 10396 µs, ∆: 15.85%

Signed-off-by: Paweł Dziepak <pdziepak@quarnos.org>
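The benchmark method described in the commit above (copy a large buffer in many calls of "size" bytes each, then compare timings) can be sketched roughly as follows. The original harness is not part of this log, so the function names and the use of std::chrono are assumptions. The ∆ column is consistent with (f - g) / f expressed in percent, which is what delta_percent computes.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Relative time saved by g compared to f, in percent: (f - g) / f * 100.
double
delta_percent(double fMicros, double gMicros)
{
    return (fMicros - gMicros) / fMicros * 100.0;
}

// Hypothetical harness: copies `total` bytes from src to dst using calls of
// `size` bytes each and returns the elapsed time in microseconds.
template<typename CopyFunction>
int64_t
time_chunked_copy(CopyFunction copy, uint8_t* dst, const uint8_t* src,
    size_t total, size_t size)
{
    auto start = std::chrono::steady_clock::now();
    for (size_t offset = 0; offset + size <= total; offset += size)
        copy(dst + offset, src + offset, size);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(
        end - start).count();
}
```

To approximate the setup in the log, call time_chunked_copy once per implementation with total set to 64 * 1024 * 1024 and the desired size, then pass the two results to delta_percent. A real measurement would also need warm-up runs and care that the compiler does not optimize the copies away.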


# 1d7b716f 07-Sep-2014 Paweł Dziepak <pdziepak@quarnos.org>

libroot/x86_64: new memset implementation

This patch introduces a new memset() implementation that improves
performance when the buffer is small. It was written for processors that
support ERMSB, but performs reasonably well on older CPUs as well.

The following benchmarks were run on a Haswell i7 running Debian Jessie
with Linux 3.16.1. In each iteration a 64MB buffer was memset()ed; the
parameter "size" is the size of the buffer passed in a single call (i.e.
for "size: 2" memset() was called ~32 million times to memset the whole
64MB).

f - original implementation, g - new implementation, all buffers 16 byte
aligned

set, size: 8, f: 66885 µs, g: 17768 µs, ∆: 73.44%
set, size: 32, f: 17123 µs, g: 9163 µs, ∆: 46.49%
set, size: 128, f: 6677 µs, g: 6919 µs, ∆: -3.62%
set, size: 512, f: 11656 µs, g: 7715 µs, ∆: 33.81%
set, size: 1024, f: 9156 µs, g: 7359 µs, ∆: 19.63%
set, size: 4096, f: 4936 µs, g: 5159 µs, ∆: -4.52%

f - glibc 2.19 implementation, g - new implementation, all buffers 16 byte
aligned

set, size: 8, f: 19631 µs, g: 17828 µs, ∆: 9.18%
set, size: 32, f: 8545 µs, g: 9047 µs, ∆: -5.87%
set, size: 128, f: 8304 µs, g: 6874 µs, ∆: 17.22%
set, size: 512, f: 7373 µs, g: 7486 µs, ∆: -1.53%
set, size: 1024, f: 9007 µs, g: 7344 µs, ∆: 18.46%
set, size: 4096, f: 8169 µs, g: 5146 µs, ∆: 37.01%

Apparently, glibc uses SSE even for large buffers and therefore does not
take advantage of ERMSB:

set, size: 16384, f: 7007 µs, g: 3223 µs, ∆: 54.00%
set, size: 32768, f: 6979 µs, g: 2930 µs, ∆: 58.02%
set, size: 65536, f: 6907 µs, g: 2826 µs, ∆: 59.08%
set, size: 131072, f: 6919 µs, g: 2752 µs, ∆: 60.23%

The new implementation handles unaligned buffers quite well:

f - glibc 2.19 implementation, g - new implementation, all buffers unaligned

set, size: 16, f: 10045 µs, g: 10498 µs, ∆: -4.51%
set, size: 32, f: 8590 µs, g: 9358 µs, ∆: -8.94%
set, size: 64, f: 8618 µs, g: 8585 µs, ∆: 0.38%
set, size: 128, f: 8393 µs, g: 6893 µs, ∆: 17.87%
set, size: 256, f: 8042 µs, g: 7621 µs, ∆: 5.24%
set, size: 512, f: 9661 µs, g: 7738 µs, ∆: 19.90%

Signed-off-by: Paweł Dziepak <pdziepak@quarnos.org>
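One common way to get the profile described in the commit above (cheap handling of small buffers, "rep stosb" for large ones on ERMSB hardware) is sketched below. This is not the code from the commit: the name sketch_memset, the size thresholds, and the overlapping-store trick are all assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fills `count` bytes with `value` using "rep stosb" on x86_64 with GCC or
// Clang; falls back to a byte loop elsewhere.
static void
fill_rep_stosb(uint8_t* dest, uint8_t value, size_t count)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ __volatile__("rep stosb"
        : "+D"(dest), "+c"(count)
        : "a"(value)
        : "memory");
#else
    while (count-- > 0)
        *dest++ = value;
#endif
}

// Hypothetical memset sketch; thresholds are illustrative, not tuned.
void*
sketch_memset(void* dest, int value, size_t count)
{
    uint8_t* out = static_cast<uint8_t*>(dest);
    const uint8_t byte = static_cast<uint8_t>(value);

    if (count < 8) {
        // Tiny buffers: a plain byte loop.
        for (size_t i = 0; i < count; i++)
            out[i] = byte;
        return dest;
    }

    if (count <= 16) {
        // 8 to 16 bytes: two 8-byte stores that may overlap in the middle.
        const uint64_t pattern = 0x0101010101010101ULL * byte;
        std::memcpy(out, &pattern, sizeof(pattern));
        std::memcpy(out + count - 8, &pattern, sizeof(pattern));
        return dest;
    }

    // Larger buffers: let the string instruction do the work.
    fill_rep_stosb(out, byte, count);
    return dest;
}
```

The overlapping stores avoid a loop and any alignment branches for mid-sized buffers, which is where per-call overhead dominates in benchmarks like the ones listed above.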


# f2f91078 06-Sep-2014 Paweł Dziepak <pdziepak@quarnos.org>

kernel/x86_64: remove memset and memcpy from commpage

There is absolutely no reason for these functions to be in commpage;
they don't do anything that involves the kernel in any way.

Additionally, this patch rewrites memset and memcpy in C++. The current
implementation is quite simple (though it may perform surprisingly
well when dealing with large buffers on CPUs with ERMSB). Better
versions are coming soon.

Signed-off-by: Paweł Dziepak <pdziepak@quarnos.org>
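A "quite simple" C++ memcpy that still benefits from ERMSB on large buffers could look like the sketch below, which hands the whole copy to "rep movsb". The log does not show the code from this commit, so treat the function name simple_memcpy and the inline assembly as assumptions rather than the actual Haiku implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical simple memcpy: a single "rep movsb" on x86_64 with GCC or
// Clang, and a byte loop on other targets. Like memcpy(), the buffers must
// not overlap.
void*
simple_memcpy(void* dest, const void* src, size_t count)
{
    void* result = dest;
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    // RDI = destination, RSI = source, RCX = byte count; all three are
    // modified by the instruction, hence the read-write constraints.
    __asm__ __volatile__("rep movsb"
        : "+D"(dest), "+S"(src), "+c"(count)
        :
        : "memory");
#else
    uint8_t* out = static_cast<uint8_t*>(dest);
    const uint8_t* in = static_cast<const uint8_t*>(src);
    while (count-- > 0)
        *out++ = *in++;
#endif
    return result;
}
```

On CPUs that advertise ERMSB the microcode behind "rep movsb" is optimized for long runs, but the instruction has a fixed startup cost, which is why an implementation this simple tends to do poorly on small buffers.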

