glibc/sysdeps
Noah Goldstein 12fec8aae5 x86/string: Fixup alignment of main loop in str{n}cmp-evex [BZ #32212]
The loop should be aligned to 32 bytes so that it can ideally run out
of the DSB (the decoded-uop cache). This is particularly important on
Skylake-Server, where deficiencies in its DSB implementation make it
prone to not being able to run loops out of the DSB.

For example, running strcmp-evex on a 200MB string:

32-byte aligned loop:
    - 43,399,578,766      idq.dsb_uops
not 32-byte aligned loop:
    - 6,060,139,704       idq.dsb_uops

This results in a 25% performance degradation for the non-aligned
version.

The fix is simply to ensure the code layout keeps the loop aligned
(which was previously the case but was accidentally dropped in
84e7c46df).

NB: The fix actually uses 64-byte alignment. This is because 64-byte
alignment generally produces more stable performance than 32-byte
aligned code (cache line crosses can affect perf), so if we are going
past 16-byte alignment, we might as well go to 64. 64-byte alignment
also matches most other functions we over-align, so it creates a
common point of optimization.
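
As a rough illustration of the alignment arithmetic discussed above (not
the assembly from this patch), the sketch below computes how much padding
would be needed to move a code offset to the next aligned boundary. The
function name and the example offset are hypothetical.

```python
def padding_to_align(offset: int, alignment: int) -> int:
    """Return the bytes of padding needed to move `offset` up to the
    next multiple of `alignment` (0 if it is already aligned)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (-offset) % alignment


# Hypothetical loop-entry offset, chosen only for illustration.
loop_offset = 0x1234
pad = padding_to_align(loop_offset, 64)
aligned = loop_offset + pad

# A 64-byte aligned address is also 32-byte aligned, so over-aligning
# to 64 still satisfies the 32-byte requirement described above.
print(pad, hex(aligned), aligned % 32 == 0)
```
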

Times are reported as ratio of Time_With_Patch /
Time_Without_Patch. Lower is better.

The value reported is the geometric mean of the ratios across all
tests in bench-strcmp and bench-strncmp.
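
A minimal sketch of how such a summary number could be computed, assuming
per-test timings are available as two equal-length lists. The function
name and the sample timings are made up for illustration and are not
output from bench-strcmp.

```python
import math


def geomean_ratio(with_patch, without_patch):
    """Geometric mean of Time_With_Patch / Time_Without_Patch per test."""
    if len(with_patch) != len(without_patch) or not with_patch:
        raise ValueError("need two non-empty lists of equal length")
    # Average the logs of the per-test ratios, then exponentiate.
    logs = [math.log(w / wo) for w, wo in zip(with_patch, without_patch)]
    return math.exp(math.fsum(logs) / len(logs))


# Hypothetical timings for three tests (arbitrary units).
patched = [90.0, 100.0, 110.0]
baseline = [100.0, 100.0, 100.0]
print(round(geomean_ratio(patched, baseline), 3))
```

A result below 1.0 means the patched version is faster on average; a
result of 1.026 would correspond to a 2.6% slowdown.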

Note this patch is only attempting to improve the Skylake-Server
strcmp for long strings. The rest of the numbers are only to test for
regressions.

Tigerlake Results Strings <= 512:
    strcmp : 1.026
    strncmp: 0.949

Tigerlake Results Strings > 512:
    strcmp : 0.994
    strncmp: 0.998

Skylake-Server Results Strings <= 512:
    strcmp : 0.945
    strncmp: 0.943

Skylake-Server Results Strings > 512:
    strcmp : 0.778
    strncmp: 1.000

The 2.6% regression on TGL-strcmp is due to slowdowns caused by
changes in alignment of code handling small sizes (mostly in the
page-cross logic). These should be safe to ignore because 1) we
previously only 16-byte aligned the function, so this behavior is not
new and was essentially up to chance before this patch, and 2) this
type of alignment-related regression on small sizes really only comes
up in tight micro-benchmark loops and is unlikely to have any effect
on real-world performance.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 483443d321)
2025-01-09 17:23:28 -08:00
..
aarch64 AArch64: Check kernel version for SVE ifuncs 2024-04-10 14:03:08 +01:00
alpha configure: Use autoconf 2.71 2023-07-17 10:08:10 -04:00
arc login: Check default sizes of structs utmp, utmpx, lastlog 2024-04-19 18:38:23 +02:00
arm login: structs utmp, utmpx, lastlog _TIME_BITS independence (bug 30701) 2024-04-19 18:38:24 +02:00
csky login: structs utmp, utmpx, lastlog _TIME_BITS independence (bug 30701) 2024-04-19 18:38:24 +02:00
generic elf: Fix slow tls access after dlopen [BZ #19924] 2025-01-09 07:31:25 -08:00
gnu configure: Use autoconf 2.71 2023-07-17 10:08:10 -04:00
hppa login: Check default sizes of structs utmp, utmpx, lastlog 2024-04-19 18:38:23 +02:00
htl
hurd
i386 i386: ulp update for SSE2 --disable-multi-arch configurations 2024-04-25 13:07:19 +02:00
ia64 configure: Use autoconf 2.71 2023-07-17 10:08:10 -04:00
ieee754 x86_64: Add log1p with FMA 2025-01-09 07:31:04 -08:00
loongarch LoongArch: Correct {__ieee754, _}_scalb -> {__ieee754, _}_scalbf 2024-03-22 09:25:39 +08:00
m68k login: structs utmp, utmpx, lastlog _TIME_BITS independence (bug 30701) 2024-04-19 18:38:24 +02:00
mach malloc: Use __get_nprocs on arena_get2 (BZ 30945) 2024-02-12 09:53:27 -03:00
microblaze login: structs utmp, utmpx, lastlog _TIME_BITS independence (bug 30701) 2024-04-19 18:38:24 +02:00
mips login: structs utmp, utmpx, lastlog _TIME_BITS independence (bug 30701) 2024-04-19 18:38:24 +02:00
nios2 login: structs utmp, utmpx, lastlog _TIME_BITS independence (bug 30701) 2024-04-19 18:38:24 +02:00
nptl Linux: Make __rseq_size useful for feature detection (bug 31965) 2024-07-16 16:37:38 +02:00
or1k login: Check default sizes of structs utmp, utmpx, lastlog 2024-04-19 18:38:23 +02:00
posix getaddrinfo: translate ENOMEM to EAI_MEMORY (bug 31163) 2024-01-02 14:37:02 +01:00
powerpc login: structs utmp, utmpx, lastlog _TIME_BITS independence (bug 30701) 2024-04-19 18:38:24 +02:00
pthread Add crt1-2.0.o for glibc 2.0 compatibility tests 2024-10-01 10:33:51 +08:00
riscv login: Check default sizes of structs utmp, utmpx, lastlog 2024-04-19 18:38:23 +02:00
s390 s390x: Fix segfault in wcsncmp [BZ #31934] 2024-07-16 10:26:44 +02:00
sh login: structs utmp, utmpx, lastlog _TIME_BITS independence (bug 30701) 2024-04-19 18:38:24 +02:00
sparc login: structs utmp, utmpx, lastlog _TIME_BITS independence (bug 30701) 2024-04-19 18:38:24 +02:00
unix nptl: initialize cpu_id_start prior to rseq registration 2024-12-06 16:01:45 +00:00
wordsize-32
wordsize-64
x86 sysdeps/x86/Makefile: Split and sort tests 2025-01-09 17:23:28 -08:00
x86_64 x86/string: Fixup alignment of main loop in str{n}cmp-evex [BZ #32212] 2025-01-09 17:23:28 -08:00