The support for lock elision was already deprecated with glibc 2.42:
commit 77438db8cf
"Mark support for lock elision as deprecated."
See also discussions:
https://sourceware.org/pipermail/libc-alpha/2025-July/168492.html
This patch removes the architecture specific support for lock elision
for x86, powerpc and s390 by removing the elision-conf.h, elision-conf.c,
elision-lock.c, elision-timed.c, elision-unlock.c, elide.h, htm.h/hle.h files.
Those generic files are also removed.
The architecture specific structures are adjusted and the elision fields are
marked as unused. See struct_mutex.h files.
Furthermore in struct_rwlock.h, the leftover __rwelision was also removed.
Those were originally removed with commit 0377a7fde6
"nptl: Remove rwlock elision definitions"
and by chance reintroduced with commit 7df8af43ad
"nptl: Add struct_rwlock.h"
The common code (e.g. the pthread_mutex-files) are changed back to the time
before lock elision was introduced with the x86-support:
- commit 1cdbe57948
"Add the low level infrastructure for pthreads lock elision with TSX"
- commit b023e4ca99
"Add new internal mutex type flags for elision."
- commit 68cc29355f
"Add minimal test suite changes for elision enabled kernels"
- commit e8c659d74e
"Add elision to pthread_mutex_{try,timed,un}lock"
- commit 49186d21ef
"Disable elision for any pthread_mutexattr_settype call"
- commit 1717da59ae
"Add a configure option to enable lock elision and disable by default"
Elision is removed also from the tunables, the initialization part, the
pretty-printers and the manual.
Some extra handling in the testsuite is removed as well as the full tst-mutex10
testcase, which tested a race while enabling lock elision.
I've also searched the code for "elision", "elide", "transaction" and e.g.
cleaned some comments.
I've run the testsuite on x86_64 and s390x and run the build-many-glibcs.py
script.
Thanks to Sachin Monga, this patch is also tested on powerpc.
A NEWS entry also mentions the removal.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The 53807741fb added a configure check
for 64-bit atomic operations that were not previously enabled on some
32-bit ABIs.
However, the NPTL semaphore code casts a sem_t to a new_sem and issues
a 64-bit atomic operation for __HAVE_64B_ATOMICS. Since sem_t has
32-bit alignment on 32-bit architectures, this prevents the use of
64-bit atomics even if the ABI supports them.
Assume 64-bit atomic support from __WORDSIZE, which maps to how glibc
defines it before the broken change. Also rename __HAVE_64B_ATOMICS
to USE_64B_ATOMICS to define better the flag meaning.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Benchmarks indicate evex can be more profitable on Hygon hardware
than AVX512. So add Prefer_No_AVX512 to make it run with evex.
Change-Id: Icc59492f71fde7a783a8bd315714ffd6f7ecaf29
Signed-off-by: Li jing <lijing@hygon.cn>
Signed-off-by: Xie jiamei <xiejiamei@hygon.cn>
clang does not support the %v to select the AVX encoding, nor the '%d' asm
contrain, and for AVX build it requires all 3 arguments.
This patch add a new internal header, math-inline-asm.h, that adds
functions to abstract the inline asm required differences between
gcc and clang.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The only usage was for pthread_spin_lock, introduced by 12d2dd7060,
as a way to optimize the code for certain architectures. Now that atomic
builtins are used by default, let the compiler use the best code sequence
for the atomic exchange.
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Now that atomic builtins are used by default, we can rely on the
compiler to define when to use 64-bit atomic operations.
It allows the use of 64-bit atomic operations on some 32-bit ABIs where
they were not previously enabled due to missing pre-processor handling:
hppa, mips64n32, s390, and sparcv9.
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
All ABIs, except alpha and sparc, define it to
atomic_full_barrier/__sync_synchronize, which can be mapped to
__atomic_thread_fence (__ATOMIC_RELEASE).
For alpha, it uses a 'wmb' which does not map to any of C11
barriers.
For sparc it uses a stronger 'member #LoadStore | #StoreStore',
where the release barrier maps to just 'membar #StoreLoad'. The
patch keeps the sparc definition.
For PowerPC, it allows the use of lwsync for additional chips
(since _ARCH_PWR4 does not cover all chips that support it).
Tested on aarch64-linux-gnu.
Co-authored-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
All ABIs, except alpha, powerpc, and x86_64, define it to
atomic_full_barrier/__sync_synchronize, which can be mapped to
__atomic_thread_fence (__ATOMIC_SEQ_CST) in most cases, with the
exception of aarch64 (where the acquire fence is generated as
'dmb ishld' instead of 'dmb ish').
For s390x, it defaults to a memory barrier where __sync_synchronize
emits a 'bcr 15,0' (which the manual describes as pipeline
synchronization).
For PowerPC, it allows the use of lwsync for additional chips
(since _ARCH_PWR4 does not cover all chips that support it).
Tested on aarch64-linux-gnu, where the acquire produces a different
instruction that the current code.
Co-authored-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
All ABIs save for sparcv9 and s390 defines it to __sync_synchronize,
which can be mapped to __atomic_thread_fence (__ATOMIC_SEQ_CST).
For Sparc, it uses a stricter #StoreStore|#LoadStore|#StoreLoad|#LoadLoad
instead of the #StoreLoad generated by __sync_synchronize.
For s390x, it defaults to a memory barrier where __sync_synchronize
emits a 'bcr 15,0' (which the manual describes as pipeline synchronization).
The barrier is used only in one place (pthread_mutex_setprioceiling),
and using a stricter barrier for s390 is ok performance-wise.
Co-authored-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
These are already provided by the generic include/atomic.h.
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The clang default to warning for missing fall-through and it does
not support all comment-like annotation that gcc does. Use C23
[[fallthrough]] annotation instead.
proper attribute instead.
Reviewed-by: Collin Funk <collin.funk1@gmail.com>
- Performance testing revealed significant memcpy performance degradation
when bit_arch_AVX_Fast_Unaligned_Load is enabled on Hygon 3.
- Hygon confirmed AVX performance issues in certain memory functions.
- Glibc benchmarks show SSE outperforms AVX for
memcpy/memmove/memset/strcmp/strcpy/strlen and so on.
- Hardware differences primarily in floating-point operations don't justify
AVX usage for memory operations.
Reviewed-by: gaoxiang <gaoxiang@kylinos.cn>
Signed-off-by: litenglong <litenglong@kylinos.cn>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
fpu_control.h is an installed header so a wider range of compiler versions
(including ones older than GCC 9) are relevant with it than are relevant
for building glibc.
Fixes commit 3014dec3ad
('x86: Remove obsolete "*&" GCC asm memory operand workaround')
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Legacy encodings of SSE instructions incur AVX-SSE domain transition
penalties on some Intel microarchitectures (e.g. Haswell, Broadwell).
Using the VEX forms avoids these penatlies and keeps all instructions
in the VEX decode domain. Use "%v" sequence to emit the "v" prefix
for opcodes when compiling with -mavx.
No functional changes intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
GCC now accept plain variable names as valid lvalues for "m"
constraints, automatically spilling locals to memory if necessary.
The long-standing "*&" pattern was originally used as a defensive
workaround for older compiler versions that rejected operands
such as:
asm ("incl %0" : "+m"(x));
with errors like "memory input is not directly addressable".
Modern compilers (GCC >= 9) reliably generate correct code
without the workaround, and the resulting assembly is identical.
No functional changes intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The compiler might not inline the trunc function call for
USE_TRUNC_BUILTIN [1].
This patch adds an optimized __trunc/__truncf for x86 used
on modf ifunc variant to avoid the trunc libcall.
Checked on x86_64, x86_64-v2, x86_64-v3, and x86_64-v4. Used -O2 and
-Os options. Performed a full make check on x86_64 with both
optimizations.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121861
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The x86 version of thread_pointer.h is the same as the generic one.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: H.J.Lu <hjl.tools@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
GCC 12 is currently the minimum supported compiler version.
Remove no longer needed __GNUC_PREREQ (11, 1) test from
__thread_pointer().
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: H.J.Lu <hjl.tools@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
On x86-64, when glibc is configured with --enable-stack-protector=all
and compiled with -Os, ld.so crashes very early:
(gdb) r --direct
Starting program: /export/build/gnu/tools-build/glibc-gitlab/build-x86_64-linux/string/test-memswap --direct
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7f41b0a in bsearch (__key=__key@entry=0x7fffffffda28,
__base=__base@entry=0x7ffff7fca140 <intel_02_known>,
__nmemb=__nmemb@entry=68, __size=__size@entry=8,
__compar=__compar@entry=0x7ffff7f3b691 <intel_02_known_compare>)
at ../bits/stdlib-bsearch.h:22
22 {
(gdb) disass
Dump of assembler code for function bsearch:
0x00007ffff7f41af0 <+0>: push %r15
0x00007ffff7f41af2 <+2>: mov %rcx,%r15
0x00007ffff7f41af5 <+5>: push %r14
0x00007ffff7f41af7 <+7>: push %r13
0x00007ffff7f41af9 <+9>: mov %rsi,%r13
0x00007ffff7f41afc <+12>: push %r12
0x00007ffff7f41afe <+14>: mov %rdi,%r12
0x00007ffff7f41b01 <+17>: push %rbp
0x00007ffff7f41b02 <+18>: mov %rdx,%rbp
0x00007ffff7f41b05 <+21>: push %rbx
0x00007ffff7f41b06 <+22>: sub $0x18,%rsp
=> 0x00007ffff7f41b0a <+26>: mov %fs:0x28,%r14
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We can't use stack protector at this point.
0x00007ffff7f41b13 <+35>: mov %r14,0x8(%rsp)
0x00007ffff7f41b18 <+40>: mov %r8,%r14
0x00007ffff7f41b1b <+43>: test %rbp,%rbp
0x00007ffff7f41b1e <+46>: je 0x7ffff7f41b48 <bsearch+88>
0x00007ffff7f41b20 <+48>: mov %rbp,%rbx
0x00007ffff7f41b23 <+51>: mov %r12,%rdi
0x00007ffff7f41b26 <+54>: shr $1,%rbx
0x00007ffff7f41b29 <+57>: imul %r15,%rbx
0x00007ffff7f41b2d <+61>: add %r13,%rbx
0x00007ffff7f41b30 <+64>: mov %rbx,%rsi
(gdb) bt
#0 0x00007ffff7f41b0a in bsearch (__key=__key@entry=0x7fffffffda28,
__base=__base@entry=0x7ffff7fca140 <intel_02_known>,
__nmemb=__nmemb@entry=68, __size=__size@entry=8,
__compar=__compar@entry=0x7ffff7f3b691 <intel_02_known_compare>)
at ../bits/stdlib-bsearch.h:22
#1 0x00007ffff7f3c1be in intel_check_word (name=188, value=1979933440,
has_level_2=has_level_2@entry=0x7fffffffda7f,
no_level_2_or_3=no_level_2_or_3@entry=0x7fffffffda7e,
cpu_features=<optimized out>) at ../sysdeps/x86/dl-cacheinfo.h:217
#2 0x00007ffff7f3c29f in handle_intel (name=name@entry=188,
cpu_features=<optimized out>) at ../sysdeps/x86/dl-cacheinfo.h:279
#3 0x00007ffff7f3ccf9 in dl_init_cacheinfo (cpu_features=<optimized out>)
at ../sysdeps/x86/dl-cacheinfo.h:852
#4 init_cpu_features (cpu_features=<optimized out>)
at ../sysdeps/x86/cpu-features.c:1153
#5 0x00007ffff7f3d6f9 in __libc_start_main_impl (main=0x7ffff7f396dc <main>,
argc=2, argv=0x7fffffffdbe8, init=<optimized out>, fini=<optimized out>,
rtld_fini=0x0, stack_end=0x7fffffffdbd8) at ../csu/libc-start.c:269
#6 0x00007ffff7f39901 in _start () at ../sysdeps/x86_64/start.S:115
(gdb)
The problem is that since __USE_EXTERN_INLINES isn't defined with -Os,
the inline bsearch in <bits/stdlib-bsearch.h> isn't available and the
external bsearch is compiled with stack protector. Include
<bits/stdlib-bsearch.h> in dl-cacheinfo.h fixed BZ #33374.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
Use the flag output constraints feature available in gcc 6+
("=@cc<cond>") instead of explicitly setting a boolean variable
with SETcc instruction. This approach decouples the instruction
that sets the flags from the code that consumes them, allowing
the compiler to create better code when working with flags users.
Instead of e.g.:
lock add %esi,(%rdi)
sets %sil
test %sil,%sil
jne <...>
the compiler now generates:
lock add %esi,(%rdi)
js <...>
No functional changes intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: H.J.Lu <hjl.tools@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
wc -l pads the output with leading spaces on some systems, e.g. FreeBSD.
This results in the check `test "$count" = 1` failing. Use -eq for integer
comparison instead.
Signed-off-by: Henrik Lindström <henrik@lxm.se>
Reviewed-by: Arjun Shankar <arjun@redhat.com>
LIBC_TRY_TEST_CC_OPTION is defined with LIBC_TRY_CC_OPTION:
dnl Test a compiler option or options with an empty input file.
dnl LIBC_TRY_CC_OPTION([options], [action-if-true], [action-if-false])
AC_DEFUN([LIBC_TRY_CC_OPTION],
[AS_IF([AC_TRY_COMMAND([${CC-cc} $1 -xc /dev/null -S -o /dev/null])],
[$2], [$3])])
which passes -S to compiler. Unlike gcc, when -c is also passed to clang
20, we get
configure:7838: clang -c -Werror -fsemantic-interposition -xc /dev/null -S -o /dev/null
clang: error: argument unused during compilation: '-c' [-Werror,-Wunused-command-line-argument]
Don't pass -c to LIBC_TRY_TEST_CC_OPTION since -c isn't needed.
This fixes BZ #33318.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
If the building compiler enables no direct external data access by
default, access to protected data in shared libraries from executables
must be compiled with no direct external data access. If the testing
compiler doesn't support it, set have-protected-data to no to disable
the tests which requires no direct external data access.
Add LIBC_TRY_CC_COMMAND to test a building compiler option or options
with an input file.
This fixes BZ #33286.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
There are 3 variants of modf and modff: SSE2, SSE4.1 and AVX. s_modf.c
and s_modff.c include the generic implementation compiled with the minimum
x86 ISA level. The IFUNC selector is used only if the minimum ISA level
is less than AVX. SSE4.1 variant is included only if the ISA level is
less than SSE4.1. AVX variant is included only the ISA level is less than
AVX.
AVX variant should be compiled with -mavx, not -msse2avx -DSSE2AVX which
are used to encode SSE assembly sources with EVEX encoding.
The routines that are shared between libc and libm should use different
rules to avoid using the same MODULE_NAME, to avoid potential issues
like BZ #33165 where __stack_chk_fail not being routed to the internal
symbol.
Tested with -march=x86-64, -march=x86-64-v2, -march=x86-64-v3 and
-march=x86-64-v4.
This fixes BZ #33165 and BZ #33173.
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Since mcount_internal is called from mcount/__fentry__ which preserve
only RAX, RCX, RDX, RSI, RDI, R8 and R9, compile mcount.c with
-fno-tree-loop-distribute-patterns -mgeneral-regs-only -mno-apxf
to void vector/r16-r31 registers and memcpy/memset in mcount_internal.
This fixes BZ #33134.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Andreas K. Huettel <dilfridge@gentoo.org>
Update tst-gnu2-tls2 tests to set XMM0...XMM7 to all 1s in malloc to
verify that XMM registers are preserved when _dl_tlsdesc_dynamic is
called by clearing vectors with zeroed XMM registers before
_dl_tlsdesc_dynamic and using these XMM registers to clear vectors
after _dl_tlsdesc_dynamic. This improves the BZ #31372 test.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
Compiler generates the following instruction sequence for dynamic TLS
access:
leal tls_var@tlsgd(,%ebx,1), %eax
call ___tls_get_addr@PLT
CALL instruction is transparent to compiler which assumes all registers,
except for EFLAGS, AX, CX, and DX, are unchanged after CALL. But
___tls_get_addr is a normal function which doesn't preserve any vector
registers.
1. Rename the generic __tls_get_addr function to ___tls_get_addr_internal.
2. Change ___tls_get_addr to a wrapper function with implementations for
FNSAVE, FXSAVE, XSAVE and XSAVEC to save and restore all vector registers.
3. dl-tlsdesc-dynamic.h has:
_dl_tlsdesc_dynamic:
/* Like all TLS resolvers, preserve call-clobbered registers.
We need two scratch regs anyway. */
subl $32, %esp
cfi_adjust_cfa_offset (32)
It is wrong to use
movl %ebx, -28(%esp)
movl %esp, %ebx
cfi_def_cfa_register(%ebx)
...
mov %ebx, %esp
cfi_def_cfa_register(%esp)
movl -28(%esp), %ebx
to preserve EBX on stack. Fix it with:
movl %ebx, 28(%esp)
movl %esp, %ebx
cfi_def_cfa_register(%ebx)
...
mov %ebx, %esp
cfi_def_cfa_register(%esp)
movl 28(%esp), %ebx
4. Update _dl_tlsdesc_dynamic to call ___tls_get_addr_internal directly.
5. Add have-test-mtls-traditional to compile tst-tls23-mod.c with
traditional TLS variant to verify the fix.
6. Define DL_RUNTIME_RESOLVE_REALIGN_STACK in sysdeps/x86/sysdep.h.
This fixes BZ #32996.
Co-Authored-By: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
In init_cpu_features, replace GLRO(dl_x86_cpu_features) with
cpu_features to avoid an extra load.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Detect Intel Diamond Rapids and tune it similar to Intel Granite Rapids.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
Enable default tuning for unknown Intel processor.
Tested on x86, no regression.
Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
- Add ARROWLAKE model detection.
- Add PANTHERLAKE model detection.
- Add CLEARWATERFOREST model detection.
Intel® Architecture Instruction Set Extensions Programming Reference
https://cdrdv2.intel.com/v1/dl/getContent/671368 Section 1.2.
No regression, validated model detection on SDE.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Scan xstate IDs up to the maximum supported xstate ID. Remove the
separate AMX xstate calculation. Instead, exclude the AMX space from
the start of TILECFG to the end of TILEDATA in xsave_state_size.
Completed validation on SKL/SKX/SPR/SDE and compared xsave state size
with "ld.so --list-diagnostics" option, no regression.
Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
This fixes a test build failure on Hurd.
Fixes commit 145097dff1 ("x86: Use separate
variable for TLSDESC XSAVE/XSAVEC state size (bug 32810)").
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Previously, the initialization code reused the xsave_state_full_size
member of struct cpu_features for the TLSDESC state size. However,
the tunable processing code assumes that this member has the
original XSAVE (non-compact) state size, so that it can use its
value if XSAVEC is disabled via tunable.
This change uses a separate variable and not a struct member because
the value is only needed in ld.so and the static libc, but not in
libc.so. As a result, struct cpu_features layout does not change,
helping a future backport of this change.
Fixes commit 9b7091415a ("x86-64:
Update _dl_tlsdesc_dynamic to preserve AMX registers").
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
If we have to use XSAVE or XSAVEC trampolines, do not adjust the size
information they need. Technically, it is an operator error to try to
run with -XSAVE,-XSAVEC on such builds, but this change here disables
some unnecessary code with higher ISA levels and simplifies testing.
Related to commit befe2d3c4d
("x86-64: Don't use SSE resolvers for ISA level 3 or above").
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
powerpc was the only architecture with arch-specific hooks for
LD_SHOW_AUXV, and with the information moved to ld diagnostics there
is no need to keep the _dl_procinfo hook.
Checked with a build for all affected ABIs.
Reviewed-by: Peter Bergner <bergner@linux.ibm.com>