Simplify the alignment steps for SZREG and BLOCK_SIZE multiples. The previous
three-instruction sequences
  addi a7, a2, -SZREG
  andi a7, a7, -SZREG
  addi a7, a7, SZREG
and
  addi a7, a2, -BLOCK_SIZE
  andi a7, a7, -BLOCK_SIZE
  addi a7, a7, BLOCK_SIZE
are each equivalent to a single andi, respectively
  andi a7, a2, -SZREG
  andi a7, a2, -BLOCK_SIZE
because SZREG and BLOCK_SIZE are powers of two in this context, making the
surrounding addi steps cancel out. Folding to one instruction reduces code
size with identical semantics.
No functional change.
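The equivalence can be checked outside the assembly with a minimal C
sketch. The function names are hypothetical and the alignment values
used in the usage note are assumptions for illustration, not the actual
SZREG or BLOCK_SIZE definitions; unsigned wraparound stands in for the
register arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Old form: addi -align; andi -align; addi +align.  */
static uintptr_t
round_down_old (uintptr_t x, uintptr_t align)
{
  uintptr_t t = x - align;   /* May wrap; the final add wraps back.  */
  t &= -align;
  return t + align;
}

/* New form: a single andi -align.  */
static uintptr_t
round_down_new (uintptr_t x, uintptr_t align)
{
  return x & -align;
}

/* Return 1 if both forms agree for every x in [0, limit).
   ALIGN must be a nonzero power of two.  */
static int
forms_agree (uintptr_t align, uintptr_t limit)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  for (uintptr_t x = 0; x < limit; x++)
    if (round_down_old (x, align) != round_down_new (x, align))
      return 0;
  return 1;
}
```

Calling forms_agree (8, 4096) or forms_agree (64, 4096) exercises the
small-input case where the first subtraction wraps around.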
sysdeps/riscv/multiarch/memcpy_noalignment.S: Remove redundant addi around
alignment; keep a single andi for SZREG/BLOCK_SIZE rounding.
Signed-off-by: Yao Zihong <zihong.plct@isrc.iscas.ac.cn>
Reviewed-by: Peter Bergner <bergner@tenstorrent.com>
Tidy the temporary register allocation to favor registers eligible for
compressed encodings when Zca/Zcb are enabled. This keeps the ABI and
clobber set unchanged and does not alter control flow or memory access
behavior.
No functional change.
sysdeps/riscv/multiarch/memcpy_noalignment.S: Reassign temps to improve
compressed encoding opportunities.
Signed-off-by: Yao Zihong <zihong.plct@isrc.iscas.ac.cn>
Reviewed-by: Peter Bergner <bergner@tenstorrent.com>
It improves latency by about 3-10% and throughput by about 5-15%.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
It improves latency by about 1-10% and throughput by about 5-10%.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
It improves latency by about 3-7% and throughput by about 5-10%.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
It improves latency by about 2% and throughput by about 5%.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
It improves latency by about 2-10% and throughput by about 5-10%.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
It improves latency by about 3-10% and throughput by about 5-10%.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The m68k port provided an optimized version through __m81_u(fmod)
(mathimpl.h), and gcc does not implement it through a builtin
(unlike i386).
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The m68k port provided an optimized version through __m81_u(fmodf)
(mathimpl.h), and gcc does not implement it through a builtin
(unlike i386).
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The optimized i386 version is faster than the generic one, and gcc
implements it through the builtin. It allows us to move the
implementation to a C one.
The performance on a Zen3 chip is slightly better:
reciprocal-throughput  input            master  no-SVID  improvement
i686                   subnormals      22.4741  20.1571       10.31%
i686                   normal          74.1631  70.3606        5.13%
i686                   close-exponent  22.5625  20.2435       10.28%
Tested on i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The optimized i386 version is faster than the generic one, and gcc
implements it through the builtin. It allows us to move the
implementation to a C one. The performance on a Zen3 chip is
similar to the SVID one.
Tested on i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The ifunc selector for wmemset had a stray '!' in the
X86_ISA_CPU_FEATURES_ARCH_P(...) check:
  if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)
      && X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
                                      AVX_Fast_Unaligned_Load, !))
This effectively negated the predicate and caused the AVX2/AVX512
paths to be skipped, making the dispatcher fall back to the SSE2
implementation even on CPUs where AVX2/AVX512 are available. The
regression leads to noticeable throughput loss for wmemset.
Remove the stray '!' so the AVX_Fast_Unaligned_Load capability is
tested as intended and the correct AVX2/EVEX variants are selected.
Impact:
- On AVX2/AVX512-capable x86_64, wmemset no longer incorrectly
falls back to SSE2; perf now shows __wmemset_evex/avx2 variants.
Testing:
- benchtests/bench-wmemset shows improved bandwidth across sizes.
- perf confirms the selected symbol is no longer SSE2.
Signed-off-by: xiejiamei <xiejiamei@hygon.com>
Signed-off-by: Li jing <lijing@hygon.cn>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
It issues:
../sysdeps/aarch64/tst-ifunc-arg-4.c:39:1: error: unused function 'resolver' [-Werror,-Wunused-function]
39 | resolver (uint64_t arg0, const uint64_t arg1[])
| ^~~~~~~~
1 error generated.
clang-19 and onwards do not trigger the warning.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Recent lld versions default to --no-undefined-version, which triggers
errors when building multiple libraries. For ld.so on x86_64 it fails
with:
ld.lld: error: version script assignment of 'GLIBC_2.4' to symbol '__stack_chk_guard' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_PRIVATE' to symbol '__nptl_set_robust_list_avail' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_PRIVATE' to symbol '__pointer_chk_guard' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_PRIVATE' to symbol '_dl_starting_up' failed: symbol not defined
While for libc.so:
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_clearerr' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_fgetc' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_fileno' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_freopen' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_fscanf' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_fseek' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_peekc_unlocked' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_stderr_' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_stdin_' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_stdout_' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_pclose' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_perror' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_rewind' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_scanf' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_setbuf' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_setlinebuf' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_wdefault_setbuf' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_wfile_setbuf' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '__ctype32_tolower' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '__ctype32_toupper' failed: symbol not defined
ld.lld: error: too many errors emitted, stopping now (use --error-limit=0 to see all errors)
The version script is created with multiple missing symbols to simplify
the build for multiple ABIs, each of which may have different symbols.
For instance, __stack_chk_guard is defined by default. This avoids
requiring each ABI to add this symbol to its version script, depending
on the stack protector ABI it uses.
The libc.so warnings do show unused symbols being defined (like
_IO_clearerr), which might trigger potential errors depending on how
symbols are exported. However, since we already have ABI checks for
missing and extra symbols, the linker's extra checks are not really
necessary.
The --undefined-version behavior is the default for ld.bfd.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
clang 20 issues a warning for the unused '-c' argument used to create
errlist-data-aux-shared.S, errlist-data-aux.S, siglist-aux-shared.S,
and siglist-aux.S. Filter out the '-c' from the $(compile-command.c).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
clang warns about missing fall-through annotations by default, and it
does not support all of the comment-like annotations that gcc does.
Use the C23 [[fallthrough]] attribute instead.
Reviewed-by: Collin Funk <collin.funk1@gmail.com>
The internal header redefines some internal argp functions with
attribute_hidden, which triggers a clang warning about mismatched
attributes.
Reviewed-by: Collin Funk <collin.funk1@gmail.com>
The argp code uses macro redefinitions to avoid duplicating static inline
implementations for argp_usage, _option_is_short, and _option_is_end.
However, this causes build issues with clang, as some function prototypes
are redefined to add the hidden attribute with libc_hidden_proto.
To avoid extensive changes to internal headers, just expand the function
implementations and avoid the macro redefinition tricks.
Reviewed-by: Collin Funk <collin.funk1@gmail.com>
Programs in $(others-noinstall) are internal to the glibc build and
aren't installed. They should be treated like programs in $(others),
but linked like tests so that --enable-hardcoded-path-in-tests also
applies to them.
Also replace run-via-rtld-prefix with test-via-rtld-prefix when running
container tests.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: DJ Delorie <dj@redhat.com>
The setrlimit(2) function returns 0 on success and -1 on error, but
several test files were incorrectly checking for a return value of 1
to detect errors. This means the error checks would never trigger,
causing tests to continue silently even when setrlimit() failed.
This commit fixes the error checks in five files to correctly test
for -1, matching both the documented behavior and the pattern used
correctly in other parts of the codebase.
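As a minimal sketch of the corrected check (not code from the affected
tests), the following compares the setrlimit return value against -1.
The helper name and the choice of RLIMIT_CORE are assumptions made for
illustration only.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: re-apply the current RLIMIT_CORE limits.
   Returns 0 on success and -1 on failure.  */
static int
reapply_core_limit (void)
{
  struct rlimit rl;

  if (getrlimit (RLIMIT_CORE, &rl) == -1)
    {
      perror ("getrlimit");
      return -1;
    }

  /* setrlimit reports failure with -1.  A comparison against 1 can
     never be true, so the failure would go unnoticed.  */
  if (setrlimit (RLIMIT_CORE, &rl) == -1)
    {
      perror ("setrlimit");
      return -1;
    }

  return 0;
}
```

Re-applying the limits that getrlimit just returned is expected to
succeed, so the helper only exercises the shape of the error check.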
Signed-off-by: Osama Abdelkader <osama.abdelkader@gmail.com>
Reviewed-by: Collin Funk <collin.funk1@gmail.com>
The C2y function uimaxabs has been renamed to umaxabs. Implement this
change in glibc, keeping a compat symbol under the old name, copying
the test to test the new name and changing the old test to test the
compat symbol. Jakub has done the corresponding change to the
built-in function in GCC.
Tested for x86_64 and x86.
For lgamma and tgamma the muldd, mulddd, and polydd are renamed
to muldd2, mulddd2, and polydd2 respectively.
Checked on aarch64-linux-gnu and x86_64-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The common code definitions are consolidated in s_erf_common.h
and s_erf_common.c.
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The shared internal data definitions are consolidated in
s_erf_data.c and the erfc-only ones are moved to s_erfc_data.c.
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The current implementation shows the following accuracy on three
ranges ([-DBL_MAX, -4.2], [-4.2, 4.2], [4.2, DBL_MAX]) with 10e9
uniformly distributed random numbers for each range (first column
is the accuracy in ULP, with '0' being correctly rounded, second is the
number of samples with the corresponding precision):
* Range [-DBL_MAX, -4.2]
  * FE_TONEAREST
    0: 10000000000 100.00%
  * FE_UPWARD
    0: 10000000000 100.00%
  * FE_DOWNWARD
    0: 10000000000 100.00%
  * FE_TOWARDZERO
    0: 10000000000 100.00%
* Range [-4.2, 4.2]
  * FE_TONEAREST
    0: 9764404513 97.64%
    1: 235595487 2.36%
  * FE_UPWARD
    0: 9468013928 94.68%
    1: 531986072 5.32%
  * FE_DOWNWARD
    0: 9493787693 94.94%
    1: 506212307 5.06%
  * FE_TOWARDZERO
    0: 9585271351 95.85%
    1: 414728649 4.15%
* Range [4.2, DBL_MAX]
  * FE_TONEAREST
    0: 10000000000 100.00%
  * FE_UPWARD
    0: 10000000000 100.00%
  * FE_DOWNWARD
    0: 10000000000 100.00%
  * FE_TOWARDZERO
    0: 10000000000 100.00%
The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).
Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:
reciprocal-throughput    master   patched  improvement
x86_64                  38.2754   78.0311     -103.87%
x86_64v2                38.3325   75.7555      -97.63%
x86_64v3                34.6604   28.3182       18.30%
aarch64                 23.1499   21.4307        7.43%
power10                 12.3051    9.3766       23.80%
Latency                  master   patched  improvement
x86_64                  84.3062  121.3580      -43.95%
x86_64v2                84.1817  117.4250      -39.49%
x86_64v3                81.0933   70.6458       12.88%
aarch64                  35.012   29.5012       15.74%
power10                 21.7205   18.4589       15.02%
For x86_64/x86_64-v2, most of the performance hit comes from the fma
call through the ifunc mechanism.
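The improvement column in the tables above appears to be the relative
change with respect to master, (master - patched) / master, so a
negative value means the patched code is slower. A minimal sketch of
that arithmetic, with a hypothetical helper name:

```c
/* Relative improvement in percent; positive means PATCHED is faster
   (lower time) than MASTER.  Assumes MASTER is nonzero.  */
static double
improvement_percent (double master, double patched)
{
  return (master - patched) / master * 100.0;
}
```

For example, the power10 reciprocal-throughput row (12.3051 versus
9.3766) gives roughly 23.80, and the x86_64 row (38.2754 versus
78.0311) gives roughly -103.87.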
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The internal data definitions are moved to s_atanh_data.c.
It helps on ABIs that build the implementation multiple times for
ifunc optimizations, like x86_64.
Reviewed-by: DJ Delorie <dj@redhat.com>
The current implementation shows the following accuracy on one range
([-1, 1]) with 10e9 uniformly distributed random numbers (first column
is the accuracy in ULP, with '0' being correctly rounded, second is the
number of samples with the corresponding precision):
* Range [-1, 1]
  * FE_TONEAREST
    0: 8180011860 81.80%
    1: 1819865257 18.20%
    2: 122883 0.00%
  * FE_UPWARD
    0: 3903695744 39.04%
    1: 4992324465 49.92%
    2: 1096319340 10.96%
    3: 7660451 0.08%
  * FE_DOWNWARD
    0: 3904555484 39.05%
    1: 4991970864 49.92%
    2: 1095447471 10.95%
    3: 8026181 0.08%
  * FE_TOWARDZERO
    0: 7070209165 70.70%
    1: 2908447434 29.08%
    2: 21343401 0.21%
The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).
Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:
reciprocal-throughput    master   patched  improvement
x86_64                  26.4969   22.4625       15.23%
x86_64v2                26.0792   22.9822       11.88%
x86_64v3                25.6357   22.2147       13.34%
aarch64                 20.2295   19.7001        2.62%
power10                 10.0986    9.3846        7.07%
Latency                  master   patched  improvement
x86_64                  80.2311   59.9745       25.25%
x86_64v2                79.7010   61.4066       22.95%
x86_64v3                78.2679   58.5804       25.15%
aarch64                 34.3959   28.1523       18.15%
power10                 23.2417   18.2694       21.39%
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
Since SSIZE_MAX is less than UINT_MAX on 32-bit platforms we must AND
the expression with SSIZE_MAX.
Tested on x86_64 and x86.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
- Performance testing revealed significant memcpy performance degradation
when bit_arch_AVX_Fast_Unaligned_Load is enabled on Hygon 3.
- Hygon confirmed AVX performance issues in certain memory functions.
- Glibc benchmarks show SSE outperforms AVX for
memcpy/memmove/memset/strcmp/strcpy/strlen and so on.
- Hardware differences primarily in floating-point operations don't justify
AVX usage for memory operations.
Reviewed-by: gaoxiang <gaoxiang@kylinos.cn>
Signed-off-by: litenglong <litenglong@kylinos.cn>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The tcache is used for allocation only if an exact match is found. In the
large tcache code added in commit cbfd798810, we currently extract a
chunk of size greater than or equal to the size we need, but don't check
strict equality. This patch fixes that behaviour.
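The exact-match rule can be illustrated with a minimal sketch. This is
not the glibc tcache code: the entry type, the function name, and the
assumption that a bin is a singly linked list sorted by ascending size
are all hypothetical and serve only to show the added equality check.

```c
#include <stddef.h>

/* Hypothetical cache entry; not glibc's tcache_entry.  */
struct entry
{
  size_t size;
  struct entry *next;
};

/* BIN is assumed to be sorted by ascending size.  Unlink and return
   the entry whose size equals NB exactly, or NULL if there is none.  */
static struct entry *
take_exact (struct entry **bin, size_t nb)
{
  struct entry **link = bin;

  /* Skip entries that are too small.  */
  while (*link != NULL && (*link)->size < nb)
    link = &(*link)->next;

  /* Without the equality test a larger entry would be handed out.  */
  if (*link == NULL || (*link)->size != nb)
    return NULL;

  struct entry *found = *link;
  *link = found->next;
  found->next = NULL;
  return found;
}
```

With entries of size 32, 64, and 96 in a bin, a request for 48 returns
NULL instead of the 64-byte entry, while a request for 64 unlinks it.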
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
clang issues:
syslog.c:193:9: error: adding 'int' to a string does not append to the string [-Werror,-Wstring-plus-int]
193 | SYSLOG_HEADER (pri, timestamp, &msgoff, pid));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
syslog.c:180:7: note: expanded from macro 'SYSLOG_HEADER'
180 | "[" + (pid == 0), pid, "]" + (pid == 0)
Use array indexes instead of string addition (it is simpler than
adding a warning suppression).
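A minimal sketch of the indexing form, assuming a simplified header
that only prints the optional "[pid]" tag (the helper name and format
string are illustrative, not the SYSLOG_HEADER macro itself):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "[pid]" into BUF when PID is nonzero,
   or an empty string when PID is 0.  */
static int
format_pid_tag (char *buf, size_t len, int pid)
{
  /* &"["[pid == 0] points at "[" when pid != 0 and at the terminating
     NUL (an empty string) when pid == 0.  It yields the same pointer
     as "[" + (pid == 0) without triggering -Wstring-plus-int.
     "%.0d" prints nothing for a zero value.  */
  return snprintf (buf, len, "%s%.0d%s",
                   &"["[pid == 0], pid, &"]"[pid == 0]);
}
```

With pid 1234 the buffer holds "[1234]"; with pid 0 it holds an empty
string.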