Commit Graph

187 Commits

Author SHA1 Message Date
Adhemerval Zanella 7fec8a5de6 Revert __HAVE_64B_ATOMICS configure check
The 53807741fb added a configure check
for 64-bit atomic operations that were not previously enabled on some
32-bit ABIs.

However, the NPTL semaphore code casts a sem_t to a new_sem and issues
a 64-bit atomic operation for __HAVE_64B_ATOMICS.  Since sem_t has
32-bit alignment on 32-bit architectures, this prevents the use of
64-bit atomics even if the ABI supports them.

Assume 64-bit atomic support from __WORDSIZE, which maps to how glibc
defines it before the broken change.  Also rename __HAVE_64B_ATOMICS
to USE_64B_ATOMICS to define better the flag meaning.

Checked on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-14 14:05:20 -03:00
Peter Bergner 47975914fb riscv: Add vector registers to __SYSCALL_CLOBBERS
The Linux kernel ABI specifies that the vector registers are not preserved
across system calls, but the __SYSCALL_CLOBBERS macro doesn't mention them.
This could possibly lead to compilers trying to keep data in the vector
registers across the syscall leading to corruption.  Add the vector registers
to __SYSCALL_CLOBBERS when the vector extension is enabled.  If the vector
extension is enabled, then require GCC 15 or later and RVV 1.0 or later.

Fixes: 36960f0c76 ("RISC-V: Linux Syscall Interface")

Signed-off-by: Peter Bergner <bergner@tenstorrent.com>
2025-11-04 09:18:56 -06:00
Wilco Dijkstra 324c088a18 nptl: Remove ATOMIC_EXCHANGE_USES_CAS usage
The only usage was for pthread_spin_lock, introduced by 12d2dd7060,
as a way to optimize the code for certain architectures. Now that atomic
builtins are used by default, let the compiler use the best code sequence
for the atomic exchange.

Co-authored-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Wilco Dijkstra 53807741fb Define __HAVE_64B_ATOMICS from compiler support
Now that atomic builtins are used by default, we can rely on the
compiler to define when to use 64-bit atomic operations.

It allows the use of 64-bit atomic operations on some 32-bit ABIs where
they were not previously enabled due to missing pre-processor handling:
hppa, mips64n32, s390, and sparcv9.

Co-authored-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Adhemerval Zanella 70ee250fb8 atomic: Consolidate atomic_full_barrier implementation
All ABIs save for sparcv9 and s390 defines it to __sync_synchronize,
which can be mapped to __atomic_thread_fence (__ATOMIC_SEQ_CST).

For Sparc, it uses a stricter #StoreStore|#LoadStore|#StoreLoad|#LoadLoad
instead of the #StoreLoad generated by __sync_synchronize.

For s390x, it defaults to a memory barrier where __sync_synchronize
emits a 'bcr 15,0' (which the manual describes as pipeline synchronization).

The barrier is used only in one place (pthread_mutex_setprioceiling),
and using a stricter barrier for s390 is ok performance-wise.

Co-authored-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Adhemerval Zanella 1f5d8663ea riscv: Consolidade atomic-machine.h and remove ununsed atomic macros
The resulting definitions are not Linux specific, so move the header
to generic sysdep folder.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Pincheng Wang 720e891637 riscv: Add Zbkb optimized repeat_bytes helper
Introduce a RISC-V specific string-misc.h to provide an optimized
repeat_bytes implementation when the Zbkb extension is available.
The new version uses packh/packw/pack instruction count and avoiding
high latency instructions. This helper is used by several mem and string
functions, and falls back to the generic implementation when Zbkb is not
present.

Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
Reviewed-by: Peter Bergner <bergner@tenstorrent.com>
2025-10-31 16:23:57 -05:00
Yao Zihong 09a94c86ca riscv: memcpy_noalignment: Reorder to store via a3, then bump a3
Rewrite the copy micro-step from:

    REG_L  a4, 0(a5)
    addi   a3, a3, SZREG
    addi   a5, a5, SZREG
    REG_S  a4, -SZREG(a3)

to:

    REG_L  a4, 0(a5)
    addi   a5, a5, SZREG
    REG_S  a4, 0(a3)
    addi   a3, a3, SZREG

Semantics are unchanged: both read *(a5_old), write *(a3_old), and then
increment a3/a5 by SZREG. memcpy assumes non-overlapping regions, so the
reordering preserves correctness.

No functional change.

Signed-off-by: Yao Zihong <zihong.plct@isrc.iscas.ac.cn>
Reviewed-by: Peter Bergner <bergner@tenstorrent.com>
2025-10-30 17:49:21 -05:00
Yao Zihong 0698fd462a riscv: memcpy_noalignment: Fold SZREG/BLOCK_SIZE alignment to single andi
Simplify the alignment steps for SZREG and BLOCK_SIZE multiples. The previous
three-instruction sequences

    addi    a7, a2, -SZREG
    andi    a7, a7, -SZREG
    addi    a7, a7, SZREG

and

    addi    a7, a2, -BLOCK_SIZE
    andi    a7, a7, -BLOCK_SIZE
    addi    a7, a7, BLOCK_SIZE

are equivalent to a single

    andi    a7, a2, -SZREG
    andi    a7, a2, -BLOCK_SIZE

because SZREG and BLOCK_SIZE are powers of two in this context, making the
surrounding addi steps cancel out. Folding to one instruction reduces code
size with identical semantics.

No functional change.

sysdeps/riscv/multiarch/memcpy_noalignment.S: Remove redundant addi around
alignment; keep a single andi for SZREG/BLOCK_SIZE rounding.

Signed-off-by: Yao Zihong <zihong.plct@isrc.iscas.ac.cn>
Reviewed-by: Peter Bergner <bergner@tenstorrent.com>
2025-10-30 17:47:24 -05:00
Yao Zihong 444d81284e riscv: memcpy_noalignment: Make register allocation Zca-friendly
Tidy the temporary register allocation to favor registers eligible for
compressed encodings when Zca/Zcb are enabled. This keeps the ABI and
clobber set unchanged and does not alter control flow or memory access
behavior.

No functional change.

sysdeps/riscv/multiarch/memcpy_noalignment.S: Reassign temps to improve
compressed encoding opportunities.

Signed-off-by: Yao Zihong <zihong.plct@isrc.iscas.ac.cn>
Reviewed-by: Peter Bergner <bergner@tenstorrent.com>
2025-10-30 17:44:58 -05:00
Joseph Myers 273f80374a Fix RISC-V soft-float _FPU_SETCW for GCC 16 set-but-not-used warnings
The soft-float RISC-V definition of _FPU_SETCW results in a
-Werror=unused-but-set-variable= build failure with GCC mainline (in
math/setfpucw.c) because it does not use the dummy cw variable.
Change it to (void) (cw) as on other architectures to avoid this build
failure.

Tested with build-many-glibcs.py (compilers) for
riscv64-linux-gnu-rv64imac-lp64, which previously failed.
Reviewed-by: Sam James <sam@gentoo.org>
2025-09-13 07:11:44 -07:00
Adhemerval Zanella 63ba1a1509 math: Add fetestexcept internal alias
To avoid linknamespace issues on old standards.  It is required
if the fallback fma implementation is used if/when it is also
used internally for other implementation.
Reviewed-by: DJ Delorie <dj@redhat.com>
2025-09-11 14:46:07 -03:00
Adhemerval Zanella 2eb8836de7 math: Add feclearexcept internal alias
To avoid linknamespace issues on old standards.  It is required
if the fallback fma implementation is used if/when it is also
used internally for other implementation.
Reviewed-by: DJ Delorie <dj@redhat.com>
2025-09-11 14:46:07 -03:00
Julian Zhu 4c966c0780 RISC-V: Use builtin for ffs and ffsll while supported extension available
Hardware ctz instructions are available in the RISC-V Zbb and XTheadBb extension. With special `-march` flags defined, we can generate more simplified code compared to the generic implementation of `ffs`/`ffsll`.

Signed-off-by: Julian Zhu <julian.oerv@isrc.iscas.ac.cn>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-04-28 09:51:59 -03:00
Adhemerval Zanella 3e8814903c math: Refactor how to use libm-test-ulps
The current approach tracks math maximum supported errors by explicitly
setting them per function and architecture. On newer implementations or
new compiler versions, the file is updated with newer values if it
shows higher results. The idea is to track the maximum known error, to
update the manual with the obtained values.

The constant libm-test-ulps shows little value, where it is usually a
mechanical change done by the maintainer, for past releases it is
usually ignored whether the ulp change resulted from a compiler
regression, and the math tests already have a maximum ulp error that
triggers a regression.

It was shown by a recent update after the new acosf [1] implementation
that is correctly rounded, where the libm-test-ulps was indeed from a
compiler issue.

This patch removes all arch-specific libm-test-ulps, adds system generic
libm-test-ulps where applicable, and changes its semantics. The generic
files now track specific implementation constraints, like if it is
expected to be correctly rounded, or if the system-specific has
different error expectations.

Now multiple libm-test-ulps can be defined, and system-specific
overrides generic implementation.  This is for the case where
arch-specific implementation might show worse precision than generic
implementation, for instance, the cbrtf on i686.

Regressions are only reported if the implementation shows larger errors
than 9 ulps (13 for IBM long double) unless it is overridden by
libm-test-ulps and the maximum error is not printed at the end of tests.
The regen-ulps rule is also removed since it does not make sense to
update the libm-test-ulps automatically.

The manual error table is also removed, Paul Zimmermann and others have
been tracking libm precision with a more comprehensive analysis for some
releases; so link to his work instead.

[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=9cc9f8e11e8fb8f54f1e84d9f024917634a78201
2025-03-12 13:40:07 -03:00
Yangyu Chen 3fd2ff7685 RISC-V: Fix IFUNC resolver cannot access gp pointer
In some cases, an IFUNC resolver may need to access the gp pointer to
access global variables. Such an object may have l_relocated == 0 at
this time. In this case, an IFUNC resolver will fail to access a global
variable and cause a SIGSEGV.

This patch fixes this issue by relaxing the check of l_relocated in
elf_machine_runtime_setup, but added a check for SHARED case to avoid
using this code in static-linked executables. Such object have already
set up the gp pointer in load_gp function and l->l_scope will be NULL if
it is a pie object. So if we use these code to set up the gp pointer
again for static-pie, it will causing a SIGSEGV in glibc as original bug
on BZ #31317.

I have also reproduced and checked BZ #31317 using the mold commit
bed5b1731b ("illumos: Treat absolute symbols specially"), this patch can
fix the issue.

Also, we used the wrong gp pointer previously because ref->st_value is
not the relocated address but just the offset from the base address of
ELF. An edge case may happen if we reference gp pointer in a IFUNC
resolver in a PIE object, but it will not happen in compiler-generated
codes since -pie will disable relax to gp. In this case, the GP will be
initialized incorrectly since the ref->st_value is not the address after
relocation. This patch fixes this issue by adding the l->l_addr to
ref->st_value to get the relocated address for the gp pointer. We don't
use SYMBOL_ADDRESS macro here because __global_pointer$ is a special
symbol that has SHN_ABS type, but it will use PC-relative addressing in
the load_gp function using lla.

Closes: BZ #32269
Fixes: 96d1b9ac23 ("RISC-V: Fix the static-PIE non-relocated object check")

Co-authored-by: Vivian Wang <dramforever@live.com>
Signed-off-by: Yangyu Chen <cyy@cyyself.name>
2025-02-25 13:08:53 +01:00
Adhemerval Zanella 8f170dc819 math: Use tanpif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic tanpif.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                      master        patched   improvement
x86_64                      85.1683        47.7990        43.88%
x86_64v2                    76.8219        41.4679        46.02%
x86_64v3                    73.7775        37.7734        48.80%
aarch64 (Neoverse)          35.4514        18.0742        49.02%
power8                      22.7604        10.1054        55.60%
power10                     22.1358         9.9553        55.03%

reciprocal-throughput        master        patched   improvement
x86_64                      41.0174        19.4718        52.53%
x86_64v2                    34.8565        11.3761        67.36%
x86_64v3                    34.0325         9.6989        71.50%
aarch64 (Neoverse)          25.4349         9.2017        63.82%
power8                      13.8626         3.8486        72.24%
power10                     11.7933         3.6420        69.12%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella de2fca9fe2 math: Use sinpif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic sinpif.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                      master        patched   improvement
x86_64                      47.5710        38.4455        19.18%
x86_64v2                    46.8828        40.7563        13.07%
x86_64v3                    44.0034        34.1497        22.39%
aarch64 (Neoverse)          19.2493        14.1968        26.25%
power8                      23.5312        16.3854        30.37%
power10                     22.6485        10.2888        54.57%

reciprocal-throughput        master        patched   improvement
x86_64                      21.8858        11.6717        46.67%
x86_64v2                    22.0620        11.9853        45.67%
x86_64v3                    21.5653        11.3291        47.47%
aarch64 (Neoverse)          13.0615         6.5499        49.85%
power8                      16.2030         6.9580        57.06%
power10                     12.8911         4.2858        66.75%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella be85208b9f math: Use cospif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic cospif.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                    master        patched   improvement
x86_64                    47.4679        38.4157        19.07%
x86_64v2                  46.9686        38.3329        18.39%
x86_64v3                  43.8929        31.8510        27.43%
aarch64 (Neoverse)        18.8867        13.2089        30.06%
power8                    22.9435         7.8023        65.99%
power10                   15.4472        7.77505        49.67%

reciprocal-throughput      master        patched   improvement
x86_64                    20.9518        11.4991        45.12%
x86_64v2                  19.8699        10.5921        46.69%
x86_64v3                  19.3475         9.3998        51.42%
aarch64 (Neoverse)        12.5767         6.2158        50.58%
power8                    15.0566         3.2654        78.31%
power10                    9.2866         3.1147        66.46%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella 95a01ea955 math: Use atanpif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic atanpif.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                     master        patched   improvement
x86_64                     66.3296        52.7558        20.46%
x86_64v2                   66.0429        51.4007        22.17%
x86_64v3                   60.6294        48.7876        19.53%
aarch64 (Neoverse)         24.3163        20.9110        14.00%
power8                     16.5766        13.3620        19.39%
power10                    16.5115        13.4072        18.80%

reciprocal-throughput       master        patched   improvement
x86_64                     30.8599        16.0866        47.87%
x86_64v2                   29.2286        15.4688        47.08%
x86_64v3                   23.0960        12.8510        44.36%
aarch64 (Neoverse)         15.4619        10.6752        30.96%
power8                      7.9200         5.2483        33.73%
power10                     6.8539         4.6262        32.50%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella 1cd9ccd8c0 math: Use atan2pif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic atan2pif.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                 master        patched   improvement
x86_64                 79.4006        70.8726        10.74%
x86_64v2               77.5136        69.1424        10.80%
x86_64v3               71.8050        68.1637         5.07%
aarch64 (Neoverse)     27.8363        24.7700        11.02%
power8                 39.3893        17.2929        56.10%
power10                19.7200        16.8187        14.71%

reciprocal-throughput   master        patched   improvement
x86_64                 38.3457        30.9471        19.29%
x86_64v2               37.4023        30.3112        18.96%
x86_64v3               33.0713        24.4891        25.95%
aarch64 (Neoverse)     19.3683        15.3259        20.87%
power8                 19.5507        8.27165        57.69%
power10                9.05331        7.63775        15.64%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella ae679a0aca math: Use asinpif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic asinpif.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                 master        patched   improvement
x86_64                 46.4996        41.6126        10.51%
x86_64v2               46.7551        38.8235        16.96%
x86_64v3               42.6235        33.7603        20.79%
aarch64 (Neoverse)     17.4161        14.3604        17.55%
power8                 10.7347         9.0193        15.98%
power10                10.6420         9.0362        15.09%

reciprocal-throughput   master        patched   improvement
x86_64                 24.7208        16.5544        33.03%
x86_64v2               24.2177        14.8938        38.50%
x86_64v3               20.5617        10.5452        48.71%
aarch64 (Neoverse)     13.4827        7.17613        46.78%
power8                 6.46134        3.56089        44.89%
power10                5.79007        3.49544        39.63%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella edb2a8f0ae math: Use acospif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic acospif.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                  master        patched   improvement
x86_64                  54.8281        42.9070        21.74%
x86_64v2                54.1717        42.7497        21.08%
x86_64v3                49.3552        34.1512        30.81%
aarch64 (Neoverse)      17.9395        14.3733        19.88%
power8                  20.3110         8.8609        56.37%
power10                 11.3113        8.84067        21.84%

reciprocal-throughput    master        patched   improvement
x86_64                  21.2301        14.4803        31.79%
x86_64v2                20.6858        13.9506        32.56%
x86_64v3                16.1944        11.3377        29.99%
aarch64 (Neoverse)      11.4474        7.13282        37.69%
power8                  10.6916        3.57547        66.56%
power10                 4.64269        3.54145        23.72%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Florian Weimer 4a9a8a5098 Add missing include guards to <dl-tls.h>
Some architecture-specific variants lack header inclusion guards.
Add them for consistency with the generic version.
2025-01-10 19:02:47 +01:00
Florian Weimer 7a3e2e877a Move <thread_pointer.h> to kernel-independent sysdeps directories
Hurd is expected to use the same thread ABI as Linux.

Reviewed-by: Michael Jeanson <mjeanson@efficios.com>
2025-01-09 19:30:16 +01:00
Darius Rad 274f73978c
riscv: Update nofpu libm test ulps
Fixes math test failures.

Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
2025-01-08 21:38:24 +01:00
Aurelien Jarno d4b16e22e7
RISC-V: Regenerate ULPs
Generated on a VisionFive 2 board running Linux version 6.12.6 and
GCC 14.2.0.

Needed due to:
- commit bbd578b38d ("math: Use expm1f from CORE-MATH")
- commit 8ae9e51376 ("math: Use log1pf from CORE-MATH")
- commit 0ae0af68d8 ("Implement C23 cospi")
- commit 776938e8b8 ("Implement C23 sinpi")
- commit f9e90e4b4c ("Implement C23 tanpi")
- commit 28d102d15c ("Implement C23 acospi")
- commit f962932206 ("Implement C23 asinpi")
- commit ffe79c446c ("Implement C23 atanpi")
- commit 3374de9038 ("Implement C23 atan2pi")
- commit a357d6273f ("math: Use atanf from CORE-MATH")
- commit 6f9bacf36b ("math: Use atan2f from CORE-MATH")
- commit e5ca265a9c ("new inputs with large errors for [a]cospi,
  [a]sinpi, [a]tanpi, atan2pi")

Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
2025-01-02 20:46:24 +01:00
Florian Weimer cc74583f23 elf: Remove the remaining uses of GET_ADDR_OFFSET
Expand the macro where it is used in static definitions of
__tls_get_addr.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-01-02 13:45:27 +01:00
Paul Eggert 2642002380 Update copyright dates with scripts/update-copyrights 2025-01-01 11:22:09 -08:00
Adhemerval Zanella 0e0be3ed80 math: Use tanhf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic tanhf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      51.5273        41.0951        20.25%
x86_64v2                    47.7021        39.1526        17.92%
x86_64v3                    45.0373        34.2737        23.90%
i686                       133.9970        83.8596        37.42%
aarch64 (Neoverse)          21.5439        14.7961        31.32%
power10                     13.3301         8.4406        36.68%

reciprocal-throughput        master        patched   improvement
x86_64                      24.9493        12.8547        48.48%
x86_64v2                    20.7051        12.7761        38.29%
x86_64v3                    19.2492        11.0851        42.41%
i686                        78.6498        29.8211        62.08%
aarch64 (Neoverse)          11.6026        7.11487        38.68%
power10                      6.3328         2.8746        54.61%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 1751c0519a math: Use sinhf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic sinhf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      52.6819        49.1489         6.71%
x86_64v2                    49.1162        42.9447        12.57%
x86_64v3                    46.9732        39.9157        15.02%
i686                       141.1470       129.6410         8.15%
aarch64 (Neoverse)          20.8539        17.1288        17.86%
power10                     14.5258        9.1906         36.73%

reciprocal-throughput        master        patched   improvement
x86_64                      27.5553        23.9395        13.12%
x86_64v2                    21.6423        20.3219         6.10%
x86_64v3                    21.4842        16.0224        25.42%
i686                        87.9709        86.1626         2.06%
aarch64 (Neoverse)          15.1919        12.2744        19.20%
power10                      7.2188         5.2611        27.12%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 9583836785 math: Use coshf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode),
although it should worse performance than current one.  The current
implementation performance comes mainly from the internal usage of
the optimize expf implementation, and shows a maximum ULPs of 2 for
FE_TONEAREST and 3 for other rounding modes.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      40.6995        49.0737       -20.58%
x86_64v2                    40.5841        44.3604        -9.30%
x86_64v3                    39.3879        39.7502        -0.92%
i686                       112.3380       129.8570       -15.59%
aarch64 (Neoverse)          18.6914        17.0946         8.54%
power10                     11.1343        9.3245         16.25%

reciprocal-throughput        master        patched   improvement
x86_64                      18.6471        24.1077       -29.28%
x86_64v2                    17.7501        20.2946       -14.34%
x86_64v3                    17.8262        17.1877         3.58%
i686                        64.1454        86.5645       -34.95%
aarch64 (Neoverse)          9.77226        12.2314       -25.16%
power10                      4.0200        5.3316        -32.63%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 7cfd8b5698 math: Use atanhf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic atanhf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      59.4930        45.8568        22.92%
x86_64v2                    59.5705        45.5804        23.48%
x86_64v3                    53.1838        37.7155        29.08%
i686                        169.354       133.5940        21.12%
aarch64 (Neoverse)          26.0781        16.9829        34.88%
power10                     15.6591        10.7623        31.27%

reciprocal-throughput        master        patched   improvement
x86_64                      23.5903        18.5766        21.25%
x86_64v2                    22.6489        18.2683        19.34%
x86_64v3                    19.0401        13.9474        26.75%
i686                        97.6034       107.3260        -9.96%
aarch64 (Neoverse)          15.3664        9.57846        37.67%
power10                      6.8877        4.6242         32.86%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 6f9bacf36b math: Use atan2f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic atan2f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      68.1175        69.2014        -1.59%
x86_64v2                    66.9884        66.0081         1.46%
x86_64v3                    57.7034        61.6407        -6.82%
i686                       189.8690        152.7560       19.55%
aarch64 (Neoverse)          32.6151        24.5382        24.76%
power10                     21.7282        17.1896        20.89%

reciprocal-throughput        master        patched   improvement
x86_64                      34.5202        31.6155         8.41%
x86_64v2                    32.6379        30.3372         7.05%
x86_64v3                    34.3677        23.6455        31.20%
i686                       157.7290        75.8308        51.92%
aarch64 (Neoverse)          27.7788        16.2671        41.44%
power10                     15.5715         8.1588        47.60%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella a357d6273f math: Use atanf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic atanf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      56.8265        53.6842         5.53%
x86_64v2                    54.8177        53.6842         2.07%
x86_64v3                    46.2915        48.7034        -5.21%
i686                       158.3760        108.9560       31.20%
aarch64 (Neoverse)           21.687        20.5893         5.06%
power10                     13.1903        13.5012        -2.36%

reciprocal-throughput        master        patched   improvement
x86_64                      16.6787        16.7601        -0.49%
x86_64v2                    16.6983        16.7601        -0.37%
x86_64v3                    16.2268        12.1391        25.19%
i686                       138.6840        36.0640        74.00%
aarch64 (Neoverse)          11.8012        10.3565        12.24%
power10                      5.3212         4.2894        19.39%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella ed608a40e2 math: Use asinhf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic asinhf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      64.5128        56.9717        11.69%
x86_64v2                    63.3065        57.2666         9.54%
x86_64v3                    62.8719        51.4170        18.22%
i686                       189.1630        137.635        27.24%
aarch64 (Neoverse)          25.3551        20.5757        18.85%
power10                     17.9712        13.3302        25.82%

reciprocal-throughput        master        patched   improvement
x86_64                      20.0844        15.4731        22.96%
x86_64v2                    19.2919        15.4000        20.17%
x86_64v3                    18.7226        11.9009        36.44%
i686                       103.7670        80.2681        22.65%
aarch64 (Neoverse)          12.5005        8.68969        30.49%
power10                      7.2220        5.03617        30.27%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>:
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 5fb4b566ef math: Use asinf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic asinf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      42.8237        35.2460        17.70%
x86_64v2                    43.3711        35.9406        17.13%
x86_64v3                    35.0335        30.5744        12.73%
i686                       213.8780        104.4710       51.15%
aarch64 (Neoverse)          17.2937        13.6025        21.34%
power10                     12.0227        7.4241         38.25%

reciprocal-throughput        master        patched   improvement
x86_64                      13.6770        15.5231       -13.50%
x86_64v2                    13.8722        16.0446       -15.66%
x86_64v3                    13.6211        13.2753         2.54%
i686                       186.7670        45.4388        75.67%
aarch64 (Neoverse)          9.96089        9.39285         5.70%
power10                      4.9862        3.7819         24.15%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 673e6fe110 math: Use acoshf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic acoshf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      61.2471        58.7742         4.04%
x86_64-v2                   62.6519        59.0523         5.75%
x86_64-v3                   58.7408        50.1393        14.64%
aarch64                     24.8580        21.3317        14.19%
power10                     17.0469        13.1345        22.95%

reciprocal-throughput        master        patched   improvement
x86_64                      16.1618        15.1864         6.04%
x86_64-v2                   15.7729        14.7563         6.45%
x86_64-v3                   14.1669        11.9568        15.60%
aarch64                      10.911        9.5486         12.49%
power10                     6.38196        5.06734        20.60%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 66fa7ad437 math: Use acosf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic acosf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      52.5098        36.6312        30.24%
x86_64v2                    53.0217        37.3091        29.63%
x86_64v3                    42.8501        32.3977        24.39%
i686                       207.3960       109.4000        47.25%
aarch64                     21.3694        13.7871        35.48%
power10                     14.5542         7.2891        49.92%

reciprocal-throughput        master        patched   improvement
x86_64                      14.1487        15.9508       -12.74%
x86_64v2                    14.3293        16.1899       -12.98%
x86_64v3                    13.6563        12.6161         7.62%
i686                       158.4060        45.7354        71.13%
aarch64                     12.5515        9.19233        26.76%
power10                      5.7868         3.3487        42.13%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Michael Jeanson 8dd1588794 nptl: Add <thread_pointer.h> for RISC-V
This will be required by the rseq extensible ABI implementation on all
Linux architectures exposing the '__rseq_size' and '__rseq_offset'
symbols to set the initial value of the 'cpu_id' field which can be used
by applications to test if rseq is available and registered. As long as
the symbols are exposed it is valid for an application to perform this
test even if rseq is not yet implemented in libc for this architecture.

Both code paths tested on a Visionfive 2 with Debian sid.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-12-09 13:26:55 -05:00
Adhemerval Zanella 17a43505b3 elf: Consolidate stackinfo.h
And use sane default the generic implementation.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
2024-12-02 17:14:58 +00:00
Julian Zhu 32445b6dd2 RISC-V: Use builtin for fma and fmaf
The built-in functions `builtin_{fma, fmaf}` are sufficient to generate correct `fmadd.d`/`fmadd.s` instructions on RISC-V.

Signed-off-by: Julian Zhu <jz531210@gmail.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-11-25 16:45:59 -03:00
Julian Zhu d2264de5db RISC-V: Use builtin for copysign and copysignf
The built-in functions `builtin_{copysign, copysignf}` are sufficient to generate correct `fsgnj.d/fsgnj.s` instructions on RISC-V.

Signed-off-by: Julian Zhu <jz531210@gmail.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-11-25 16:45:59 -03:00
Adhemerval Zanella bccb0648ea math: Use tanf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic tanf.

The code was adapted to glibc style, to use the definition of
math_config.h, to remove errno handling, and to use a generic
128 bit routine for ABIs that do not support it natively.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (neoverse1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master       patched  improvement
x86_64                       82.3961       54.8052       33.49%
x86_64v2                     82.3415       54.8052       33.44%
x86_64v3                     69.3661       50.4864       27.22%
i686                         219.271       45.5396       79.23%
aarch64                      29.2127       19.1951       34.29%
power10                      19.5060       16.2760       16.56%

reciprocal-throughput         master       patched  improvement
x86_64                       28.3976       19.7334       30.51%
x86_64v2                     28.4568       19.7334       30.65%
x86_64v3                     21.1815       16.1811       23.61%
i686                         105.016       15.1426       85.58%
aarch64                      18.1573       10.7681       40.70%
power10                       8.7207        8.7097        0.13%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-22 10:52:27 -03:00
Adhemerval Zanella d846f4c12d math: Use lgammaf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic lgammaf.

The code was adapted to glibc style, to use the definition of
math_config.h, to remove errno handling, to use math_narrow_eval
on overflow usage, and to adapt to make it reentrant.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master       patched  improvement
x86_64                       86.5609       70.3278       18.75%
x86_64v2                     78.3030       69.9709       10.64%
x86_64v3                     74.7470       59.8457       19.94%
i686                         387.355       229.761       40.68%
aarch64                      40.8341       33.7563       17.33%
power10                      26.5520       16.1672       39.11%
powerpc                      28.3145       17.0625       39.74%

reciprocal-throughput         master       patched  improvement
x86_64                       68.0461       48.3098       29.00%
x86_64v2                     55.3256       47.2476       14.60%
x86_64v3                     52.3015       38.9028       25.62%
i686                         340.848       195.707       42.58%
aarch64                      36.8000       30.5234       17.06%
power10                      20.4043       12.6268       38.12%
powerpc                      22.6588       13.8866       38.71%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-22 10:52:27 -03:00
Adhemerval Zanella baa495f231 math: Use erfcf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic erfcf.

The code was adapted to glibc style and to use the definition of
math_config.h.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master       patched  improvement
x86_64                       98.8796       66.2142       33.04%
x86_64v2                     98.9617       67.4221       31.87%
x86_64v3                     87.4161       53.1754       39.17%
aarch64                      33.8336       22.0781       34.75%
power10                      21.1750       13.5864       35.84%
powerpc                      21.4694       13.8149       35.65%

reciprocal-throughput         master       patched  improvement
x86_64                       48.5620       27.6731       43.01%
x86_64v2                     47.9497       28.3804       40.81%
x86_64v3                     42.0255       18.1355       56.85%
aarch64                      24.3938       13.4041       45.05%
power10                      10.4919        6.1881       41.02%
powerpc                       11.763       6.76468       42.49%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-22 10:52:27 -03:00
Adhemerval Zanella 994fec2397 math: Use erff from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic erff.

The code was adapted to glibc style and to use the definition of
math_config.h.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master       patched  improvement
x86_64                       85.7363       45.1372       47.35%
x86_64v2                     86.6337       38.5816       55.47%
x86_64v3                     71.3810       34.0843       52.25%
i686                         190.143       97.5014       48.72%
aarch64                      34.9091       14.9320       57.23%
power10                      38.6160        8.5188       77.94%
powerpc                      39.7446       8.45781       78.72%

reciprocal-throughput         master       patched  improvement
x86_64                       35.1739       14.7603       58.04%
x86_64v2                     34.5976       11.2283       67.55%
x86_64v3                     27.3260        9.8550       63.94%
i686                         91.0282       30.8840       66.07%
aarch64                      22.5831        6.9615       69.17%
power10                      18.0386        3.0918       82.86%
powerpc                      20.7277       3.63396       82.47%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-22 10:52:27 -03:00
Adhemerval Zanella c5d241f06b math: Use cbrtf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic cbrtf.

The code was adapted to glibc style and to use the definition of
math_config.h.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master        patched       improvement
x86_64                       68.6348        36.8908            46.25%
x86_64v2                     67.3418        36.6968            45.51%
x86_64v3                     63.4981        32.7859            48.37%
aarch64                      29.3172        12.1496            58.56%
power10                      18.0845         8.8893            50.85%
powerpc                      18.0859        8.79527            51.37%

reciprocal-throughput         master        patched       improvement
x86_64                       36.4369        13.3565            63.34%
x86_64v2                     37.3611        13.1149            64.90%
x86_64v3                     31.6024        11.2102            64.53%
aarch64                      18.6866        7.3474             60.68%
power10                       9.4758        3.6329             61.66%
powerpc                      9.58896        3.90439            59.28%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-11-22 10:01:03 -03:00
Adhemerval Zanella f338c7c5f5 math: Use log10p1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic log10p1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      68.5251        32.2627        52.92%
x86_64v2                    68.8912        32.7887        52.41%
x86_64v3                    59.3427        27.0521        54.41%
i686                        162.026        103.383        36.19%
aarch64                     26.8513        14.5695        45.74%
power10                     12.7426         8.4929        33.35%
powerpc                     16.6768        9.29135        44.29%

reciprocal-throughput        master        patched   improvement
x86_64                      26.0969        12.4023        52.48%
x86_64v2                    25.0045        11.0748        55.71%
x86_64v3                    20.5610        10.2995        49.91%
i686                        89.8842        78.5211        12.64%
aarch64                     17.1200         9.4832        44.61%
power10                      6.7814         6.4258         5.24%
powerpc                      15.769         7.6825        51.28%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:40 -03:00
Adhemerval Zanella 8ae9e51376 math: Use log1pf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic log1pf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      71.8142        38.9668        45.74%
x86_64v2                    71.9094        39.1321        45.58%
x86_64v3                    60.1000        32.4016        46.09%
i686                        147.105        104.258        29.13%
aarch64                     26.4439        14.0050        47.04%
power10                     19.4874         9.4146        51.69%
powerpc                     17.6145        8.00736        54.54%

reciprocal-throughput        master        patched   improvement
x86_64                      19.7604        12.7254        35.60%
x86_64v2                    19.0039        11.9455        37.14%
x86_64v3                    16.8559        11.9317        29.21%
i686                        82.3426        73.9718        10.17%
aarch64                     14.4665         7.9614        44.97%
power10                     11.9974         8.4117        29.89%
powerpc                     7.15222         6.0914        14.83%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:39 -03:00