glibc/sysdeps
Adhemerval Zanella 8a0152b61b math: New generic fmaf implementation
The current implementation relies on switching the rounding mode (to
FE_TOWARDZERO) for part of the calculation to obtain correctly rounded
results.  For most CPUs, this adds significant performance overhead
because it requires executing a typically slow instruction (to
get/set the floating-point status), necessitates flushing the
pipeline, and breaks some compiler assumptions/optimizations.

The original implementation adds tests to handle underflow in corner
cases, whereas this implementation uses a different strategy that
checks both the mantissa and the result to determine whether the
result is subject to double rounding.
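
To illustrate the kind of check described above, here is a minimal sketch.
It is not the code from this commit: the names fmaf_sketch, naive_fmaf,
bits_of, double_of and fmaf_sketch_demo are hypothetical. It assumes the
default round-to-nearest mode, a result in the normal float range, and it
ignores floating-point exception flags. The idea is that the product of two
floats is exact in double, so only the addition rounds. Double rounding can
then only matter when the double result sits exactly halfway between two
floats while the addition was inexact.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for bit access; not glibc internals.  */
static uint64_t bits_of (double d) { uint64_t u; memcpy (&u, &d, sizeof u); return u; }
static double double_of (uint64_t u) { double d; memcpy (&d, &u, sizeof d); return d; }

/* Naive version: rounds to double, then to float (double rounding).  */
float naive_fmaf (float x, float y, float z)
{
  return (float) ((double) x * (double) y + (double) z);
}

/* Sketch only.  Assumes round-to-nearest and a normal-range result,
   and does not try to raise the correct exception flags.  */
float fmaf_sketch (float x, float y, float z)
{
  double xy = (double) x * (double) y;  /* Exact: 24 + 24 bits fit in 53.  */
  double result = xy + (double) z;      /* The only rounding so far.  */
  uint64_t u = bits_of (result);
  int exponent = (int) ((u >> 52) & 0x7ff);

  /* A float keeps 23 of the 52 mantissa bits, so the low 29 bits decide
     the final rounding.  0x10000000 is the exact halfway pattern.  */
  if ((u & 0x1fffffff) != 0x10000000              /* Not a halfway case.  */
      || exponent == 0x7ff                        /* Inf or NaN.  */
      || (result - xy == z && result - z == xy))  /* Addition was exact.  */
    return (float) result;

  /* Halfway and inexact: recover the rounding error of the addition and
     move the double one ulp towards the true value, so that the final
     conversion to float rounds in the right direction.  */
  int neg = (int) (u >> 63);
  double err = (neg == (z > xy)) ? (xy - result) + z : (z - result) + xy;
  if (neg == (err < 0))
    u++;  /* True value has larger magnitude.  */
  else
    u--;  /* True value has smaller magnitude.  */
  return (float) double_of (u);
}

/* Usage example: 24929 * 673 = 2^24 + 1, which lies exactly halfway
   between the floats 16777216 and 16777218.  */
void fmaf_sketch_demo (void)
{
  assert (naive_fmaf (24929.0f, 673.0f, 0x1p-30f) == 16777216.0f);
  assert (fmaf_sketch (24929.0f, 673.0f, 0x1p-30f) == 16777218.0f);
}
```

In the demo, the tiny addend is lost when rounding to double, so the naive
version sees an exact tie and rounds to even, while the sketch notices the
inexact halfway case and rounds up. No rounding mode change is needed.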

I tested this implementation on various targets (x86_64, i686, arm,
aarch64, powerpc), on some of them with the compiler-generated fma
instructions manually disabled.

Performance-wise, it shows large improvements:

reciprocal-throughput       master       patched       improvement
x86_64 [1]                   58.09          7.96             7.33x
i686 [1]                    279.41         16.97            16.46x
aarch64 [2]                  26.09          4.10             6.35x
armhf [2]                    30.25          4.20             7.18x
powerpc [3]                   9.46          1.46             6.45x

latency                     master       patched       improvement
x86_64                       64.50         14.25             4.53x
i686                        304.39         61.04             4.99x
aarch64                      27.71          5.74             4.82x
armhf                        33.46          7.34             4.55x
powerpc                      10.96          2.65             4.13x

Checked on x86_64-linux-gnu and i686-linux-gnu with --disable-multi-arch,
and on arm-linux-gnueabihf.

[1] gcc 15.2.1, Zen3
[2] gcc 15.2.1, Neoverse N1
[3] gcc 15.2.1, POWER10

Signed-off-by: Szabolcs Nagy <nsz@gcc.gnu.org>
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Co-authored-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-11-27 14:52:25 -03:00
aarch64 aarch64: make GCS configure checks aarch64-only 2025-11-26 13:50:15 +00:00
alpha Add gmp-arch and udiv_qrnnd 2025-11-25 14:52:15 -03:00
arc
arm math: New generic fma implementation 2025-11-26 10:10:06 -03:00
csky
generic stdlib: Remove longlong.h 2025-11-26 10:10:06 -03:00
gnu
hppa Add gmp-arch and udiv_qrnnd 2025-11-25 14:52:15 -03:00
htl htl: move c11 symbols into libc. 2025-11-22 03:28:48 +01:00
hurd
i386 math: New generic fmaf implementation 2025-11-27 14:52:25 -03:00
ieee754 math: New generic fmaf implementation 2025-11-27 14:52:25 -03:00
loongarch Add add_ssaaaa and sub_ssaaaa to gmp-arch.h 2025-11-26 10:10:02 -03:00
m68k
mach htl: move c11 symbols into libc. 2025-11-22 03:28:48 +01:00
microblaze
mips math: Don't redirect inlined builtin math functions 2025-11-17 11:17:07 -03:00
nptl htl: move c11 symbols into libc. 2025-11-22 03:28:48 +01:00
or1k Remove TLS_TCB_ALIGN and TLS_INIT_TCB_ALIGN 2025-11-15 22:01:07 +01:00
posix
powerpc Add add_ssaaaa and sub_ssaaaa to gmp-arch.h 2025-11-26 10:10:02 -03:00
pthread htl: move c11 symbols into libc. 2025-11-22 03:28:48 +01:00
riscv Add add_ssaaaa and sub_ssaaaa to gmp-arch.h 2025-11-26 10:10:02 -03:00
s390 Remove support for lock elision. 2025-11-18 14:21:13 +01:00
sh
sparc Revert __HAVE_64B_ATOMICS configure check 2025-11-14 14:05:20 -03:00
unix Linux: Ignore PIDFD_GET_INFO in tst-pidfd-consts 2025-11-27 14:34:58 +01:00
wordsize-32 stdlib: Remove longlong.h 2025-11-26 10:10:06 -03:00
wordsize-64
x86 stdlib: Remove longlong.h 2025-11-26 10:10:06 -03:00
x86_64 x86: Fix strstr ifunc on clang 2025-11-17 11:17:07 -03:00