Commit Graph

1383 Commits

Author SHA1 Message Date
Adhemerval Zanella 1abdb38135 math: Handle fabsf128 !__USE_EXTERN_INLINES
Work around the clang limitation wrt inline functions and attribute
definitions, where it does not allow adding a new attribute if a
function is already defined:

clang on x86_64 fails to build s_fabsf128.c with:

../sysdeps/ieee754/float128/../ldbl-128/s_fabsl.c:32:1: error: attribute declaration must precede definition [-Werror,-Wignored-attributes]
   32 | libm_alias_ldouble (__fabs, fabs)
      | ^
../sysdeps/generic/libm-alias-ldouble.h:63:38: note: expanded from macro 'libm_alias_ldouble'
   63 | #define libm_alias_ldouble(from, to) libm_alias_ldouble_r (from, to, )
      |                                      ^
../sysdeps/ieee754/float128/float128_private.h:133:43: note: expanded from macro 'libm_alias_ldouble_r'
  133 | #define libm_alias_ldouble_r(from, to, r) libm_alias_float128_r (from, to, r)
      |                                           ^
../sysdeps/ieee754/float128/s_fabsf128.c:5:3: note: expanded from macro 'libm_alias_float128_r'
    5 |   static_weak_alias (from ## f128 ## r, to ## f128 ## r);       \
      |   ^
./../include/libc-symbols.h:166:46: note: expanded from macro 'static_weak_alias'
  166 | #  define static_weak_alias(name, aliasname) weak_alias (name, aliasname)
      |                                              ^
./../include/libc-symbols.h:154:38: note: expanded from macro 'weak_alias'
  154 | # define weak_alias(name, aliasname) _weak_alias (name, aliasname)
      |                                      ^
./../include/libc-symbols.h:156:52: note: expanded from macro '_weak_alias'
  156 |   extern __typeof (name) aliasname __attribute__ ((weak, alias (#name))) \
      |                                                    ^
../include/math.h:134:1: note: previous definition is here
  134 | fabsf128 (_Float128 x)

If the compiler does not support __USE_EXTERN_INLINES, we need to route
the fabsf128 call to an internal symbol.
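
A minimal sketch of the conflicting pattern (hypothetical, reduced from
the macro expansion above; the inline definition stands in for the one
math.h provides, the alias declaration for the weak_alias expansion):

  /* Inline definition, as provided by math.h under __USE_EXTERN_INLINES.  */
  extern inline _Float128 fabsf128 (_Float128 x)
  { return __builtin_fabsf128 (x); }

  /* Alias declaration later expanded from weak_alias; clang rejects
     attaching the attribute once the definition above has been seen.  */
  extern _Float128 __fabsf128 (_Float128);
  extern __typeof (__fabsf128) fabsf128
    __attribute__ ((weak, alias ("__fabsf128")));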
2025-11-17 11:17:07 -03:00
Adhemerval Zanella 13cfd77bf5 math: Don't redirect inlined builtin math functions
When we want to inline builtin math functions, like truncf, given

  extern float truncf (float __x) __attribute__ ((__nothrow__ )) __attribute__ ((__const__));
  extern float __truncf (float __x) __attribute__ ((__nothrow__ )) __attribute__ ((__const__));

  float (truncf) (float) asm ("__truncf");

the compiler may redirect truncf calls to __truncf instead of inlining
them (clang does this, for instance).  USE_TRUNCF_BUILTIN is 1 to
indicate that truncf should be inlined; in that case, we don't want the
truncf redirection:

  1. For each math function which may be inlined, we define

  #if USE_TRUNCF_BUILTIN
  # define NO_truncf_BUILTIN inline_truncf
  #else
  # define NO_truncf_BUILTIN truncf
  #endif

in <math-use-builtins.h>.

  2. Include <math-use-builtins.h> in include/math.h.

  3. Change MATH_REDIRECT to

   #define MATH_REDIRECT(FUNC, PREFIX, ARGS)		\
    float (NO_ ## FUNC ## f ## _BUILTIN) (ARGS (float))	\
      asm (PREFIX #FUNC "f");

With this change, if USE_TRUNCF_BUILTIN is 0, we get

  float (truncf) (float) asm ("__truncf");

and truncf will be redirected to __truncf.

And if USE_TRUNCF_BUILTIN is 1, we get:

  float (inline_truncf) (float) asm ("__truncf");

In both cases either truncf will be inlined or the internal alias
(__truncf) will be called.

This is not required for every math-use-builtins symbol, only the ones
declared in math.h.  It also allows removing all the explicit
math-use-builtins inclusions, since the header is now implicitly
included by math.h.

For MIPS, some math-use-builtins headers include sysdep.h, which in
turn pulls in a lot of extra headers that do not allow the ldbl-128
code to override alias definitions (math.h will include some stdlib.h
definitions).  The math-use-builtins headers only require
__mips_isa_rev, so move its definition to sgidefs.h.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Co-authored-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-11-17 11:17:07 -03:00
Osama Abdelkader 4f18501498 math: Optimize frexpl (intel96) with fast path for normal numbers
Add fast path optimization for frexpl (80-bit x87 extended precision) using
a single unsigned comparison to identify normal floating-point numbers and
return immediately via arithmetic on the exponent field.

The implementation uses arithmetic operations (se - ex) to
adjust the exponent directly, which is simpler than bit masking. For subnormals,
the traditional multiply-based normalization is retained as it handles the
split word format more reliably.

The zero/infinity/NaN check groups these special cases together for better
branch prediction.
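
A hedged sketch of the fast path (the union layout and function name
are illustrative, assuming the little-endian x86 80-bit layout):

  #include <stdint.h>

  /* Illustrative layout: 64-bit mantissa, then the 16-bit
     sign+exponent word (se).  */
  union ldshape { long double f; struct { uint64_t m; uint16_t se; } i; };

  long double
  frexpl_sketch (long double x, int *eptr)
  {
    union ldshape u = { x };
    unsigned int ex = u.i.se & 0x7fff;
    if (ex - 1u < 0x7ffeu)            /* single unsigned compare: normal */
      {
        *eptr = (int) ex - 0x3ffe;    /* |result| lands in [0.5, 1) */
        u.i.se = u.i.se - ex + 0x3ffe;  /* (se - ex) keeps the sign bit */
        return u.f;
      }
    /* Zero, infinity and NaN return unchanged with *eptr == 0; the
       subnormal slow path (multiply-based normalization) is elided.  */
    *eptr = 0;
    return x;
  }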

Benchmark results on Intel Core i9-13900H (13th Gen):
  Baseline:     25.543 ns/op
  Optimized:    25.531 ns/op
  Speedup:      1.00x (neutral)
  Zero:         17.774 ns/op
  Denormal:     23.900 ns/op

Signed-off-by: Osama Abdelkader <osama.abdelkader@gmail.com>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-14 19:52:38 +00:00
Joseph Myers 1f79bc4838 Change fromfp functions to return floating types following C23 (bug 28327)
As discussed in bug 28327, C23 changed the fromfp functions to return
floating types instead of intmax_t / uintmax_t.  (Although the
motivation in N2548 was reducing the use of intmax_t in library
interfaces, the new version does have the advantage of being able to
specify arbitrary integer widths for e.g. assigning the result to a
_BitInt, as well as being able to indicate an error case in-band with
a NaN return.)
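
For illustration, the change for the double variant (a declaration
sketch, not the glibc header text):

  /* TS 18661-1 (remains only as a compat symbol):
     intmax_t fromfp (double x, int rnd, unsigned int width);  */

  /* C23: the result comes back in the floating type, so e.g. a NaN
     can report the error case in-band.  */
  double fromfp (double x, int rnd, unsigned int width);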

As with other such changes from interfaces introduced in TS 18661,
implement the new types as a replacement for the old ones, with the
old functions remaining as compat symbols but not supported as an API.
The test generator used for many of the tests is updated to handle
both versions of the functions.

Tested for x86_64 and x86, and with build-many-glibcs.py.

Also tested tgmath tests for x86_64 with GCC 7 to make sure that the
modified case for older compilers in <tgmath.h> does work.

Also tested for powerpc64le to cover the ldbl-128ibm implementation
and the other things that are handled differently for that
configuration.  The new tests fail for ibm128, but all the failures
relate to incorrect signs of zero results and turn out to arise from
bugs in the underlying roundl, ceill, truncl and floorl
implementations that I've reported in bug 33623, rather than
indicating any bug in the actual new implementation of the functions
for that format.  So given fixes for those functions (which shouldn't
be hard, and of course should add to the tests for those functions
rather than relying only on indirect testing via fromfp), the fromfp
tests should start passing for ibm128 as well.
2025-11-13 00:04:21 +00:00
Wilco Dijkstra 989e538224 math: Remove float_t and double_t [BZ #33563]
Remove uses of float_t and double_t.  They are not useful on modern
machines and do not help given GCC defaults to -fexcess-precision=fast.
One use of double_t remains to allow forcing the precision to double
on targets where FLT_EVAL_METHOD=2.  This fixes BZ #33563 on
i486-pc-linux-gnu.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-11-12 19:33:23 +00:00
Wilco Dijkstra 3b7bb7b2f2 math: Remove ldbl-128/s_fma.c
Remove ldbl-128/s_fma.c - it makes no sense to use emulated float128
operations to emulate FMA.  Benchmarking shows dbl-64/s_fma.c is about
twice as fast.  Remove redundant dbl-64/s_fma.c includes in targets
that were trying to work around this issue.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-11-12 18:57:29 +00:00
Osama Abdelkader e52d9542cd math: Optimize frexpl (binary128) with fast path for normal numbers
Add fast path optimization for frexpl (128-bit IEEE quad precision) using
a single unsigned comparison to identify normal floating-point numbers and
return immediately via arithmetic on the exponent field.

The implementation uses arithmetic operations hx = hx - (ex << 48)
to adjust the exponent in place, which is simpler and more efficient than
bit masking. For subnormals, the traditional multiply-based normalization
is retained for reliability with the split 64-bit word format.

The zero/infinity/NaN check groups these special cases together for better
branch prediction.

This optimization provides the same algorithmic improvements as the other
frexp variants while maintaining correctness for all edge cases.
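
A hedged sketch of the high-word manipulation (the function name and
byte order are illustrative; hx is assumed to hold the sign, the 15-bit
exponent, and the top 48 mantissa bits):

  #include <stdint.h>
  #include <string.h>

  _Float128
  frexpf128_sketch (_Float128 x, int *eptr)
  {
    uint64_t w[2];
    memcpy (w, &x, sizeof w);         /* assumes little-endian storage */
    uint64_t hx = w[1];
    uint64_t ex = (hx >> 48) & 0x7fff;
    if (ex - 1 < 0x7ffe)              /* single unsigned compare: normal */
      {
        *eptr = (int) ex - 0x3ffe;
        /* hx = hx - (ex << 48) clears the biased exponent; adding the
           target bias back leaves |x| in [0.5, 1).  */
        w[1] = hx - ((ex - 0x3ffe) << 48);
        memcpy (&x, w, sizeof x);
        return x;
      }
    /* Zero/inf/NaN and subnormals: original slow paths, elided here.  */
    *eptr = 0;
    return x;
  }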

Signed-off-by: Osama Abdelkader <osama.abdelkader@gmail.com>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-10 08:58:19 -03:00
Osama Abdelkader e05476b5c8 math: Optimize frexp (binary64) with fast path for normal numbers
Add fast path optimization for frexp using a single unsigned comparison
to identify normal floating-point numbers and return immediately via
arithmetic on the bit representation.

The implementation uses asuint64()/asdouble() from math_config.h and arithmetic
operations to adjust the exponent, which generates better code than bit masking
on ARM and RISC-V architectures. For subnormals, stdc_leading_zeros provides
faster normalization than the traditional multiply approach.

The zero/infinity/NaN check is simplified to (int64_t)(ix << 1) <= 0, which
is more efficient than separate comparisons.
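
A hedged sketch of the described approach (asuint64/asdouble stand in
for the math_config.h helpers; the subnormal path here keeps the
traditional multiply, whereas the patch uses stdc_leading_zeros):

  #include <stdint.h>
  #include <string.h>

  static inline uint64_t asuint64 (double x)
  { uint64_t u; memcpy (&u, &x, sizeof u); return u; }
  static inline double asdouble (uint64_t u)
  { double x; memcpy (&x, &u, sizeof x); return x; }

  double
  frexp_sketch (double x, int *eptr)
  {
    uint64_t ix = asuint64 (x);
    uint64_t ex = (ix >> 52) & 0x7ff;
    if (ex - 1 < 0x7fe)               /* single unsigned compare: normal */
      {
        *eptr = (int) ex - 0x3fe;     /* |result| lands in [0.5, 1) */
        return asdouble (ix - ((ex - 0x3fe) << 52));
      }
    if ((int64_t) (ix << 1) <= 0)     /* zero, infinity, or NaN */
      {
        *eptr = 0;
        return x;
      }
    /* Subnormal: the patch normalizes via stdc_leading_zeros; this
       sketch keeps the traditional multiply for brevity.  */
    x = frexp_sketch (x * 0x1p64, eptr);
    *eptr -= 64;
    return x;
  }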

Benchmark results on Intel Core i9-13900H (13th Gen):
  Baseline:     6.778 ns/op
  Optimized:    4.007 ns/op
  Speedup:      1.69x (40.9% faster)
  Zero:         3.580 ns/op (fast path)
  Denormal:     6.096 ns/op (slower, rare case)

Signed-off-by: Osama Abdelkader <osama.abdelkader@gmail.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-11-10 08:58:18 -03:00
Osama Abdelkader 4d2582150e math: Optimize frexpf (binary32) with fast path for normal numbers
Add fast path optimization for frexpf using a single unsigned comparison
to identify normal floating-point numbers and return immediately via
arithmetic on the bit representation.

The implementation uses asuint()/asfloat() from math_config.h and arithmetic
operations to adjust the exponent, which generates better code than bit masking
on ARM and RISC-V architectures. For subnormals, stdc_leading_zeros provides
faster normalization than the traditional multiply approach.

The zero/infinity/NaN check is simplified to (int32_t)(hx << 1) <= 0, which
is more efficient than separate comparisons.

Benchmark results on Intel Core i9-13900H (13th Gen):
  Baseline:     5.858 ns/op
  Optimized:    4.003 ns/op
  Speedup:      1.46x (31.7% faster)
  Zero:         3.580 ns/op (fast path)
  Denormal:     5.597 ns/op (slower, rare case)

Signed-off-by: Osama Abdelkader <osama.abdelkader@gmail.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-11-10 08:58:18 -03:00
Adhemerval Zanella b983c854e6 math: Sync acosh from CORE-MATH
The c9abdf80 fix handle some cases for RNDZ.

Checked on x86_64-linux-gnu.
2025-11-10 08:58:14 -03:00
Adhemerval Zanella 3078358ac6 math: Remove the SVID error handling from tgammaf
It improves latency by about 1.5% and throughput by about 2-4%.
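
For context, the removed wrapper had roughly this shape (an
illustrative sketch, not the exact glibc source; the internal entry
point and error code are hypothetical, the other names come from
glibc's math-svid-compat.h):

  #include <math.h>
  #include <math-svid-compat.h>   /* _LIB_VERSION, __kernel_standard_f */

  extern float __ieee754_tgammaf (float);  /* hypothetical internal name */

  float
  tgammaf_wrapper_sketch (float x)
  {
    float y = __ieee754_tgammaf (x);
    if (!isfinite (y) && isfinite (x) && _LIB_VERSION != _IEEE_)
      /* SVID-style matherr reporting via the compat machinery.  */
      return __kernel_standard_f (x, x, 40);  /* hypothetical error code */
    return y;
  }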

Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-05 10:19:37 -03:00
Adhemerval Zanella de0e623434 math: Remove the SVID error handling from lgammaf/lgammaf_r
It improves latency and throughput by about 2%.

Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-05 09:27:07 -03:00
Adhemerval Zanella 7ec8eb5676 math: Remove the SVID error handling from atan2f
It improves latency by about 3-6% and throughput by about 5-12%.

Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-05 07:15:52 -03:00
Joseph Myers 26e4810210 Rename fromfp files in preparation for changing types for C23
As discussed in bug 28327, the fromfp functions changed type in C23
(compared to the version in TS 18661-1); they now return the same type
as the floating-point argument, instead of intmax_t / uintmax_t.

As with other such incompatible changes compared to the initial TS
18661 versions of interfaces (the types of totalorder functions, in
particular), it seems appropriate to support only the new version as
an API, not the old one (although many programs written for the old
API might in fact work with the new one as well).  Thus, the existing
implementations should become compat symbols.  They are sufficiently
different from how I'd expect to implement the new version that using
separate implementations in separate files is more convenient than
trying to share code, and directly sharing testcases would be
problematic as well.

Rename the existing fromfp implementation and test files to names
reflecting how they're intended to become compat symbols, so freeing
up the existing filenames for a subsequent implementation of the C23
versions of these functions (which is the point at which the existing
implementations would actually become compat symbols).

gen-fromfp-tests.py and gen-fromfp-tests-inputs are not renamed; I
think it will make sense to adapt the test generator to be able to
generate most tests for both versions of the functions (with extra
test inputs added that are only of interest with the C23 version).
The ldbl-opt/nldbl-* files are also not renamed; since those are for a
static-only library, no compat versions are needed, and they'll just
have their contents changed when the C23 version is implemented.

Tested for x86_64, and with build-many-glibcs.py.
2025-11-04 23:41:35 +00:00
Adhemerval Zanella 0dfc849eff math: Remove the SVID error handling wrapper from sqrt
i386 and m68k architectures should use math-use-builtins-sqrt.h rather
than relying on architecture-specific or inline assembly implementations.

The PowerPC optimization for PPC 601/603 (30 years old) is removed.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Adhemerval Zanella f27a146409 math: Remove the SVID error handling from sinhf
It improves latency by about 3-10% and throughput by about 5-15%.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Adhemerval Zanella 0e1a1178ee math: Remove the SVID error handling from remainder
The optimized i386 version is faster than the generic one, and
gcc implements it through the builtin. This optimization enables
us to migrate the implementation to a C version.  The performance
on a Zen3 chip is similar to the SVID one.

The m68k provided an optimized version through __m81_u(remainderf)
(mathimpl.h), and gcc does not implement it through a builtin
(unlike i386).

Performance improves a bit on x86_64 (Zen3, gcc 15.2.1):

reciprocal-throughput           input    master   NO-SVID  improvement
x86_64                     subnormals   18.8522   16.2506       13.80%
x86_64                         normal  421.8260  403.9270        4.24%
x86_64                 close-exponent   21.0579   18.7642       10.89%
i686                       subnormals   21.3443   21.4229       -0.37%
i686                           normal  525.8380   538.807       -2.47%
i686                   close-exponent   21.6589   21.7983       -0.64%

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Adhemerval Zanella c4c6c79d70 math: Remove the SVID error handling from remainderf
The optimized i386 version is faster than the generic one, and gcc
implements it through the builtin.  This optimization enables us to
migrate the implementation to a C version.  The performance on a Zen3
chip is similar to the SVID one.

The m68k provided an optimized version through __m81_u(remainderf)
(mathimpl.h), and gcc does not implement it through a builtin (unlike
i386).

Performance improves a bit on x86_64 (Zen3, gcc 15.2.1):

reciprocal-throughput          input   master  NO-SVID  improvement
x86_64                    subnormals  17.5349  15.6125       10.96%
x86_64                        normal  53.8134  52.5754        2.30%
x86_64                close-exponent  20.0211  18.6656        6.77%
i686                      subnormals  21.8105  20.1856        7.45%
i686                          normal  73.1945  71.2199        2.70%
i686                  close-exponent  22.2141   20.331        8.48%

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Wilco Dijkstra 0212fc23b0 math: Fix pow special case [BZ #33563]
Fix pow (DBL_MAX, 1.0) to return DBL_MAX when rounding upwards without FMA.
This fixes BZ #33563.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-10-31 19:13:41 +00:00
Wilco Dijkstra 8917bd3eb3 math: Fix powf special case [BZ #33563]
Fix powf (0x1.fffffep+127, 1.0f) to return 0x1.fffffep+127 when
rounding upwards.  Clean up the special-case code - performance
improves by ~1.2%.  This fixes BZ #33563.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-10-31 19:12:47 +00:00
Collin Funk 3fe3f62833 Cleanup some recently added whitespace.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-10-30 18:56:58 -07:00
Adhemerval Zanella ee946212fe math: Remove the SVID error handling wrapper from yn/jn
Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-10-30 15:41:35 -03:00
Adhemerval Zanella 8d4815e6d7 math: Remove the SVID error handling wrapper from y1/j1
Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-10-30 15:41:33 -03:00
Adhemerval Zanella b050cb53b0 math: Remove the SVID error handling wrapper from y0/j0
Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-10-30 15:41:31 -03:00
Adhemerval Zanella 03eeeba705 math: Remove the SVID error handling from coshf
It improves latency by about 3-10% and throughput by about 5-15%.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-10-30 15:41:28 -03:00
Adhemerval Zanella 555c39c0fc math: Remove the SVID error handling from atanhf
It improves latency by about 1-10% and throughput by about 5-10%.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-10-30 15:41:26 -03:00
Adhemerval Zanella 8facb464b4 math: Remove the SVID error handling from acoshf
It improves latency by about 3-7% and throughput by about 5-10%.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-10-30 15:41:24 -03:00
Adhemerval Zanella f92aba68bc math: Remove the SVID error handling from asinf
It improves latency by about 2% and throughput by about 5%.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-10-30 15:41:22 -03:00
Adhemerval Zanella 9f8dea5b5d math: Remove the SVID error handling from acosf
It improves latency by about 2-10% and throughput by about 5-10%.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-10-30 15:41:20 -03:00
Adhemerval Zanella 0b484d7b77 math: Remove the SVID error handling from log10f
It improves latency by about 3-10% and throughput by about 5-10%.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-10-30 15:41:17 -03:00
Adhemerval Zanella 970364dac0 Annotate switch fall-through
clang defaults to warning on missing fall-through annotations and does
not support all the comment-like annotations that gcc does.  Use the
C23 [[fallthrough]] attribute instead.
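
A minimal example of the annotation (illustrative names):

  switch (n)
    {
    case 0:
      prepare ();
      [[fallthrough]];    /* explicit fall-through, no warning */
    case 1:
      handle ();
      break;
    }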

Reviewed-by: Collin Funk <collin.funk1@gmail.com>
2025-10-29 12:54:01 -03:00
Adhemerval Zanella 36b4c553e6 Replace count_leading_zeros with stdc_leading_zeros
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
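
A minimal illustration of the replacement (the longlong.h macro wrote
its result through an output argument; the C23 interface from
<stdbit.h> is type-generic):

  #include <stdbit.h>
  #include <stdint.h>

  static unsigned int
  clz64 (uint64_t x)
  {
    /* Was: count_leading_zeros (count, x);  from longlong.h.  */
    return stdc_leading_zeros (x);
  }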

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Collin Funk <collin.funk1@gmail.com>
2025-10-29 12:53:55 -03:00
Adhemerval Zanella 013f5167b9 math: Consolidate CORE-MATH double-double routines
For lgamma and tgamma the muldd, mulddd, and polydd are renamed
to muldd2, mulddd2, and polydd2 respectively.

Checked on aarch64-linux-gnu and x86_64-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-10-27 09:46:04 -03:00
Adhemerval Zanella e4d812c980 math: Consolidate erf/erfc definitions
The common code definitions are consolidated in s_erf_common.h
and s_erf_common.c.

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-10-27 09:46:01 -03:00
Adhemerval Zanella fc419290f9 math: Consolidate internal erf/erfc tables
The shared internal data definitions are consolidated in
s_erf_data.c, and the erfc-only ones are moved to s_erfc_data.c.

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-10-27 09:34:04 -03:00
Adhemerval Zanella acaad9ab06 math: Use erfc from CORE-MATH
The current implementation shows the following accuracy on three
ranges ([-DBL_MAX,-5], [-5,5], [5,DBL_MAX]) with 10e9 uniformly
distributed random numbers for each range (first column is the
accuracy in ULP, with '0' being correctly rounded, second is the
number of samples with the corresponding precision):

* Range [-DBL_MAX, -5]
 * FE_TONEAREST
     0:      10000000000 100.00%
 * FE_UPWARD
     0:      10000000000 100.00%
 * FE_DOWNWARD
     0:      10000000000 100.00%
 * FE_TOWARDZERO
     0:      10000000000 100.00%

* Range [-5, 5]
 * FE_TONEAREST
     0:       8069309665  80.69%
     1:       1882910247  18.83%
     2:         47485296   0.47%
     3:           293749   0.00%
     4:             1043   0.00%
 * FE_UPWARD
     0:       5540301026  55.40%
     1:       2026739127  20.27%
     2:       1774882486  17.75%
     3:        567324466   5.67%
     4:         86913847   0.87%
     5:          3820789   0.04%
     6:            18259   0.00%
 * FE_DOWNWARD
     0:       5520969586  55.21%
     1:       2057293099  20.57%
     2:       1778334818  17.78%
     3:        557521494   5.58%
     4:         82473927   0.82%
     5:          3393276   0.03%
     6:            13800   0.00%
 * FE_TOWARDZERO
     0:       6220287175  62.20%
     1:       2323846149  23.24%
     2:       1251999920  12.52%
     3:        190748245   1.91%
     4:         12996232   0.13%
     5:           122279   0.00%

* Range [5, DBL_MAX]
 * FE_TONEAREST
     0:      10000000000 100.00%
 * FE_UPWARD
     0:      10000000000 100.00%
 * FE_DOWNWARD
     0:      10000000000 100.00%
 * FE_TOWARDZERO
     0:      10000000000 100.00%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definitions from
math_config.h (to handle errno, overflow, and underflow).
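
For instance, underflow for large positive inputs is reported through
the math_config.h helpers instead of CORE-MATH's own machinery (a
sketch; the cutoff value and function name are illustrative):

  #include <stdint.h>

  /* Declared by glibc's internal math_config.h; both set errno to
     ERANGE and raise the corresponding floating-point exception.  */
  double __math_uflow (uint32_t sign);
  double __math_oflow (uint32_t sign);

  /* erfc(x) underflows to 0 for sufficiently large x.  */
  static double
  erfc_tail_sketch (double x)
  {
    if (x > 27.3)                   /* illustrative cutoff */
      return __math_uflow (0);
    /* ... polynomial evaluation elided ...  */
    return 0.0;
  }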

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:

reciprocal-throughput        master        patched   improvement
x86_64                      49.0980       267.0660      -443.94%
x86_64v2                    49.3220       257.6310      -422.34%
x86_64v3                    42.9539        84.9571       -97.79%
aarch64                     28.7266        52.9096       -84.18%
power10                     14.1673        25.1273       -77.36%

Latency                      master        patched   improvement
x86_64                      95.6640       269.7060      -181.93%
x86_64v2                    95.8296       260.4860      -171.82%
x86_64v3                    91.1658       112.7150       -23.64%
aarch64                     37.0745        58.6791       -58.27%
power10                     23.3197        31.5737       -35.39%

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-10-27 09:34:04 -03:00
Adhemerval Zanella 72a48e45bd math: Use erf from CORE-MATH
The current implementation shows the following accuracy on three
ranges ([-DBL_MAX, -4.2], [-4.2, 4.2], [4.2, DBL_MAX]) with 10e9
uniformly distributed random numbers for each range (first column is
the accuracy in ULP, with '0' being correctly rounded, second is the
number of samples with the corresponding precision):

* Range [-DBL_MAX, -4.2]
 * FE_TONEAREST
     0:      10000000000 100.00%
 * FE_UPWARD
     0:      10000000000 100.00%
 * FE_DOWNWARD
     0:      10000000000 100.00%
 * FE_TOWARDZERO
     0:      10000000000 100.00%

* Range [-4.2, 4.2]
 * FE_TONEAREST
     0:       9764404513  97.64%
     1:        235595487   2.36%
 * FE_UPWARD
     0:       9468013928  94.68%
     1:        531986072   5.32%
 * FE_DOWNWARD
     0:       9493787693  94.94%
     1:        506212307   5.06%
 * FE_TOWARDZERO
     0:       9585271351  95.85%
     1:        414728649   4.15%

* Range [4.2, DBL_MAX]
 * FE_TONEAREST
     0:      10000000000 100.00%
 * FE_UPWARD
     0:      10000000000 100.00%
 * FE_DOWNWARD
     0:      10000000000 100.00%
 * FE_TOWARDZERO
     0:      10000000000 100.00%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definitions from
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:

reciprocal-throughput        master       patched   improvement
x86_64                      38.2754       78.0311      -103.87%
x86_64v2                    38.3325       75.7555       -97.63%
x86_64v3                    34.6604       28.3182        18.30%
aarch64                     23.1499       21.4307         7.43%
power10                     12.3051       9.3766         23.80%

Latency                      master       patched   improvement
x86_64                      84.3062      121.3580       -43.95%
x86_64v2                    84.1817      117.4250       -39.49%
x86_64v3                    81.0933       70.6458        12.88%
aarch64                      35.012       29.5012        15.74%
power10                     21.7205       18.4589        15.02%

For x86_64/x86_64-v2, most of the performance hit came from the fma call
through the ifunc mechanism.

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-10-27 09:34:04 -03:00
Adhemerval Zanella 1cae0550e8 math: Use tgamma from CORE-MATH
The current implementation shows the following accuracy on the range
[-20,20] with 10e9 uniformly distributed random numbers (first column
is the accuracy in ULP, with '0' being correctly rounded, second is
the number of samples with the corresponding precision):

* Range [-20,20]
 * FE_TONEAREST
     0:       4504877808  45.05%
     1:       4402224940  44.02%
     2:        947652295   9.48%
     3:        131076831   1.31%
     4:         13222216   0.13%
     5:           910045   0.01%
     6:            35253   0.00%
     7:              606   0.00%
     8:                6   0.00%
 * FE_UPWARD
     0:       3477307921  34.77%
     1:       4838637866  48.39%
     2:       1413942684  14.14%
     3:        240762564   2.41%
     4:         27113094   0.27%
     5:          2130934   0.02%
     6:           102599   0.00%
     7:             2324   0.00%
     8:               14   0.00%
 * FE_DOWNWARD
     0:       3923545410  39.24%
     1:       4745067290  47.45%
     2:       1137899814  11.38%
     3:        171596912   1.72%
     4:         20013805   0.20%
     5:          1773899   0.02%
     6:            99911   0.00%
     7:             2928   0.00%
     8:               31   0.00%
 * FE_TOWARDZERO
     0:       3697160741  36.97%
     1:       4731951491  47.32%
     2:       1303092738  13.03%
     3:        231969191   2.32%
     4:         32344517   0.32%
     5:          3283092   0.03%
     6:           193010   0.00%
     7:             5175   0.00%
     8:               45   0.00%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definitions from
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:

reciprocal-throughput        master        patched   improvement
x86_64                     237.7960       175.4090        26.24%
x86_64v2                   232.9320       163.4460        29.83%
x86_64v3                   193.0680        89.7721        53.50%
aarch64                    113.6340        56.7350        50.07%
power10                     92.0617        26.6137        71.09%

Latency                      master        patched   improvement
x86_64                     266.7190       208.0130        22.01%
x86_64v2                   263.6070       200.0280        24.12%
x86_64v3                   214.0260       146.5180        31.54%
aarch64                    114.4760        58.5235        48.88%
power10                     84.3718        35.7473        57.63%

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-10-27 09:34:04 -03:00
Adhemerval Zanella d67d2f4688 math: Use lgamma from CORE-MATH
The current implementation shows the following accuracy on two ranges
([-20, 20] and [20, 0x5.d53649e2d4674p+1012]) with 10e9 uniformly
distributed random numbers for each range (first column is the
accuracy in ULP, with '0' being correctly rounded, second is the
number of samples with the corresponding precision):

* Range [-20, 20]
 * FE_TONEAREST
     0:       6701254075  67.01%
     1:       3230897408  32.31%
     2:         63986940   0.64%
     3:          3605417   0.04%
     4:           233189   0.00%
     5:            20973   0.00%
     6:             1869   0.00%
     7:              125   0.00%
     8:                4   0.00%
 * FE_UPWARD
     0:       4207428861  42.07%
     1:       5001137116  50.01%
     2:        740542213   7.41%
     3:         49116304   0.49%
     4:          1715617   0.02%
     5:            54464   0.00%
     6:             4956   0.00%
     7:              451   0.00%
     8:               16   0.00%
     9:                2   0.00%
 * FE_DOWNWARD
     0:       4155925193  41.56%
     1:       4989821364  49.90%
     2:        770312796   7.70%
     3:         72014726   0.72%
     4:         11040522   0.11%
     5:           872811   0.01%
     6:            12480   0.00%
     7:              106   0.00%
     8:                2   0.00%
 * FE_TOWARDZERO
     0:       4225861532  42.26%
     1:       5027051105  50.27%
     2:        706443411   7.06%
     3:         39877908   0.40%
     4:           713109   0.01%
     5:            47513   0.00%
     6:             4961   0.00%
     7:              438   0.00%
     8:               23   0.00%

* Range [20, 0x5.d53649e2d4674p+1012]
 * FE_TONEAREST
     0:       7262241995  72.62%
     1:       2737758005  27.38%
 * FE_UPWARD
     0:       4690392401  46.90%
     1:       5143728216  51.44%
     2:        165879383   1.66%
 * FE_DOWNWARD
     0:       4690333331  46.90%
     1:       5143794937  51.44%
     2:        165871732   1.66%
 * FE_TOWARDZERO
     0:       4690343071  46.90%
     1:       5143786761  51.44%
     2:        165870168   1.66%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definitions from
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:

reciprocal-throughput        master        patched   improvement
x86_64                     112.9740       135.8640       -20.26%
x86_64v2                   111.8910       131.7590       -17.76%
x86_64v3                   108.2800        68.0935        37.11%
aarch64                     61.3759        49.2403        19.77%
power10                     42.4483        24.1943        43.00%

Latency                      master        patched   improvement
x86_64                     144.0090       167.9750       -16.64%
x86_64v2                   139.2690       167.1900       -20.05%
x86_64v3                   130.1320        96.9347        25.51%
aarch64                     66.8538        53.2747        20.31%
power10                     49.5076        29.6917        40.03%

For x86_64/x86_64-v2, most of the performance hit came from the fma call
through the ifunc mechanism.

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-10-27 09:34:04 -03:00
Adhemerval Zanella 140e802cb3 math: Move atanh internal data to separate file
The internal data definitions are moved to s_atanh_data.c.
It helps on ABIs that build the implementation multiple times for
ifunc optimizations, like x86_64.

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-10-27 09:34:04 -03:00
Adhemerval Zanella cb8d1575b6 math: Consolidate acosh and asinh internal table
The shared internal data definitions are consolidated in
s_asincosh_data.c.

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-10-27 09:34:04 -03:00
Adhemerval Zanella 79b70fc09f math: Use atanh from CORE-MATH
The current implementation shows the following accuracy on the range
[-1,1] with 10e9 uniformly distributed random numbers (first column is
the accuracy in ULP, with '0' being correctly rounded, second is the
number of samples with the corresponding precision):

* Range [-1, 1]
 * FE_TONEAREST
     0:       8180011860  81.80%
     1:       1819865257  18.20%
     2:           122883   0.00%
 * FE_UPWARD
     0:       3903695744  39.04%
     1:       4992324465  49.92%
     2:       1096319340  10.96%
     3:          7660451   0.08%
 * FE_DOWNWARD
     0:       3904555484  39.05%
     1:       4991970864  49.92%
     2:       1095447471  10.95%
     3:          8026181   0.08%
 * FE_TOWARDZERO
     0:       7070209165  70.70%
     1:       2908447434  29.08%
     2:         21343401   0.21%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definitions from
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:

reciprocal-throughput        master        patched   improvement
x86_64                      26.4969        22.4625       15.23%
x86_64v2                    26.0792        22.9822       11.88%
x86_64v3                    25.6357        22.2147       13.34%
aarch64                     20.2295        19.7001        2.62%
power10                     10.0986         9.3846        7.07%

Latency                      master        patched   improvement
x86_64                      80.2311        59.9745       25.25%
x86_64v2                    79.7010        61.4066       22.95%
x86_64v3                    78.2679        58.5804       25.15%
aarch64                     34.3959        28.1523       18.15%
power10                     23.2417        18.2694       21.39%

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-10-27 09:34:04 -03:00
Adhemerval Zanella 30e66b085c math: Use asinh from CORE-MATH
The current implementation shows the following accuracy on three
different ranges ([-DBL_MAX, -10], [-10,10], and [10, DBL_MAX)) with
10e9 uniformly distributed random numbers for each range (first column
is the accuracy in ULP, with '0' being correctly rounded, second is
the number of samples with the corresponding precision):

* Range [-DBL_MAX, -10]
 * FE_TONEAREST
     0:       5164019099  51.64%
     1:       4835980901  48.36%
 * FE_UPWARD
     1:       4836053540  48.36%
     2:       5163946460  51.64%
 * FE_DOWNWARD
     1:       5163926134  51.64%
     2:       4836073866  48.36%
 * FE_TOWARDZERO
     0:       5163937001  51.64%
     1:       4836062999  48.36%

* Range [-10, 10)
 * FE_TONEAREST
     0:       8679029381  86.79%
     1:       1320934581  13.21%
     2:            36038   0.00%
 * FE_UPWARD
     0:       3965704277  39.66%
     1:       4993616710  49.94%
     2:       1039680225  10.40%
     3:           998788   0.01%
 * FE_DOWNWARD
     0:       3965806523  39.66%
     1:       4993534438  49.94%
     2:       1039601726  10.40%
     3:          1057313   0.01%
 * FE_TOWARDZERO
     0:       7734210130  77.34%
     1:       2261868439  22.62%
     2:          3921431   0.04%

* Range [10, DBL_MAX)
 * FE_TONEAREST
     0:       5163973212  51.64%
     1:       4836026788  48.36%
 * FE_UPWARD
     0:       4835991071  48.36%
     1:       5164008929  51.64%
 * FE_DOWNWARD
     0:       5163983594  51.64%
     1:       4836016406  48.36%
 * FE_TOWARDZERO
     0:       5163993394  51.64%
     1:       4836006606  48.36%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definitions from
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:

reciprocal-throughput        master        patched   improvement
x86_64                      26.5178        45.3754       -71.11%
x86_64v2                    26.3167        44.7870       -70.18%
x86_64v3                    25.9109        25.4887         1.63%
aarch64                     18.0555        17.3374         3.98%
power10                     19.8535        20.5586        -3.55%

Latency                      master        patched   improvement
x86_64                      82.6755        91.2026       -10.31%
x86_64v2                    82.4581        90.7152       -10.01%
x86_64v3                    80.7000        71.9454        10.85%
aarch64                     32.8320        28.8565        12.11%
power10                     44.5309        37.0096        16.89%

For x86_64/x86_64-v2, most of the performance hit came from the fma call
through the ifunc mechanism.

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-10-27 09:34:04 -03:00
Adhemerval Zanella d1509f2ce3 math: Use acosh from CORE-MATH
The current implementation shows the following accuracy on two
different ranges ([1,21) and [21, DBL_MAX)) with 10e9 uniformly
distributed random numbers (first column is the accuracy in ULP, with
'0' being correctly rounded, second is the number of samples with the
corresponding precision):

* Range [1,21)

 * FE_TONEAREST
    0:       8931139411  89.31%
    1:       1068697545  10.69%
    2:           163044   0.00%
 * FE_UPWARD
    0:       7936620731  79.37%
    1:       2062594522  20.63%
    2:           783977   0.01%
    3:              770   0.00%
 * FE_DOWNWARD
    0:       7936459794  79.36%
    1:       2062734117  20.63%
    2:           805312   0.01%
    3:              777   0.00%
 * FE_TOWARDZERO
    0:       7910345595  79.10%
    1:       2088584522  20.89%
    2:          1069106   0.01%
    3:              777   0.00%

* Range [21, DBL_MAX)
 * FE_TONEAREST
    0:       5163888431  51.64%
    1:       4836111569  48.36%
 * FE_UPWARD
    0:       4835951885  48.36%
    1:       5164048115  51.64%
 * FE_DOWNWARD
    0:       5164048432  51.64%
    1:       4835951568  48.36%
 * FE_TOWARDZERO
    0:       5164058042  51.64%
    1:       4835941958  48.36%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definitions from
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:

reciprocal-throughput        master       patched   improvement
x86_64                      20.9131       47.2187      -125.79%
x86_64v2                    20.8823       41.1042       -96.84%
x86_64v3                    19.0282       25.8045       -35.61%
aarch64                     14.7419       18.1535       -23.14%
power10                     8.98341       11.0423       -22.92%

Latency                      master       patched   improvement
x86_64                      75.5494       89.5979      -18.60%
x86_64v2                    74.4443       87.6292      -17.71%
x86_64v3                    71.8558       70.7086        1.60%
aarch64                     30.3361       29.2709        3.51%
power10                     20.9263       19.2482        8.02%

For x86_64/x86_64-v2, most of the performance hit came from the fma call
through the ifunc mechanism.

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-10-27 09:34:04 -03:00
Paul Zimmermann 48fde7b026 various fixes detected with -Wdouble-promotion
Changes with respect to v1:
- added a comment in e_j1f.c to explain that the use of float is enough
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-10-22 12:35:40 +02:00
Adhemerval Zanella 9d0b7ec87c math: Suppress clang -Wincompatible-library-redeclaration on s_llround
Clang issues:

  ../sysdeps/ieee754/dbl-64/s_llround.c:83:30: error: incompatible
  redeclaration of library function 'lround'
  [-Werror,-Wincompatible-library-redeclaration]
  libm_alias_double (__lround, lround)
                               ^
  ../sysdeps/ieee754/dbl-64/s_llround.c:83:30: note: 'lround' is a builtin
  with type 'long (double)'
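
One hedged way to silence it at the alias site (a plain-pragma sketch;
the in-tree fix may instead use the libc-diag.h helpers):

  #pragma clang diagnostic push
  #pragma clang diagnostic ignored "-Wincompatible-library-redeclaration"
  libm_alias_double (__lround, lround)
  #pragma clang diagnostic pop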

Reviewed-by: Sam James <sam@gentoo.org>
2025-10-21 09:24:27 -03:00
Adhemerval Zanella 407b2eea75 math: use fabs on __ieee754_lgamma_r
clang issues:

  ../sysdeps/ieee754/dbl-64/e_lgamma_r.c:234:29: error: absolute value function 'fabsf'
  given an argument of type 'double' but has parameter of type 'float' which may cause \
  truncation of value [-Werror,-Wabsolute-value]

It should not matter because the value is 0.0, but using fabs is
simpler than adding a warning suppression.

Reviewed-by: Sam James <sam@gentoo.org>
2025-10-21 09:24:24 -03:00
Adhemerval Zanella 76dfd91275 Suppress -Wmaybe-uninitialized only for gcc
The warning is not supported by clang.

Reviewed-by: Sam James <sam@gentoo.org>
2025-10-21 09:24:05 -03:00
Wilco Dijkstra 35807cc5cd math: Add builtin support for (l)lround(f)
Add builtin support for (l)lround(f) via the math-use-builtins
header mechanism.
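
A sketch of the mechanism (the macro name is assumed by analogy with
the existing USE_*_BUILTIN conventions):

  /* In a per-target math-use-builtins header:  */
  #define USE_LROUND_BUILTIN 1

  /* In the generic implementation:  */
  long int
  __lround (double x)
  {
  #if USE_LROUND_BUILTIN
    return __builtin_lround (x);
  #else
  # error "generic bit-manipulation implementation elided in this sketch"
  #endif
  }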

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-10-17 17:03:54 +00:00
Adhemerval Zanella 850d93f514 math: Use binary search on lgammaf slow path
And remove some unused entries of the fallback table.

Checked on x86_64-linux-gnu and aarch64-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-10-14 11:12:08 -03:00