Work around the clang limitation with respect to inline functions and
attribute definitions, where it does not allow adding a new attribute if
a function is already defined:
clang on x86_64 fails to build s_fabsf128.c with:
../sysdeps/ieee754/float128/../ldbl-128/s_fabsl.c:32:1: error: attribute declaration must precede definition [-Werror,-Wignored-attributes]
32 | libm_alias_ldouble (__fabs, fabs)
| ^
../sysdeps/generic/libm-alias-ldouble.h:63:38: note: expanded from macro 'libm_alias_ldouble'
63 | #define libm_alias_ldouble(from, to) libm_alias_ldouble_r (from, to, )
| ^
../sysdeps/ieee754/float128/float128_private.h:133:43: note: expanded from macro 'libm_alias_ldouble_r'
133 | #define libm_alias_ldouble_r(from, to, r) libm_alias_float128_r (from, to, r)
| ^
../sysdeps/ieee754/float128/s_fabsf128.c:5:3: note: expanded from macro 'libm_alias_float128_r'
5 | static_weak_alias (from ## f128 ## r, to ## f128 ## r); \
| ^
./../include/libc-symbols.h:166:46: note: expanded from macro 'static_weak_alias'
166 | # define static_weak_alias(name, aliasname) weak_alias (name, aliasname)
| ^
./../include/libc-symbols.h:154:38: note: expanded from macro 'weak_alias'
154 | # define weak_alias(name, aliasname) _weak_alias (name, aliasname)
| ^
./../include/libc-symbols.h:156:52: note: expanded from macro '_weak_alias'
156 | extern __typeof (name) aliasname __attribute__ ((weak, alias (#name))) \
| ^
../include/math.h:134:1: note: previous definition is here
134 | fabsf128 (_Float128 x)
If the compiler does not support __USE_EXTERN_INLINES, we need to route
the fabsf128 call to an internal symbol.
When we want to inline builtin math functions, like truncf, then given
extern float truncf (float __x) __attribute__ ((__nothrow__ )) __attribute__ ((__const__));
extern float __truncf (float __x) __attribute__ ((__nothrow__ )) __attribute__ ((__const__));
float (truncf) (float) asm ("__truncf");
the compiler (clang, for instance) may redirect truncf calls to __truncf
instead of inlining them. USE_TRUNCF_BUILTIN is set to 1 to indicate that
truncf should be inlined; in this case, we don't want the truncf
redirection:
1. For each math function which may be inlined, we define
#if USE_TRUNCF_BUILTIN
# define NO_truncf_BUILTIN inline_truncf
#else
# define NO_truncf_BUILTIN truncf
#endif
in <math-use-builtins.h>.
2. Include <math-use-builtins.h> in include/math.h.
3. Change MATH_REDIRECT to
#define MATH_REDIRECT(FUNC, PREFIX, ARGS) \
float (NO_ ## FUNC ## f ## _BUILTIN) (ARGS (float)) \
asm (PREFIX #FUNC "f");
With this change, if USE_TRUNCF_BUILTIN is 0, we get
float (truncf) (float) asm ("__truncf");
truncf will be redirected to __truncf.
And for USE_TRUNCF_BUILTIN 1, we get:
float (inline_truncf) (float) asm ("__truncf");
In both cases either truncf will be inlined or the internal alias
(__truncf) will be called.
This is not required for every math-use-builtins symbol, only the ones
defined in math.h. It also allows removing all the explicit
math-use-builtins inclusions, since the header is now implicitly included
by math.h.
For MIPS, some math-use-builtins headers include sysdep.h, which in turn
pulls in a lot of extra headers that do not allow the ldbl-128 code to
override the alias definitions (math.h will include some stdlib.h
definitions). The math-use-builtins header only requires __mips_isa_rev,
so move that definition to sgidefs.h.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Add fast path optimization for frexpl (80-bit x87 extended precision) using
a single unsigned comparison to identify normal floating-point numbers and
return immediately via arithmetic on the exponent field.
The implementation uses arithmetic operations (se - ex) to adjust the
exponent directly, which is simpler than bit masking. For subnormals,
the traditional multiply-based normalization is retained as it handles the
split word format more reliably.
The zero/infinity/NaN check groups these special cases together for better
branch prediction.
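A minimal, self-contained sketch of this fast path (the function name,
the union layout assuming the little-endian x87 format with a 64-bit
mantissa word followed by a 16-bit sign/exponent word, and other details
are illustrative only; the committed code uses the usual glibc ldbl-96
accessor macros):

#include <stdint.h>

long double
frexpl_sketch (long double x, int *eptr)
{
  union { long double ld; struct { uint64_t mant; uint16_t se; } w; } u = { x };
  unsigned int se = u.w.se;
  unsigned int ex = se & 0x7fff;

  /* One unsigned comparison accepts every normal number (biased exponent
     1..0x7ffe) and rejects zero, subnormals, infinities and NaNs.  */
  if (ex - 1 < 0x7ffe)
    {
      *eptr = (int) ex - 0x3ffe;     /* Unbiased exponent plus one.  */
      u.w.se = se - ex + 0x3ffe;     /* Keep the sign, force |x| into [0.5, 1).  */
      return u.ld;
    }

  *eptr = 0;
  if (ex == 0x7fff || u.w.mant == 0) /* Infinity, NaN or +/-0.  */
    return x;

  /* Subnormal: keep the multiply-based normalization.  */
  u.ld *= 0x1p64L;
  se = u.w.se;
  ex = se & 0x7fff;
  *eptr = (int) ex - 0x3ffe - 64;
  u.w.se = se - ex + 0x3ffe;
  return u.ld;
}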
Benchmark results on Intel Core i9-13900H (13th Gen):
Baseline: 25.543 ns/op
Optimized: 25.531 ns/op
Speedup: 1.00x (neutral)
Zero: 17.774 ns/op
Denormal: 23.900 ns/op
Signed-off-by: Osama Abdelkader <osama.abdelkader@gmail.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
As discussed in bug 28327, C23 changed the fromfp functions to return
floating types instead of intmax_t / uintmax_t. (Although the
motivation in N2548 was reducing the use of intmax_t in library
interfaces, the new version does have the advantage of being able to
specify arbitrary integer widths for e.g. assigning the result to a
_BitInt, as well as being able to indicate an error case in-band with
a NaN return.)
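A hedged illustration of the new semantics (prototype per C23,
double fromfp (double, int, unsigned int), with the FP_INT_* rounding
macros from <math.h>; this is not taken from the glibc sources or tests):

#include <math.h>
#include <stdio.h>

int
main (void)
{
  /* The rounded value comes back in the floating type, so it can be
     assigned to an integer of any width, including a _BitInt.  */
  double r = fromfp (3.25, FP_INT_TONEAREST, 8);

  /* A value that does not fit the requested width is reported in-band
     as a NaN rather than an arbitrary integer.  */
  double bad = fromfp (1000.0, FP_INT_TONEAREST, 8);

  printf ("%g %d\n", r, isnan (bad));
  return 0;
}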
As with other such changes from interfaces introduced in TS 18661,
implement the new types as a replacement for the old ones, with the
old functions remaining as compat symbols but not supported as an API.
The test generator used for many of the tests is updated to handle
both versions of the functions.
Tested for x86_64 and x86, and with build-many-glibcs.py.
Also tested tgmath tests for x86_64 with GCC 7 to make sure that the
modified case for older compilers in <tgmath.h> does work.
Also tested for powerpc64le to cover the ldbl-128ibm implementation
and the other things that are handled differently for that
configuration. The new tests fail for ibm128, but all the failures
relate to incorrect signs of zero results and turn out to arise from
bugs in the underlying roundl, ceill, truncl and floorl
implementations that I've reported in bug 33623, rather than
indicating any bug in the actual new implementation of the functions
for that format. So given fixes for those functions (which shouldn't
be hard, and of course should add to the tests for those functions
rather than relying only on indirect testing via fromfp), the fromfp
tests should start passing for ibm128 as well.
Remove uses of float_t and double_t. This is not useful on modern machines,
and does not help given GCC defaults to -fexcess-precision=fast.
One use of double_t remains to allow forcing the precision to double
on targets where FLT_EVAL_METHOD=2. This fixes BZ #33563 on
i486-pc-linux-gnu.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Remove ldbl-128/s_fma.c - it makes no sense to use emulated float128
operations to emulate FMA. Benchmarking shows dbl-64/s_fma.c is about
twice as fast. Remove redundant dbl-64/s_fma.c includes in targets
that were trying to work around this issue.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Add fast path optimization for frexpl (128-bit IEEE quad precision) using
a single unsigned comparison to identify normal floating-point numbers and
return immediately via arithmetic on the exponent field.
The implementation uses arithmetic operations hx = hx - (ex << 48)
to adjust the exponent in place, which is simpler and more efficient than
bit masking. For subnormals, the traditional multiply-based normalization
is retained for reliability with the split 64-bit word format.
The zero/infinity/NaN check groups these special cases together for better
branch prediction.
This optimization provides the same algorithmic improvements as the other
frexp variants while maintaining correctness for all edge cases.
Signed-off-by: Osama Abdelkader <osama.abdelkader@gmail.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Add fast path optimization for frexp using a single unsigned comparison
to identify normal floating-point numbers and return immediately via
arithmetic on the bit representation.
The implementation uses asuint64()/asdouble() from math_config.h and arithmetic
operations to adjust the exponent, which generates better code than bit masking
on ARM and RISC-V architectures. For subnormals, stdc_leading_zeros provides
faster normalization than the traditional multiply approach.
The zero/infinity/NaN check is simplified to (int64_t)(ix << 1) <= 0, which
is more efficient than separate comparisons.
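A self-contained sketch of the approach (stand-in asuint64/asdouble
helpers replace the math_config.h ones; the function name and exact
details are illustrative and may differ from the committed code):

#include <stdbit.h>   /* C23 stdc_leading_zeros.  */
#include <stdint.h>
#include <string.h>

static inline uint64_t asuint64 (double x)
{ uint64_t u; memcpy (&u, &x, sizeof u); return u; }
static inline double asdouble (uint64_t u)
{ double x; memcpy (&x, &u, sizeof x); return x; }

double
frexp_sketch (double x, int *eptr)
{
  uint64_t ix = asuint64 (x);
  int ex = (ix >> 52) & 0x7ff;

  /* One unsigned comparison accepts every normal number (biased exponent
     1..0x7fe); the exponent field is then rewritten arithmetically.  */
  if ((unsigned int) (ex - 1) < 0x7fe)
    {
      *eptr = ex - 0x3fe;
      return asdouble (ix - ((uint64_t) (ex - 0x3fe) << 52));
    }

  *eptr = 0;
  /* Zero, infinity and NaN in one test: shifting out the sign bit leaves
     0 for +/-0 and a negative value when the exponent is all ones.  */
  if ((int64_t) (ix << 1) <= 0)
    return x;

  /* Subnormal: locate the leading mantissa bit with stdc_leading_zeros
     instead of scaling by a power of two.  */
  uint64_t m = ix & 0x000fffffffffffffULL;
  int lz = stdc_leading_zeros (m << 12);   /* Zeros above the mantissa MSB.  */
  *eptr = -1022 - lz;
  return asdouble ((ix & 0x8000000000000000ULL) | (0x3feULL << 52)
                   | ((m << (lz + 1)) & 0x000fffffffffffffULL));
}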
Benchmark results on Intel Core i9-13900H (13th Gen):
Baseline: 6.778 ns/op
Optimized: 4.007 ns/op
Speedup: 1.69x (40.9% faster)
Zero: 3.580 ns/op (fast path)
Denormal: 6.096 ns/op (slower, rare case)
Signed-off-by: Osama Abdelkader <osama.abdelkader@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Add fast path optimization for frexpf using a single unsigned comparison
to identify normal floating-point numbers and return immediately via
arithmetic on the bit representation.
The implementation uses asuint()/asfloat() from math_config.h and arithmetic
operations to adjust the exponent, which generates better code than bit masking
on ARM and RISC-V architectures. For subnormals, stdc_leading_zeros provides
faster normalization than the traditional multiply approach.
The zero/infinity/NaN check is simplified to (int32_t)(hx << 1) <= 0, which
is more efficient than separate comparisons.
Benchmark results on Intel Core i9-13900H (13th Gen):
Baseline: 5.858 ns/op
Optimized: 4.003 ns/op
Speedup: 1.46x (31.7% faster)
Zero: 3.580 ns/op (fast path)
Denormal: 5.597 ns/op (slower, rare case)
Signed-off-by: Osama Abdelkader <osama.abdelkader@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
It improves latency by about 1.5% and throughput by about 2-4%.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
It improves latency by about 3-6% and throughput by about 5-12%.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
As discussed in bug 28327, the fromfp functions changed type in C23
(compared to the version in TS 18661-1); they now return the same type
as the floating-point argument, instead of intmax_t / uintmax_t.
As with other such incompatible changes compared to the initial TS
18661 versions of interfaces (the types of totalorder functions, in
particular), it seems appropriate to support only the new version as
an API, not the old one (although many programs written for the old
API might in fact work with the new one as well). Thus, the existing
implementations should become compat symbols. They are sufficiently
different from how I'd expect to implement the new version that using
separate implementations in separate files is more convenient than
trying to share code, and directly sharing testcases would be
problematic as well.
Rename the existing fromfp implementation and test files to names
reflecting how they're intended to become compat symbols, so freeing
up the existing filenames for a subsequent implementation of the C23
versions of these functions (which is the point at which the existing
implementations would actually become compat symbols).
gen-fromfp-tests.py and gen-fromfp-tests-inputs are not renamed; I
think it will make sense to adapt the test generator to be able to
generate most tests for both versions of the functions (with extra
test inputs added that are only of interest with the C23 version).
The ldbl-opt/nldbl-* files are also not renamed; since those are for a
static only library, no compat versions are needed, and they'll just
have their contents changed when the C23 version is implemented.
Tested for x86_64, and with build-many-glibcs.py.
i386 and m68k architectures should use math-use-builtins-sqrt.h rather
than relying on architecture-specific or inline assembly implementations.
The PowerPC optimization for PPC 601/603 (30 years old) is removed.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
It improves latency by about 3-10% and throughput by about 5-15%.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The optimized i386 version is faster than the generic one, and
gcc implements it through the builtin. This optimization enables
us to migrate the implementation to a C version. The performance
on a Zen3 chip is similar to the SVID one.
The m68k provided an optimized version through __m81_u(remainderf)
(mathimpl.h), and gcc does not implement it through a builtin
(different than i386).
Performance improves a bit on x86_64 (Zen3, gcc 15.2.1):
reciprocal-throughput input master NO-SVID improvement
x86_64 subnormals 18.8522 16.2506 13.80%
x86_64 normal 421.8260 403.9270 4.24%
x86_64 close-exponent 21.0579 18.7642 10.89%
i686 subnormals 21.3443 21.4229 -0.37%
i686 normal 525.8380 538.807 -2.47%
i686 close-exponent 21.6589 21.7983 -0.64%
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The optimized i386 version is faster than the generic one, and gcc
implements it through the builtin. This optimization enables us to
migrate the implementation to a C version. The performance on a Zen3
chip is similar to the SVID one.
The m68k provided an optimized version through __m81_u(remainderf)
(mathimpl.h), and gcc does not implement it through a builtin (different
than i386).
Performance improves a bit on x86_64 (Zen3, gcc 15.2.1):
reciprocal-throughput input master NO-SVID improvement
x86_64 subnormals 17.5349 15.6125 10.96%
x86_64 normal 53.8134 52.5754 2.30%
x86_64 close-exponent 20.0211 18.6656 6.77%
i686 subnormals 21.8105 20.1856 7.45%
i686 normal 73.1945 71.2199 2.70%
i686 close-exponent 22.2141 20.331 8.48%
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Fix pow (DBL_MAX, 1.0) to return DBL_MAX when rounding upwards without FMA.
This fixes BZ #33563.
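A minimal, hypothetical reproducer for the fixed case (not the glibc
testsuite code; pow (x, 1.0) should return x exactly in every rounding
mode):

#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>

int
main (void)
{
  fesetround (FE_UPWARD);
  double r = pow (DBL_MAX, 1.0);
  printf ("pow (DBL_MAX, 1.0) == DBL_MAX: %d\n", r == DBL_MAX);
  return r == DBL_MAX ? 0 : 1;
}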
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Fix powf (0x1.fffffep+127, 1.0f) to return 0x1.fffffep+127 when
rounding upwards. Clean up the special-case code - performance
improves by ~1.2%. This fixes BZ #33563.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
It improves latency by about 3-10% and throughput by about 5-15%.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
It improves latency by about 1-10% and throughput by about 5-10%.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
It improves latency by about 3-7% and throughput by about 5-10%.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
It improves latency by about 2% and throughput by about 5%.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
It improves latency by about 2-10% and throughput by about 5-10%.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
It improves latency by about 3-10% and throughput by about 5-10%.
Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
clang defaults to warning about missing fall-through annotations and does
not support all of the comment-style annotations that gcc does. Use the
C23 [[fallthrough]] attribute instead.
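A short, hypothetical example of the change: a comment-style annotation
such as /* Fall through.  */ before the next case label becomes the
standard attribute, which both gcc and clang understand in C23 mode:

static int
classify (int c)
{
  int score = 0;
  switch (c)
    {
    case 0:
      score += 1;
      [[fallthrough]];   /* Replaces the fall-through comment.  */
    case 1:
      score += 2;
      break;
    default:
      score = -1;
    }
  return score;
}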
Reviewed-by: Collin Funk <collin.funk1@gmail.com>
For lgamma and tgamma, muldd, mulddd, and polydd are renamed to
muldd2, mulddd2, and polydd2, respectively.
Checked on aarch64-linux-gnu and x86_64-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The common code definitions are consolidated in s_erf_common.h
and s_erf_common.c.
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The shared internal data definitions are consolidated in
s_erf_data.c and the erfc-only ones are moved to s_erfc_data.c.
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The current implementation's precision shows the following accuracy on
three ranges ([-DBL_MIN, -4.2], [-4.2, 4.2], [4.2, DBL_MAX]) with
10e9 uniformly generated random numbers for each range (first column
is the accuracy in ULP, with '0' being correctly rounded, second is the
number of samples with the corresponding precision):
* Range [-DBL_MIN, -4.2]
* FE_TONEAREST
0: 10000000000 100.00%
* FE_UPWARD
0: 10000000000 100.00%
* FE_DOWNWARD
0: 10000000000 100.00%
* FE_TOWARDZERO
0: 10000000000 100.00%
* Range [-4.2, 4.2]
* FE_TONEAREST
0: 9764404513 97.64%
1: 235595487 2.36%
* FE_UPWARD
0: 9468013928 94.68%
1: 531986072 5.32%
* FE_DOWNWARD
0: 9493787693 94.94%
1: 506212307 5.06%
* FE_TOWARDZERO
0: 9585271351 95.85%
1: 414728649 4.15%
* Range [4.2, DBL_MAX]
* FE_TONEAREST
0: 10000000000 100.00%
* FE_UPWARD
0: 10000000000 100.00%
* FE_DOWNWARD
0: 10000000000 100.00%
* FE_TOWARDZERO
0: 10000000000 100.00%
The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).
Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:
reciprocal-throughput master patched improvement
x86_64 38.2754 78.0311 -103.87%
x86_64v2 38.3325 75.7555 -97.63%
x86_64v3 34.6604 28.3182 18.30%
aarch64 23.1499 21.4307 7.43%
power10 12.3051 9.3766 23.80%
Latency master patched improvement
x86_64 84.3062 121.3580 -43.95%
x86_64v2 84.1817 117.4250 -39.49%
x86_64v3 81.0933 70.6458 12.88%
aarch64 35.012 29.5012 15.74%
power10 21.7205 18.4589 15.02%
For x86_64/x86_64-v2, most of the performance hit comes from the fma call
going through the ifunc mechanism.
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The internal data definitions are moved to s_atanh_data.c.
It helps on ABIs that build the implementation multiple times for
ifunc optimizations, like x86_64.
Reviewed-by: DJ Delorie <dj@redhat.com>
The current implementation's precision shows the following accuracy on
one range ([-1, 1]) with 10e9 uniformly generated random numbers for
the range (first column is the accuracy in ULP, with '0' being
correctly rounded, second is the number of samples with the
corresponding precision):
* Range [-1, 1]
* FE_TONEAREST
0: 8180011860 81.80%
1: 1819865257 18.20%
2: 122883 0.00%
* FE_UPWARD
0: 3903695744 39.04%
1: 4992324465 49.92%
2: 1096319340 10.96%
3: 7660451 0.08%
* FE_DOWNWARD
0: 3904555484 39.05%
1: 4991970864 49.92%
2: 1095447471 10.95%
3: 8026181 0.08%
* FE_TOWARDZERO
0: 7070209165 70.70%
1: 2908447434 29.08%
2: 21343401 0.21%
The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).
Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:
reciprocal-throughput master patched improvement
x86_64 26.4969 22.4625 15.23%
x86_64v2 26.0792 22.9822 11.88%
x86_64v3 25.6357 22.2147 13.34%
aarch64 20.2295 19.7001 2.62%
power10 10.0986 9.3846 7.07%
Latency master patched improvement
x86_64 80.2311 59.9745 25.25%
x86_64v2 79.7010 61.4066 22.95%
x86_64v3 78.2679 58.5804 25.15%
aarch64 34.3959 28.1523 18.15%
power10 23.2417 18.2694 21.39%
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
Changes with respect to v1:
- added a comment in e_j1f.c to explain why using float is enough
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Clang issues:
../sysdeps/ieee754/dbl-64/s_llround.c:83:30: error: incompatible
redeclaration of library function 'lround'
[-Werror,-Wincompatible-library-redeclaration]
libm_alias_double (__lround, lround)
^
../sysdeps/ieee754/dbl-64/s_llround.c:83:30: note: 'lround' is a builtin
with type 'long (double)'
Reviewed-by: Sam James <sam@gentoo.org>
clang issues:
../sysdeps/ieee754/dbl-64/e_lgamma_r.c:234:29: error: absolute value function 'fabsf'
given an argument of type 'double' but has parameter of type 'float' which may cause \
truncation of value [-Werror,-Wabsolute-value]
It should not matter because the value is 0.0, but using fabs is
simpler than adding a warning suppression.
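An illustrative example (hypothetical function, not the e_lgamma_r.c
code) of the pattern clang warns about and the fix:

#include <math.h>

static double
abs_of_zero (double zero)
{
  /* Was: fabsf (zero) - a float function applied to a double argument
     triggers -Wabsolute-value under clang.  */
  return fabs (zero);
}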
Reviewed-by: Sam James <sam@gentoo.org>
And remove some unused entries of the fallback table.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>