On GCC 11 (x86-64), the previous code produced test failures like
this one:
Failure: Test: exp10m1_towardzero (-0x1.1p+4)
Result:
is: -1.00000000e+00 -0x1.000000p+0
should be: -9.99999940e-01 -0x1.fffffep-1
difference: 5.96046447e-08 0x1.000000p-24
ulp : 1.0000
max.ulp : 0.0000
Apply a similar fix to exp2m1f.
Co-authored-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The CORE-MATH exp2m1f implementation showed slight worse latency
when using x86_64 baseline ABI. This patch adds a ifunc variant
with similar performance for x86_64-v3.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: DJ Delorie <dj@redhat.com>
The CORE-MATH exp10m1f implementation showed slight worse latency
when using x86_64 baseline ABI. This patch adds a ifunc variant
with similar performance for x86_64-v3.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: DJ Delorie <dj@redhat.com>
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp2m1f.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow). The
only change is to handle FLT_MAX_EXP for FE_DOWNWARD or FE_TOWARDZERO.
The benchmark inputs are based on exp2f ones.
Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):
Latency master patched improvement
x86_64 40.6042 48.7104 -19.96%
x86_64v2 40.7506 35.9032 11.90%
x86_64v3 35.2301 31.7956 9.75%
i686 102.094 94.6657 7.28%
aarch64 18.2704 15.1387 17.14%
power10 11.9444 8.2402 31.01%
reciprocal-throughput master patched improvement
x86_64 20.8683 16.1428 22.64%
x86_64v2 19.5076 10.4474 46.44%
x86_64v3 19.2106 10.4014 45.86%
i686 56.4054 59.3004 -5.13%
aarch64 12.0781 7.3953 38.77%
power10 6.5306 5.9388 9.06%
The generic implementation calls __ieee754_exp2f and x86_64 provides
an optimized ifunc version (built with -mfma -mavx2, not correctly
rounded). This explains the performance difference for x86_64.
Same for i686, where the ABI provides an optimized __ieee754_exp2f
version built with '-msse2 -mfpmath=sse'. When built wth same
flags, the new algorithm shows a better performance:
master patched improvement
latency 102.094 91.2823 10.59%
reciprocal-throughput 56.4054 52.7984 6.39%
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp10m1f.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow). I mostly
fixed some small issues in corner cases (sNaN handling, -INFINITY,
a specific overflow check).
Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):
Latency master patched improvement
x86_64 45.4690 49.5845 -9.05%
x86_64v2 46.1604 36.2665 21.43%
x86_64v3 37.8442 31.0359 17.99%
i686 121.367 93.0079 23.37%
aarch64 21.1126 15.0165 28.87%
power10 12.7426 8.4929 33.35%
reciprocal-throughput master patched improvement
x86_64 19.6005 17.4005 11.22%
x86_64v2 19.6008 11.1977 42.87%
x86_64v3 17.5427 10.2898 41.34%
i686 59.4215 60.9675 -2.60%
aarch64 13.9814 7.9173 43.37%
power10 6.7814 6.4258 5.24%
The generic implementation calls __ieee754_exp10f which has an
optimized version, although it is not correctly rounded, which is
the main culprit of the the latency difference for x86_64 and
throughp for i686.
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
Also remove the use of builtins in favor of standard names, compiler
already inline them (if supported) with current compiler options.
It also fixes and issue where __builtin_roundeven is not support on
gcc older than version 10.
Checked on x86_64-linux-gnu and i686-linux_gnu.
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
The CORE-MATH implementation is correctly rounded (for any rounding mode).
This can be checked by exhaustive tests in a few minutes since there are
less than 2^32 values to check against for example GNU MPFR.
This patch also adds some bench values for tgammaf.
Tested on x86_64 and x86 (cfarm26).
With the initial GNU libc code it gave on an Intel(R) Core(TM) i7-8700:
"tgammaf": {
"": {
"duration": 3.50188e+09,
"iterations": 2e+07,
"max": 602.891,
"min": 65.1415,
"mean": 175.094
}
}
With the new code:
"tgammaf": {
"": {
"duration": 3.30825e+09,
"iterations": 5e+07,
"max": 211.592,
"min": 32.0325,
"mean": 66.1649
}
}
With the initial GNU libc code it gave on cfarm26 (i686):
"tgammaf": {
"": {
"duration": 3.70505e+09,
"iterations": 6e+06,
"max": 2420.23,
"min": 243.154,
"mean": 617.509
}
}
With the new code:
"tgammaf": {
"": {
"duration": 3.24497e+09,
"iterations": 1.8e+07,
"max": 1238.15,
"min": 101.155,
"mean": 180.276
}
}
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Changes in v2:
- include <math.h> (fix the linknamespace failures)
- restored original benchtests/strcoll-inputs/filelist#en_US.UTF-8 file
- restored original wrapper code (math/w_tgammaf_compat.c),
except for the dealing with the sign
- removed the tgammaf/float entries in all libm-test-ulps files
- address other comments from Joseph Myers
(https://sourceware.org/pipermail/libc-alpha/2024-July/158736.html)
Changes in v3:
- pass NULL argument for signgam from w_tgammaf_compat.c
- use of math_narrow_eval
- added more comments
Changes in v4:
- initialize local_signgam to 0 in math/w_tgamma_template.c
- replace sysdeps/ieee754/dbl-64/gamma_productf.c by dummy file
Changes in v5:
- do not mention local_signgam any more in math/w_tgammaf_compat.c
- initialize local_signgam to 1 instead of 0 in w_tgamma_template.c
and added comment
Changes in v6:
- pass NULL as 2nd argument of __ieee754_gammaf_r in
w_tgammaf_compat.c, and check for NULL in e_gammaf_r.c
Changes in v7:
- added Signed-off-by line for Alexei Sibidanov (author of the code)
Changes in v8:
- added Signed-off-by line for Paul Zimmermann (submitted of the patch)
Changes in v9:
- address comments from review by Adhemerval Zanella
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
GCC aligns global data to 16 bytes if their size is >= 16 bytes. This patch
changes the exp2f_data struct slightly so that the fields are better aligned.
As a result on targets that support them, load-pair instructions accessing
poly_scaled and invln2_scaled are now 16-byte aligned.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
As discussed at the patch review meeting
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
Reviewed-by: Simon Chopin <simon.chopin@canonical.com>
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the exp2m1 and exp10m1 functions (exp2(x)-1 and
exp10(x)-1, like expm1).
As with other such functions, these use type-generic templates that
could be replaced with faster and more accurate type-specific
implementations in future. Test inputs are copied from those for
expm1, plus some additions close to the overflow threshold (copied
from exp2 and exp10) and also some near the underflow threshold.
exp2m1 has the unusual property of having an input (M_MAX_EXP) where
whether the function overflows (under IEEE semantics) depends on the
rounding mode. Although these could reasonably be XFAILed in the
testsuite (as we do in some cases for arguments very close to a
function's overflow threshold when an error of a few ulps in the
implementation can result in the implementation not agreeing with an
ideal one on whether overflow takes place - the testsuite isn't smart
enough to handle this automatically), since these functions aren't
required to be correctly rounding, I made the implementation check for
and handle this case specially.
The Makefile ordering expected by lint-makefiles for the new functions
is a bit peculiar, but I implemented it in this patch so that the test
passes; I don't know why log2 also needed moving in one Makefile
variable setting when it didn't in my previous patches, but the
failure showed a different place was expected for that function as
well.
The powerpc64le IFUNC setup seems not to be as self-contained as one
might hope; it shouldn't be necessary to add IFUNCs for new functions
such as these simply to get them building, but without setting up
IFUNCs for the new functions, there were undefined references to
__GI___expm1f128 (that IFUNC machinery results in no such function
being defined, but doesn't stop include/math.h from doing the
redirection resulting in the exp2m1f128 and exp10m1f128
implementations expecting to call it).
Tested for x86_64 and x86, and with build-many-glibcs.py.
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the log10p1 functions (log10(1+x): like log1p, but for
base-10 logarithms).
This is directly analogous to the log2p1 implementation (except that
whereas log2p1 has a smaller underflow range than log1p, log10p1 has a
larger underflow range). The test inputs are copied from those for
log1p and log2p1, plus a few more inputs in that wider underflow
range.
Tested for x86_64 and x86, and with build-many-glibcs.py.
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the logp1 functions (aliases for log1p functions - the
name is intended to be more consistent with the new log2p1 and
log10p1, where clearly it would have been very confusing to name those
functions log21p and log101p). As aliases rather than new functions,
the content of this patch is somewhat different from those actually
adding new functions.
Tests are shared with log1p, so this patch *does* mechanically update
all affected libm-test-ulps files to expect the same errors for both
functions.
The vector versions of log1p on aarch64 and x86_64 are *not* updated
to have logp1 aliases (and thus there are no corresponding header,
tests, abilist or ulps changes for vector functions either). It would
be reasonable for such vector aliases and corresponding changes to
other files to be made separately. For now, the log1p tests instead
avoid testing logp1 in the vector case (a Makefile change is needed to
avoid problems with grep, used in generating the .c files for vector
function tests, matching more than one ALL_RM_TEST line in a file
testing multiple functions with the same inputs, when it assumes that
the .inc file only has a single such line).
Tested for x86_64 and x86, and with build-many-glibcs.py.
Left shift of ki is undefined when ki<0, copy the logic from exp,
which uses unsigned arithmetics, to fix it.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The libc.a for alpha, s390, and sparcv9 does not provide
copysignf64x, copysignf128, frexpf64x, frexpf128, modff64x, and
modff128.
Checked with a static build for the affected ABIs.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Don't provide __nexttowardf128_do_not_use, nexttowardf128_do_not_use,
finitef128_do_not_use, isinff128_do_not_use and isnanf128_do_not_use.
This fixes BZ #31757.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Some static implementation of float128 routines might call __isnanf128,
which is not provided by the static object.
Checked on x86_64-linux-gnu.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The commit 16439f419b removed the static fmod/fmodf on i386 and m68k
with and empty w_fmod.c (required for the ABIs that uses the newly
implementation). This patch fixes by adding the required symbols on
the arch-specific w_fmod{f}_compat.c implementation.
To statically build fmod fails on some ABI (alpha, s390, sparc) because
it does not export the ldexpf128, this is also fixed by this patch.
Checked on i686-linux-gnu and with a build for m68k-linux-gnu.
Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Tested-by: Aurelien Jarno <aurelien@aurel32.net>
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the log2p1 functions (log2(1+x): like log1p, but for
base-2 logarithms).
This illustrates the intended structure of implementations of all
these function families: define them initially with a type-generic
template implementation. If someone wishes to add type-specific
implementations, it is likely such implementations can be both faster
and more accurate than the type-generic one and can then override it
for types for which they are implemented (adding benchmarks would be
desirable in such cases to demonstrate that a new implementation is
indeed faster).
The test inputs are copied from those for log1p. Note that these
changes make gen-auto-libm-tests depend on MPFR 4.2 (or later).
The bulk of the changes are fairly generic for any such new function.
(sysdeps/powerpc/nofpu/Makefile only needs changing for those
type-generic templates that use fabs.)
Tested for x86_64 and x86, and with build-many-glibcs.py.
Fix BZ #31759 by not defining nearbyint aliases when used in IFUNC.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Complete the internal renaming from "C2X" and related names in GCC by
renaming *-c2x and *-gnu2x tests to *-c23 and *-gnu23.
Tested for x86_64, and with build-many-glibcs.py for powerpc64le.
WG14 decided to use the name C23 as the informal name of the next
revision of the C standard (notwithstanding the publication date in
2024). Update references to C2X in glibc to use the C23 name.
This is intended to update everything *except* where it involves
renaming files (the changes involving renaming tests are intended to
be done separately). In the case of the _ISOC2X_SOURCE feature test
macro - the only user-visible interface involved - support for that
macro is kept for backwards compatibility, while adding
_ISOC23_SOURCE.
Tested for x86_64.
Remove the error handling wrapper from exp10. This is very similar to
the changes done to exp and exp2, except that we also need to handle
pow10 and pow10l.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
New implementation is based on the existing exp/exp2, with different
reduction constants and polynomial. Worst-case error in round-to-
nearest is 0.513 ULP.
The exp/exp2 shared table is reused for exp10 - .rodata size of
e_exp_data increases by 64 bytes.
As for exp/exp2, targets with single-instruction rounding/conversion
intrinsics can use them by toggling TOINT_INTRINSICS=1 and adding the
necessary code to their math_private.h.
Improvements on Neoverse V1 compared to current GLIBC master:
exp10 thruput: 3.3x in [-0x1.439b746e36b52p+8 0x1.34413509f79ffp+8]
exp10 latency: 1.8x in [-0x1.439b746e36b52p+8 0x1.34413509f79ffp+8]
Tested on:
aarch64-linux-gnu (TOINT_INTRINSICS, fma contraction) and
x86_64-linux-gnu (!TOINT_INTRINSICS, no fma contraction)
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
With GCC 14 on 32-bit x86 the compiler emits a maybe-uninitialized
warning:
../sysdeps/ieee754/dbl-64/k_rem_pio2.c: In function '__kernel_rem_pio2':
../sysdeps/ieee754/dbl-64/k_rem_pio2.c:364:20: error: 'fq' may be used uninitialized [-Werror=maybe-uninitialized]
364 | y[0] = fq[0]; y[1] = fq[1]; y[2] = fw;
| ~~^~~
This is similar to the warning that is suppressed in the other branch of
the switch. Help the compiler knowing that the variable is always
initialized, which also makes the suppression obsolete.
On Skylake, it changes log1p bench performance by:
Before After Improvement
max 63.349 58.347 8%
min 4.448 5.651 -30%
mean 12.0674 10.336 14%
The minimum code path is
if (hx < 0x3FDA827A) /* x < 0.41422 */
{
if (__glibc_unlikely (ax >= 0x3ff00000)) /* x <= -1.0 */
{
...
}
if (__glibc_unlikely (ax < 0x3e200000)) /* |x| < 2**-29 */
{
math_force_eval (two54 + x); /* raise inexact */
if (ax < 0x3c900000) /* |x| < 2**-54 */
{
...
}
else
return x - x * x * 0.5;
FMA and non-FMA code sequences look similar. Non-FMA version is slightly
faster. Since log1p is called by asinh and atanh, it improves asinh
performance by:
Before After Improvement
max 75.645 63.135 16%
min 10.074 10.071 0%
mean 15.9483 14.9089 6%
and improves atanh performance by:
Before After Improvement
max 91.768 75.081 18%
min 15.548 13.883 10%
mean 18.3713 16.8011 8%
On Skylake, it improves expm1 bench performance by:
Before After Improvement
max 70.204 68.054 3%
min 20.709 16.2 22%
mean 22.1221 16.7367 24%
NB: Add
extern long double __expm1l (long double);
extern long double __expm1f128 (long double);
for __typeof (__expm1l) and __typeof (__expm1f128) when __expm1 is
defined since __expm1 may be expanded in their declarations which
causes the build failure.
Bump autoconf requirement to 2.71 to allow regenerating configure on
more recent distributions. autoconf 2.71 has been in Fedora since F36
and is the current version in Debian stable (bookworm). It appears to
be current in Gentoo as well.
All sysdeps configure and preconfigure scripts have also been
regenerated; all changes are trivial transformations that do not affect
functionality.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Return value from *scanf and *asprintf routines are now properly checked
in test-scanf-ldbl-compat-template.c and test-printf-ldbl-compat.c.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
This allows to include bits/syslog-decl.h in include/sys/syslog.h and
therefore be able to create the libc_hidden_builtin_proto (__syslog_chk)
prototype.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
If libc_hidden_builtin_{def,proto} isn't properly set for *_chk routines,
there are unwanted PLT entries in libc.so.
There is a special case with __asprintf_chk:
If ldbl_* macros are used for asprintf, ABI gets broken on s390x,
if it isn't, ppc64le isn't building due to multiple asm redirections.
This is due to the inclusion of bits/stdio-lbdl.h for ppc64le whereas it
isn't for s390x. This header creates redirections, which are not
compatible with the ones generated using libc_hidden_def.
Yet, we can't use libc_hidden_ldbl_proto on s390x since it will not
create a simple strong alias (e.g. as done on x86_64), but a versioned
alias, leading to ABI breakage.
This results in errors on s390x:
/usr/bin/ld: glibc/iconv/../libio/bits/stdio2.h:137: undefined reference
to `__asprintf_chk'
Original __asprintf_chk symbols:
00000000001395b0 T __asprintf_chk
0000000000177e90 T __nldbl___asprintf_chk
__asprintf_chk symbols with ldbl_* macros:
000000000012d590 t ___asprintf_chk
000000000012d590 t __asprintf_chk@@GLIBC_2.4
000000000012d590 t __GI___asprintf_chk
000000000012d590 t __GL____asprintf_chk___asprintf_chk
0000000000172240 T __nldbl___asprintf_chk
__asprintf_chk symbols with the patch:
000000000012d590 t ___asprintf_chk
000000000012d590 T __asprintf_chk
000000000012d590 t __GI___asprintf_chk
0000000000172240 T __nldbl___asprintf_chk
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The *_chk routines naming doesn't match the name that would be generated
using libc_hidden_ldbl_proto. Since the macro is needed for some of
these *_chk functions for _FORTIFY_SOURCE to be enabled, that needed to
be fixed.
While at it, all the *_chk function get renamed appropriately for
consistency, even if not strictly necessary.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Paul E. Murphy <murphyp@linux.ibm.com>
Since the _FORTIFY_SOURCE feature uses some routines of Glibc, they need to
be excluded from the fortification.
On top of that:
- some tests explicitly verify that some level of fortification works
appropriately, we therefore shouldn't modify the level set for them.
- some objects need to be build with optimization disabled, which
prevents _FORTIFY_SOURCE to be used for them.
Assembler files that implement architecture specific versions of the
fortified routines were not excluded from _FORTIFY_SOURCE as there is no
C header included that would impact their behavior.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Replace the loop-over-scalar placeholder routines with optimised
implementations from Arm Optimized Routines (AOR).
Also add some headers containing utilities for aarch64 libmvec
routines, and update libm-test-ulps.
Data tables for new routines are used via a pointer with a
barrier on it, in order to prevent overly aggressive constant
inlining in GCC. This allows a single adrp, combined with offset
loads, to be used for every constant in the table.
Special-case handlers are marked NOINLINE in order to confine the
save/restore overhead of switching from vector to normal calling
standard. This way we only incur the extra memory access in the
exceptional cases. NOINLINE definitions have been moved to
math_private.h in order to reduce duplication.
AOR exposes a config option, WANT_SIMD_EXCEPT, to enable
selective masking (and later fixing up) of invalid lanes, in
order to trigger fp exceptions correctly (AdvSIMD only). This is
tested and maintained in AOR, however it is configured off at
source level here for performance reasons. We keep the
WANT_SIMD_EXCEPT blocks in routine sources to greatly simplify
the upstreaming process from AOR to glibc.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
This patch redirects the error functions to the appropriate
longdouble variants which enables the compiler to optimize
for the abi ieeelongdouble.
Signed-off-by: Sachin Monga <smonga@linux.ibm.com>
Optimize the fast paths (x < y) and (x/y < 2^12). Delay handling of special
cases to reduce the number of instructions executed before the fast paths.
Performance improvements for fmod:
Skylake Zen2 Neoverse V1
subnormals 11.8% 4.2% 11.5%
normal 3.9% 0.01% -0.5%
close-exponents 6.3% 5.6% 19.4%
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The error handling is moved to sysdeps/ieee754 version with no SVID
support. The compatibility symbol versions still use the wrapper
with SVID error handling around the new code. There is no new symbol
version nor compatibility code on !LIBM_SVID_COMPAT targets
(e.g. riscv).
The ia64 is unchanged, since it still uses the arch specific
__libm_error_region on its implementation. For both i686 and m68k,
which provive arch specific implementation, wrappers are added so
no new symbol are added (which would require to change the
implementations).
It shows an small improvement, the results for fmod:
Architecture | Input | master | patch
-----------------|-----------------|----------|--------
x86_64 (Ryzen 9) | subnormals | 12.5049 | 9.40992
x86_64 (Ryzen 9) | normal | 296.939 | 296.738
x86_64 (Ryzen 9) | close-exponents | 16.0244 | 13.119
aarch64 (N1) | subnormal | 6.81778 | 4.33313
aarch64 (N1) | normal | 155.620 | 152.915
aarch64 (N1) | close-exponents | 8.21306 | 5.76138
armhf (N1) | subnormal | 15.1083 | 14.5746
armhf (N1) | normal | 244.833 | 241.738
armhf (N1) | close-exponents | 21.8182 | 22.457
Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
This uses a new algorithm similar to already proposed earlier [1].
With x = mx * 2^ex and y = my * 2^ey (mx, my, ex, ey being integers),
the simplest implementation is:
mx * 2^ex == 2 * mx * 2^(ex - 1)
while (ex > ey)
{
mx *= 2;
--ex;
mx %= my;
}
With mx/my being mantissa of double floating pointer, on each step the
argument reduction can be improved 8 (which is sizeof of uint32_t minus
MANTISSA_WIDTH plus the signal bit):
while (ex > ey)
{
mx << 8;
ex -= 8;
mx %= my;
} */
The implementation uses builtin clz and ctz, along with shifts to
convert hx/hy back to doubles. Different than the original patch,
this path assume modulo/divide operation is slow, so use multiplication
with invert values.
I see the following performance improvements using fmod benchtests
(result only show the 'mean' result):
Architecture | Input | master | patch
-----------------|-----------------|----------|--------
x86_64 (Ryzen 9) | subnormals | 17.2549 | 12.0318
x86_64 (Ryzen 9) | normal | 85.4096 | 49.9641
x86_64 (Ryzen 9) | close-exponents | 19.1072 | 15.8224
aarch64 (N1) | subnormal | 10.2182 | 6.81778
aarch64 (N1) | normal | 60.0616 | 20.3667
aarch64 (N1) | close-exponents | 11.5256 | 8.39685
I also see similar improvements on arm-linux-gnueabihf when running on
the N1 aarch64 chips, where it a lot of soft-fp implementation (for
modulo, and multiplication):
Architecture | Input | master | patch
-----------------|-----------------|----------|--------
armhf (N1) | subnormal | 11.6662 | 10.8955
armhf (N1) | normal | 69.2759 | 34.1524
armhf (N1) | close-exponents | 13.6472 | 18.2131
Instead of using the math_private.h definitions, I used the
math_config.h instead which is used on newer math implementations.
Co-authored-by: kirill <kirill.okhotnikov@gmail.com>
[1] https://sourceware.org/pipermail/libc-alpha/2020-November/119794.html
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
This uses a new algorithm similar to already proposed earlier [1].
With x = mx * 2^ex and y = my * 2^ey (mx, my, ex, ey being integers),
the simplest implementation is:
mx * 2^ex == 2 * mx * 2^(ex - 1)
while (ex > ey)
{
mx *= 2;
--ex;
mx %= my;
}
With mx/my being mantissa of double floating pointer, on each step the
argument reduction can be improved 11 (which is sizeo of uint64_t minus
MANTISSA_WIDTH plus the signal bit):
while (ex > ey)
{
mx << 11;
ex -= 11;
mx %= my;
} */
The implementation uses builtin clz and ctz, along with shifts to
convert hx/hy back to doubles. Different than the original patch,
this path assume modulo/divide operation is slow, so use multiplication
with invert values.
I see the following performance improvements using fmod benchtests
(result only show the 'mean' result):
Architecture | Input | master | patch
-----------------|-----------------|----------|--------
x86_64 (Ryzen 9) | subnormals | 19.1584 | 12.5049
x86_64 (Ryzen 9) | normal | 1016.51 | 296.939
x86_64 (Ryzen 9) | close-exponents | 18.4428 | 16.0244
aarch64 (N1) | subnormal | 11.153 | 6.81778
aarch64 (N1) | normal | 528.649 | 155.62
aarch64 (N1) | close-exponents | 11.4517 | 8.21306
I also see similar improvements on arm-linux-gnueabihf when running on
the N1 aarch64 chips, where it a lot of soft-fp implementation (for
modulo, clz, ctz, and multiplication):
Architecture | Input | master | patch
-----------------|-----------------|----------|--------
armhf (N1) | subnormal | 15.908 | 15.1083
armhf (N1) | normal | 837.525 | 244.833
armhf (N1) | close-exponents | 16.2111 | 21.8182
Instead of using the math_private.h definitions, I used the
math_config.h instead which is used on newer math implementations.
Co-authored-by: kirill <kirill.okhotnikov@gmail.com>
[1] https://sourceware.org/pipermail/libc-alpha/2020-November/119794.html
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
They are both used by __libc_freeres to free all library malloc
allocated resources to help tooling like mtrace or valgrind with
memory leak tracking.
The current scheme uses assembly markers and linker script entries
to consolidate the free routine function pointers in the RELRO segment
and to be freed buffers in BSS.
This patch changes it to use specific free functions for
libc_freeres_ptrs buffers and call the function pointer array directly
with call_function_static_weak.
It allows the removal of both the internal macros and the linker
script sections.
Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
C2x adds binary integer constants starting with 0b or 0B, and supports
those constants for the %i scanf format (in addition to the %b format,
which isn't yet implemented for scanf in glibc). Implement that scanf
support for glibc.
As with the strtol support, this is incompatible with previous C
standard versions, in that such an input string starting with 0b or 0B
was previously required to be parsed as 0 (with the rest of the input
potentially matching subsequent parts of the scanf format string).
Thus this patch adds 12 new __isoc23_* functions per long double
format (12, 24 or 36 depending on how many long double formats the
glibc configuration supports), with appropriate header redirection
support (generally very closely following that for the __isoc99_*
scanf functions - note that __GLIBC_USE (DEPRECATED_SCANF) takes
precedence over __GLIBC_USE (C2X_STRTOL), so the case of GNU
extensions to C89 continues to get old-style GNU %a and does not get
this new feature). The function names would remain as __isoc23_* even
if C2x ends up published in 2024 rather than 2023.
When scanf %b support is added, I think it will be appropriate for all
versions of scanf to follow C2x rules for inputs to the %b format
(given that there are no compatibility concerns for a new format).
Tested for x86_64 (full glibc testsuite). The first version was also
tested for powerpc (32-bit) and powerpc64le (stdio-common/ and wcsmbs/
tests), and with build-many-glibcs.py.
The patch suppress the same warnings from 87c266d758,
that shows issues for microblaze, mips soft-fp, nios2, and or1k.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Fix the following issues with built-in function use in
sysdeps/ieee754/ldbl-128 and sysdeps/ieee754/float128:
* fabsl used __builtin_fabsf128 unconditionally, breaking the build
with GCC 6 for several architectures; it should use __builtin_fabsl
with an appropriate redirection in float128_private.h. (I'm not
particularly concerned with building glibc with GCC 6; rather, I
want to be able to run the tgmath.h tests with GCC 6, which is a
significantly different case for tgmath.h compared to GCC 7 and
later because of the lack of _FloatN / _FloatNx support in the
compiler, and at present running the tests with a compiler means
building glibc with that compiler.)
* Some (conditional) uses of built-in functions had been added to
ldbl-128 without appropriate float128_private.h remapping (there was
remapping for the macros controlling whether the built-in functions
are used, just not for the functions themselves).
* s_llrintl.c called __builtin_round not __builtin_llrintl, which is
obviously wrong.
Tested with build-many-glibcs.py for aarch64-linux-gnu, GCC 6 (where
it fixes the glibc build) and GCC 12, and with the glibc testsuite for
x86_64.
vfprintf is entangled with vfwprintf (of course), __printf_fp,
__printf_fphex, __vstrfmon_l_internal, and the strfrom family of
functions. The latter use the internal snprintf functionality,
so vsnprintf is converted as well.
The simples conversion is __printf_fphex, followed by
__vstrfmon_l_internal and __printf_fp, and finally
__vfprintf_internal and __vfwprintf_internal. __vsnprintf_internal
and strfrom* are mostly consuming the new interfaces, so they
are comparatively simple.
__printf_fp is a public symbol, so the FILE *-based interface
had to preserved.
The __printf_fp rewrite does not change the actual binary-to-decimal
conversion algorithm, and digits are still not emitted directly to
the target buffer. However, the staging buffer now uses bytes
instead of wide characters, and one buffer copy is eliminated.
The changes are at least performance-neutral in my testing.
Floating point printing and snprintf improved measurably, so that
this Lua script
for i=1,5000000 do
print(i, i * math.pi)
end
runs about 5% faster for me. To preserve fprintf performance for
a simple "%d" format, this commit has some logic changes under
LABEL (unsigned_number) to avoid additional function calls. There
are certainly some very easy performance improvements here: binary,
octal and hexadecimal formatting can easily avoid the temporary work
buffer (the number of digits can be computed ahead-of-time using one
of the __builtin_clz* built-ins). Decimal formatting can use a
specialized version of _itoa_word for base 10.
The existing (inconsistent) width handling between strfmon and printf
is preserved here. __print_fp_buffer_1 would have to use
__translated_number_width to achieve ISO conformance for printf.
Test expectations in libio/tst-vtables-common.c are adjusted because
the internal staging buffer merges all virtual function calls into
one.
In general, stack buffer usage is greatly reduced, particularly for
unbuffered input streams. __printf_fp can still use a large buffer
in binary128 mode for %g, though.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This patch is using the corresponding GCC builtin for logbf, logb,
logbl and logbf128 if the USE_FUNCTION_BUILTIN macros are defined to one
in math-use-builtins-function.h.
Co-Authored-By: Xi Ruoyao <xry111@xry111.site>
This patch is using the corresponding GCC builtin for llrintf, llrint,
llrintl and llrintf128 if the USE_FUNCTION_BUILTIN macros are defined to one
in math-use-builtins-function.h.
Co-Authored-By: Xi Ruoyao <xry111@xry111.site>
This patch is using the corresponding GCC builtin for lrintf, lrint,
lrintl and lrintf128 if the USE_FUNCTION_BUILTIN macros are defined to one
in math-use-builtins-function.h.
Co-Authored-By: Xi Ruoyao <xry111@xry111.site>
GCC 13 has added more _FloatN and _FloatNx versions of existing
<math.h> and <complex.h> built-in functions, for use in libstdc++-v3.
This breaks the glibc build because of how those functions are defined
as aliases to functions with the same ABI but different types. Add
appropriate -fno-builtin-* options for compiling relevant files, as
already done for the case of long double functions aliasing double
ones and based on the list of files used there.
I fixed some mistakes in that list of double files that I noticed
while implementing this fix, but there may well be more such
(harmless) cases, in this list or the new one (files that don't
actually exist or don't define the named functions as aliases so don't
need the options). I did try to exclude cases where glibc doesn't
define certain functions for _FloatN or _FloatNx types at all from the
new uses of -fno-builtin-* options. As with the options for double
files (see the commit message for commit
49348beafe, "Fix build with GCC 10 when
long double = double."), it's deliberate that the options are used
even if GCC currently doesn't have a built-in version of a given
functions, so providing some level of future-proofing against more
such built-in functions being added in future.
Tested with build-many-glibcs.py for aarch64-linux-gnu
powerpc-linux-gnu powerpc64le-linux-gnu x86_64-linux-gnu (compilers
and glibcs builds) with GCC mainline.
Detecting an overflow edge case depended on signed overflow of a long
long. Replace the additions and the overflow checks by
__builtin_add_overflow().
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
Avoid moving code across SET_RESTORE_ROUNDL in order to fix
[BZ #29463].
Tested-by: Aurelien Jarno <aurelien@aurel32.net>
Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@linux.ibm.com>
This works around a gcc issue where it const folded inf/inf into nan,
preventing the invalid exception to be signalled.
(x-x)/(x-x) is more robust against optimizations and works for all
out of bounds values including x==nan.
The gcc issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95115
should be fixed on release branches starting from gcc-10, but it is
better to change the code in case glibc is built with older gcc.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
GCC 13 adds support for _FloatN and _FloatNx types in C++, so breaking
the installed glibc headers that assume such support is not present.
GCC mostly works around this with fixincludes, but that doesn't help
for building glibc and its tests (glibc doesn't itself contain C++
code, but there's C++ code built for tests). Update glibc's
bits/floatn-common.h and bits/floatn.h headers to handle the GCC 13
support directly.
In general the changes match those made by fixincludes, though I think
the ones in sysdeps/powerpc/bits/floatn.h, where the header tests
__LDBL_MANT_DIG__ == 113 or uses #elif, wouldn't match the existing
fixincludes patterns.
Some places involving special C++ handling in relation to _FloatN
support are not changed. There's no need to change the
__HAVE_FLOATN_NOT_TYPEDEF definition (also in a form that wouldn't be
matched by the fixincludes fixes) because it's only used in relation
to macro definitions using features not supported for C++
(__builtin_types_compatible_p and _Generic). And there's no need to
change the inline function overloads for issignaling, iszero and
iscanonical in C++ because cases where types have the same format but
are no longer compatible types are handled automatically by the C++
overload resolution rules.
This patch also does not change the overload handling for iseqsig, and
there I think changes *are* needed, beyond those in this patch or made
by fixincludes. The way that overload is defined, via a template
parameter to a structure type, requires overloads whenever the types
are incompatible, even if they have the same format. So I think we
need to add overloads with GCC 13 for every supported _FloatN and
_FloatNx type, rather than just having one for _Float128 when it has a
different ABI to long double as at present (but for older GCC, such
overloads must not be defined for types that end up defined as
typedefs for another type).
Tested with build-many-glibcs.py: compilers build for
aarch64-linux-gnu ia64-linux-gnu mips64-linux-gnu powerpc-linux-gnu
powerpc64le-linux-gnu x86_64-linux-gnu; glibcs build for
aarch64-linux-gnu ia64-linux-gnu i686-linux-gnu mips-linux-gnu
mips64-linux-gnu-n32 powerpc-linux-gnu powerpc64le-linux-gnu
x86_64-linux-gnu.
math/test-float128-y1 fails on x86_64 and ppc64el with gcc 12 and -O3,
because code inside a block guarded by SET_RESTORE_ROUNDL is being moved
after the rounding mode has been restored. Use math_force_eval to
prevent this (and insert some math_opt_barrier calls to prevent code
from being moved before the rounding mode is set).
Fixes#29463
Reviewed-By: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The compiler may substitute calls to sin or cos with calls to sincos, thus
we should have the same optimized implementations for sincos. The
optimized implementations may produce results that differ, that also makes
sure that the sincos call aggrees with the sin and cos calls.
Both float, double, and _Float128 are assumed to be supported
(float and double already only uses builtins). Only long double
is parametrized due GCC bug 29253 which prevents its usage on
powerpc.
It allows to remove i686, ia64, x86_64, powerpc, and sparc arch
specific implementation.
On ia64 it also fixes the sNAN handling:
math/test-float64x-fabs
math/test-ldouble-fabs
Checked on x86_64-linux-gnu, i686-linux-gnu, powerpc-linux-gnu,
powerpc64-linux-gnu, sparc64-linux-gnu, and ia64-linux-gnu.
Converting double precision constants to float is now affected by the
runtime dynamic rounding mode instead of being evaluated at compile
time with default rounding mode (except static object initializers).
This can change the computed result and cause performance regression.
The known correctness issues (increased ulp errors) are already fixed,
this patch fixes remaining cases of unnecessary runtime conversions.
Add float M_* macros to math.h as new GNU extension API. To avoid
conversions the new M_* macros are used and instead of casting double
literals to float, use float literals (only required if the conversion
is inexact).
The patch was tested on aarch64 where the following symbols had new
spurious conversion instructions that got fixed:
__clog10f
__gammaf_r_finite@GLIBC_2.17
__j0f_finite@GLIBC_2.17
__j1f_finite@GLIBC_2.17
__jnf_finite@GLIBC_2.17
__kernel_casinhf
__lgamma_negf
__log1pf
__y0f_finite@GLIBC_2.17
__y1f_finite@GLIBC_2.17
cacosf
cacoshf
casinhf
catanf
catanhf
clogf
gammaf_positive
Fixes bug 28713.
Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 7061 files FOO.
I then removed trailing white space from math/tgmath.h,
support/tst-support-open-dev-null-range.c, and
sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following
obscure pre-commit check failure diagnostics from Savannah. I don't
know why I run into these diagnostics whereas others evidently do not.
remote: *** 912-#endif
remote: *** 913:
remote: *** 914-
remote: *** error: lines with trailing whitespace found
...
remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
s_cosf.c and s_sinf.c have
if (abstop12 (y) < abstop12 (pio4))
where abstop12 takes a float argument, but pio4 is static const double.
pio4 is used only in calls to abstop12 and never in arithmetic. Apply
-static const double pio4 = 0x1.921FB54442D18p-1;
+static const float pio4 = 0x1.921FB6p-1f;
to fix:
FAIL: math/test-float-cos
FAIL: math/test-float-sin
FAIL: math/test-float-sincos
FAIL: math/test-float32-cos
FAIL: math/test-float32-sin
FAIL: math/test-float32-sincos
when compiling with GCC 12.
Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
The macro TAYLOR_SIN adds the term `-0.5*da*a^2 + da` in hopes
of regaining some precision as a function of da. However the
comment says we add the term `-0.5*da*a^2 + 0.5*da` which is
different. This fix updates the comment to reflect the
code and also simplifies the calculation by replacing `a` with `x`
because they always have the same value.
Signed-off-by: Akila Welihinda <akilawelihinda@ucla.edu>
Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
The error handling is moved to sysdeps/ieee754 version with no SVID
support. The compatibility symbol versions still use the wrapper with
SVID error handling around the new code. There is no new symbol version
nor compatibility code on !LIBM_SVID_COMPAT targets (e.g. riscv).
Only ia64 is unchanged, since it still uses the arch specific
__libm_error_region on its implementation.
Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.
This implementation is based on 'An Improved Algorithm for hypot(a,b)'
by Carlos F. Borges [1] using the MyHypot3 with the following changes:
- Handle qNaN and sNaN.
- Tune the 'widely varying operands' to avoid spurious underflow
due the multiplication and fix the return value for upwards
rounding mode.
- Handle required underflow exception for subnormal results.
The main advantage of the new algorithm is its precision. With a
random 1e9 input pairs in the range of [LDBL_MIN, LDBL_MAX], glibc
current implementation shows around 0.05% results with an error of
1 ulp (453266 results) while the new implementation only shows
0.0001% of total (1280).
Checked on aarch64-linux-gnu and x86_64-linux-gnu.
[1] https://arxiv.org/pdf/1904.09481.pdf
This implementation is based on 'An Improved Algorithm for hypot(a,b)'
by Carlos F. Borges [1] using the MyHypot3 with the following changes:
- Handle qNaN and sNaN.
- Tune the 'widely varying operands' to avoid spurious underflow
due the multiplication and fix the return value for upwards
rounding mode.
- Handle required underflow exception for subnormal results.
The main advantage of the new algorithm is its precision. With a
random 1e8 input pairs in the range of [LDBL_MIN, LDBL_MAX], glibc
current implementation shows around 0.02% results with an error of
1 ulp (23158 results) while the new implementation only shows
0.0001% of total (111).
[1] https://arxiv.org/pdf/1904.09481.pdf
Improve hypot performance significantly by using fma when available. The
fma version has twice the throughput of the previous version and 70% of
the latency. The non-fma version has 30% higher throughput and 10%
higher latency.
Max ULP error is 0.949 with fma and 0.792 without fma.
Passes GLIBC testsuite.
This implementation is based on the 'An Improved Algorithm for
hypot(a,b)' by Carlos F. Borges [1] using the MyHypot3 with the
following changes:
- Handle qNaN and sNaN.
- Tune the 'widely varying operands' to avoid spurious underflow
due the multiplication and fix the return value for upwards
rounding mode.
- Handle required underflow exception for denormal results.
The main advantage of the new algorithm is its precision: with a
random 1e9 input pairs in the range of [DBL_MIN, DBL_MAX], glibc
current implementation shows around 0.34% results with an error of
1 ulp (3424869 results) while the new implementation only shows
0.002% of total (18851).
The performance result are also only slight worse than current
implementation. On x86_64 (Ryzen 5900X) with gcc 12:
Before:
"hypot": {
"workload-random": {
"duration": 3.73319e+09,
"iterations": 1.12e+08,
"reciprocal-throughput": 22.8737,
"latency": 43.7904,
"max-throughput": 4.37184e+07,
"min-throughput": 2.28361e+07
}
}
After:
"hypot": {
"workload-random": {
"duration": 3.7597e+09,
"iterations": 9.8e+07,
"reciprocal-throughput": 23.7547,
"latency": 52.9739,
"max-throughput": 4.2097e+07,
"min-throughput": 1.88772e+07
}
}
Co-Authored-By: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
[1] https://arxiv.org/pdf/1904.09481.pdf
Use a more optimized comparison for check for NaN and infinite and
add an inlined issignaling implementation for float. With gcc it
results in 2 FP comparisons.
The file Copyright is also changed to use GPL, the implementation was
completely changed by 7c10fd3515 to use double precision instead of
scaling and this change removes all the GET_FLOAT_WORD usage.
Checked on x86_64-linux-gnu.
The largest errors over the full binary32 range are after this
patch (on x86_64):
RNDN: libm wrong by up to 9.00e+00 ulp(s) [9] for x=0x1.04c39cp+6
RNDZ: libm wrong by up to 9.00e+00 ulp(s) [9] for x=0x1.04c39cp+6
RNDU: libm wrong by up to 9.00e+00 ulp(s) [9] for x=0x1.04c39cp+6
RNDD: libm wrong by up to 8.98e+00 ulp(s) [9] for x=0x1.4b7066p+7
Inputs that were yielding huge errors have been added to "make check".
Reviewed-by: Adhemeral Zanella <adhemerval.zanella@linaro.org>
C2X adds new <math.h> functions for floating-point maximum and
minimum, corresponding to the new operations that were added in IEEE
754-2019 because of concerns about the old operations not being
associative in the presence of signaling NaNs. fmaximum and fminimum
handle NaNs like most <math.h> functions (any NaN argument means the
result is a quiet NaN). fmaximum_num and fminimum_num handle both
quiet and signaling NaNs the way fmax and fmin handle quiet NaNs (if
one argument is a number and the other is a NaN, return the number),
but still raise "invalid" for a signaling NaN argument, making them
exceptions to the normal rule that a function with a floating-point
result raising "invalid" also returns a quiet NaN. fmaximum_mag,
fminimum_mag, fmaximum_mag_num and fminimum_mag_num are corresponding
functions returning the argument with greatest or least absolute
value. All these functions also treat +0 as greater than -0. There
are also corresponding <tgmath.h> type-generic macros.
Add these functions to glibc. The implementations use type-generic
templates based on those for fmax, fmin, fmaxmag and fminmag, and test
inputs are based on those for those functions with appropriate
adjustments to the expected results. The RISC-V maintainers might
wish to add optimized versions of fmaximum_num and fminimum_num (for
float and double), since RISC-V (F extension version 2.2 and later)
provides instructions corresponding to those functions - though it
might be at least as useful to add architecture-independent built-in
functions to GCC and teach the RISC-V back end to expand those
functions inline, which is what you generally want for functions that
can be implemented with a single instruction.
Tested for x86_64 and x86, and with build-many-glibcs.py.
Avoid defining f64xfmaf128 twice when building s_fmaf128.c.
This can be reproduced on powerpc64le whenever f128 functions do not
have IFUNC enabled, e.g. using "--with-cpu=power8 --disable-multi-arch", or
when using "-with-cpu=power9".
Fixes: b3f27d8150 ("Add narrowing fma functions")
This patch adds the narrowing fused multiply-add functions from TS
18661-1 / TS 18661-3 / C2X to glibc's libm: ffma, ffmal, dfmal,
f32fmaf64, f32fmaf32x, f32xfmaf64 for all configurations; f32fmaf64x,
f32fmaf128, f64fmaf64x, f64fmaf128, f32xfmaf64x, f32xfmaf128,
f64xfmaf128 for configurations with _Float64x and _Float128;
__f32fmaieee128 and __f64fmaieee128 aliases in the powerpc64le case
(for calls to ffmal and dfmal when long double is IEEE binary128).
Corresponding tgmath.h macro support is also added.
The changes are mostly similar to those for the other narrowing
functions previously added, especially that for sqrt, so the
description of those generally applies to this patch as well. As with
sqrt, I reused the same test inputs in auto-libm-test-in as for
non-narrowing fma rather than adding extra or separate inputs for
narrowing fma. The tests in libm-test-narrow-fma.inc also follow
those for non-narrowing fma.
The non-narrowing fma has a known bug (bug 6801) that it does not set
errno on errors (overflow, underflow, Inf * 0, Inf - Inf). Rather
than fixing this or having narrowing fma check for errors when
non-narrowing does not (complicating the cases when narrowing fma can
otherwise be an alias for a non-narrowing function), this patch does
not attempt to check for errors from narrowing fma and set errno; the
CHECK_NARROW_FMA macro is still present, but as a placeholder that
does nothing, and this missing errno setting is considered to be
covered by the existing bug rather than needing a separate open bug.
missing-errno annotations are duly added to many of the
auto-libm-test-in test inputs for fma.
This completes adding all the new functions from TS 18661-1 to glibc,
so will be followed by corresponding stdc-predef.h changes to define
__STDC_IEC_60559_BFP__ and __STDC_IEC_60559_COMPLEX__, as the support
for TS 18661-1 will be at a similar level to that for C standard
floating-point facilities up to C11 (pragmas not implemented, but
library functions done). (There are still further changes to be done
to implement changes to the types of fromfp functions from N2548.)
Tested as followed: natively with the full glibc testsuite for x86_64
(GCC 11, 7, 6) and x86 (GCC 11); with build-many-glibcs.py with GCC
11, 7 and 6; cross testing of math/ tests for powerpc64le, powerpc32
hard float, mips64 (all three ABIs, both hard and soft float). The
different GCC versions are to cover the different cases in tgmath.h
and tgmath.h tests properly (GCC 6 has _Float* only as typedefs in
glibc headers, GCC 7 has proper _Float* support, GCC 8 adds
__builtin_tgmath).
As described in bug 28358, the round-to-odd computations used in the
libm functions that round their results to a narrower format can yield
spurious underflow exceptions in the following circumstances: the
narrowing only narrows the precision of the type and not the exponent
range (i.e., it's narrowing _Float128 to _Float64x on x86_64, x86 or
ia64), the architecture does after-rounding tininess detection (which
applies to all those architectures), the result is inexact, tiny
before rounding but not tiny after rounding (with the chosen rounding
mode) for _Float64x (which is possible for narrowing mul, div and fma,
not for narrowing add, sub or sqrt), so the underflow exception
resulting from the toward-zero computation in _Float128 is spurious
for _Float64x.
Fixed by making ROUND_TO_ODD call feclearexcept (FE_UNDERFLOW) in the
problem cases (as indicated by an extra argument to the macro); there
is never any need to preserve underflow exceptions from this part of
the computation, because the conversion of the round-to-odd value to
the narrower type will underflow in exactly the cases in which the
function should raise that exception, but it may be more efficient to
avoid the extra manipulation of the floating-point environment when
not needed.
Tested for x86_64 and x86, and with build-many-glibcs.py.
include/math.h has a mechanism to redirect internal calls to various
libm functions, that can often be inlined by the compiler, to call
non-exported __* names for those functions in the case when the calls
aren't inlined, with the redirection being disabled when
NO_MATH_REDIRECT. Add fma to the functions to which this mechanism is
applied.
At present, libm-internal fma calls (generally to __builtin_fma*
functions) are only done when it's known the call will be inlined,
with alternative code not relying on an fma operation being used in
the caller otherwise. This patch is in preparation for adding the TS
18661 / C2X narrowing fma functions to glibc; it will be natural for
the narrowing function implementations to call the underlying fma
functions unconditionally, with this either being inlined or resulting
in an __fma* call. (Using two levels of round-to-odd computation like
that, in the case where there isn't an fma hardware instruction, isn't
optimal but is certainly a lot simpler for the initial implementation
than writing different narrowing fma implementations for all the
various pairs of formats.)
Tested with build-many-glibcs.py that installed stripped shared
libraries are unchanged by the patch (using
<https://sourceware.org/pipermail/libc-alpha/2021-September/130991.html>
to fix installed library stripping in build-many-glibcs.py). Also
tested for x86_64.
This patch adds the narrowing square root functions from TS 18661-1 /
TS 18661-3 / C2X to glibc's libm: fsqrt, fsqrtl, dsqrtl, f32sqrtf64,
f32sqrtf32x, f32xsqrtf64 for all configurations; f32sqrtf64x,
f32sqrtf128, f64sqrtf64x, f64sqrtf128, f32xsqrtf64x, f32xsqrtf128,
f64xsqrtf128 for configurations with _Float64x and _Float128;
__f32sqrtieee128 and __f64sqrtieee128 aliases in the powerpc64le case
(for calls to fsqrtl and dsqrtl when long double is IEEE binary128).
Corresponding tgmath.h macro support is also added.
The changes are mostly similar to those for the other narrowing
functions previously added, so the description of those generally
applies to this patch as well. However, the not-actually-narrowing
cases (where the two types involved in the function have the same
floating-point format) are aliased to sqrt, sqrtl or sqrtf128 rather
than needing a separately built not-actually-narrowing function such
as was needed for add / sub / mul / div. Thus, there is no
__nldbl_dsqrtl name for ldbl-opt because no such name was needed
(whereas the other functions needed such a name since the only other
name for that entry point was e.g. f32xaddf64, not reserved by TS
18661-1); the headers are made to arrange for sqrt to be called in
that case instead.
The DIAG_* calls in sysdeps/ieee754/soft-fp/s_dsqrtl.c are because
they were observed to be needed in GCC 7 testing of
riscv32-linux-gnu-rv32imac-ilp32. The other sysdeps/ieee754/soft-fp/
files added didn't need such DIAG_* in any configuration I tested with
build-many-glibcs.py, but if they do turn out to be needed in more
files with some other configuration / GCC version, they can always be
added there.
I reused the same test inputs in auto-libm-test-in as for
non-narrowing sqrt rather than adding extra or separate inputs for
narrowing sqrt. The tests in libm-test-narrow-sqrt.inc also follow
those for non-narrowing sqrt.
Tested as followed: natively with the full glibc testsuite for x86_64
(GCC 11, 7, 6) and x86 (GCC 11); with build-many-glibcs.py with GCC
11, 7 and 6; cross testing of math/ tests for powerpc64le, powerpc32
hard float, mips64 (all three ABIs, both hard and soft float). The
different GCC versions are to cover the different cases in tgmath.h
and tgmath.h tests properly (GCC 6 has _Float* only as typedefs in
glibc headers, GCC 7 has proper _Float* support, GCC 8 adds
__builtin_tgmath).
We stopped adding "Contributed by" or similar lines in sources in 2012
in favour of git logs and keeping the Contributors section of the
glibc manual up to date. Removing these lines makes the license
header a bit more consistent across files and also removes the
possibility of error in attribution when license blocks or files are
copied across since the contributed-by lines don't actually reflect
reality in those cases.
Move all "Contributed by" and similar lines (Written by, Test by,
etc.) into a new file CONTRIBUTED-BY to retain record of these
contributions. These contributors are also mentioned in
manual/contrib.texi, so we just maintain this additional record as a
courtesy to the earlier developers.
The following scripts were used to filter a list of files to edit in
place and to clean up the CONTRIBUTED-BY file respectively. These
were not added to the glibc sources because they're not expected to be
of any use in future given that this is a one time task:
https://gist.github.com/siddhesh/b5ecac94eabfd72ed2916d6d8157e7dchttps://gist.github.com/siddhesh/15ea1f5e435ace9774f485030695ee02
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This patch is using the corresponding GCC builtin for roundevenf,
roundeven and roundevenl if the USE_FUNCTION_BUILTIN macros are defined
to one in math-use-builtins.h.
These builtin functions is supported since GCC 10.
The code of the generic implementation is not changed.
Signed-off-by: Shen-Ta Hsieh <ibmibmibm.tw@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This patch redirect roundeven function for futhermore changes.
Signed-off-by: Shen-Ta Hsieh <ibmibmibm.tw@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This patch replaced obsolete AC_TRY_COMPILE to AC_COMPILE_IFELSE or
AC_PREPROC_IFELSE.
It has been confirmed that GNU 'autoconf' 2.69 suppressed obsolete
warnings, updated the following files:
- configure
- sysdeps/mach/configure
- sysdeps/mach/hurd/configure
- sysdeps/s390/configure
- sysdeps/unix/sysv/linux/configure
and didn't change the following files:
- sysdeps/ieee754/ldbl-opt/configure
- sysdeps/unix/sysv/linux/powerpc/configure
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The symbol has never been exported, so no compatibility symbol is
needed. Removing this file prevents ld from creation an exported
symbol in case GLIBC_2_0 expands to a symbol version which
does not have a local: *; directive in the symbol version map file.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
These workload traces cover the whole "long double" range.
This patch was prepared with the help of Adhemerval Zanella.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
With this patch, the maximal known error for tgamma is now reduced to 9 ulps
for dbl-64, for all rounding modes. Since exhaustive testing is not possible
for dbl-64, it might be that there are still cases with an error larger than
9 ulps, but all known cases are fixed (intensive tests were done to find cases
with large errors).
Tested on x86_64 and powerpc (and by Adhemerval Zanella on aarch64, arm,
s390x, sparc, and i686).
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
For j0f/j1f/y0f/y1f, the largest error for all binary32
inputs is reduced to at most 9 ulps for all rounding modes.
The new code is enabled only when there is a cancellation at the very end of
the j0f/j1f/y0f/y1f computation, or for very large inputs, thus should not
give any visible slowdown on average. Two different algorithms are used:
* around the first 64 zeros of j0/j1/y0/y1, approximation polynomials of
degree 3 are used, computed using the Sollya tool (https://www.sollya.org/)
* for large inputs, an asymptotic formula from [1] is used
[1] Fast and Accurate Bessel Function Computation,
John Harrison, Proceedings of Arith 19, 2009.
Inputs yielding the new largest errors are added to auto-libm-test-in,
and ulps are regenerated for various targets (thanks Adhemerval Zanella).
Tested on x86_64 with --disable-multi-arch and on powerpc64le-linux-gnu.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
All of the isnan functions are in libc.so due to printf_fp, so move
__isnanf128 there too for consistency.
Reviewed-by: Tulio Magno Quites Machado Filho <tuliom@ascii.art.br>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Previous commit was missing deleted files in sysdeps/ieee754/dbl-64.
Finally remove all mpa related files, headers, declarations, probes, unused
tables and update makefiles.
Reviewed-By: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Finally remove all mpa related files, headers, declarations, probes, unused
tables and update makefiles.
Reviewed-By: Paul Zimmermann <Paul.Zimmermann@inria.fr>