Commit Graph

1614 Commits

Author SHA1 Message Date
Adhemerval Zanella 5fb4b566ef math: Use asinf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic asinf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      42.8237        35.2460        17.70%
x86_64v2                    43.3711        35.9406        17.13%
x86_64v3                    35.0335        30.5744        12.73%
i686                       213.8780        104.4710       51.15%
aarch64 (Neoverse)          17.2937        13.6025        21.34%
power10                     12.0227        7.4241         38.25%

reciprocal-throughput        master        patched   improvement
x86_64                      13.6770        15.5231       -13.50%
x86_64v2                    13.8722        16.0446       -15.66%
x86_64v3                    13.6211        13.2753         2.54%
i686                       186.7670        45.4388        75.67%
aarch64 (Neoverse)          9.96089        9.39285         5.70%
power10                      4.9862        3.7819         24.15%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 673e6fe110 math: Use acoshf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic acoshf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      61.2471        58.7742         4.04%
x86_64-v2                   62.6519        59.0523         5.75%
x86_64-v3                   58.7408        50.1393        14.64%
aarch64                     24.8580        21.3317        14.19%
power10                     17.0469        13.1345        22.95%

reciprocal-throughput        master        patched   improvement
x86_64                      16.1618        15.1864         6.04%
x86_64-v2                   15.7729        14.7563         6.45%
x86_64-v3                   14.1669        11.9568        15.60%
aarch64                      10.911        9.5486         12.49%
power10                     6.38196        5.06734        20.60%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 66fa7ad437 math: Use acosf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic acosf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      52.5098        36.6312        30.24%
x86_64v2                    53.0217        37.3091        29.63%
x86_64v3                    42.8501        32.3977        24.39%
i686                       207.3960       109.4000        47.25%
aarch64                     21.3694        13.7871        35.48%
power10                     14.5542         7.2891        49.92%

reciprocal-throughput        master        patched   improvement
x86_64                      14.1487        15.9508       -12.74%
x86_64v2                    14.3293        16.1899       -12.98%
x86_64v3                    13.6563        12.6161         7.62%
i686                       158.4060        45.7354        71.13%
aarch64                     12.5515        9.19233        26.76%
power10                      5.7868         3.3487        42.13%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-12-18 17:24:43 -03:00
Adhemerval Zanella 849c73fe2b powerpc: Update libm-test-ulps
Regen to add new functions acospi, asinpi, atan2pi, atanpi, and
tanpi.
2024-12-18 15:43:09 -03:00
Joseph Myers 3374de9038 Implement C23 atan2pi
C23 adds various <math.h> function families originally defined in TS
18661-4.  Add the atan2pi functions (atan2(y,x)/pi).

Tested for x86_64 and x86, and with build-many-glibcs.py.
2024-12-12 20:57:44 +00:00
Joseph Myers ffe79c446c Implement C23 atanpi
C23 adds various <math.h> function families originally defined in TS
18661-4.  Add the atanpi functions (atan(x)/pi).

Tested for x86_64 and x86, and with build-many-glibcs.py.
2024-12-11 21:51:49 +00:00
Peter Bergner aec85b2557 powerpc64: Fix dl-trampoline.S big-endian / non-ROP build failure
Fix a big-endian / non-ROP build failure caused by commit 4d9a4c02 when
building dl-trampoline.S.

Reported-by: Joseph Myers <josmyers@redhat.com>
2024-12-11 23:15:13 +03:00
Peter Bergner 4d9a4c02f9 powerpc64le: ROP changes for the dl-trampoline functions
Add ROP protection for the _dl_runtime_resolve and _dl_profile_resolve
functions.
2024-12-10 23:25:56 -05:00
Joseph Myers f962932206 Implement C23 asinpi
C23 adds various <math.h> function families originally defined in TS
18661-4.  Add the asinpi functions (asin(x)/pi).

Tested for x86_64 and x86, and with build-many-glibcs.py.
2024-12-10 20:42:20 +00:00
Joseph Myers 28d102d15c Implement C23 acospi
C23 adds various <math.h> function families originally defined in TS
18661-4.  Add the acospi functions (acos(x)/pi).

Tested for x86_64 and x86, and with build-many-glibcs.py.
2024-12-09 23:01:29 +00:00
Sachin Monga be13e46764 powerpc64le: ROP changes for the *context and setjmp functions
Add ROP protection for the getcontext, setcontext, makecontext, swapcontext
and __sigsetjmp_symbol functions.

Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
2024-12-09 16:49:54 -05:00
Joseph Myers f9e90e4b4c Implement C23 tanpi
C23 adds various <math.h> function families originally defined in TS
18661-4.  Add the tanpi functions (tan(pi*x)).

Tested for x86_64 and x86, and with build-many-glibcs.py.
2024-12-05 21:42:10 +00:00
Adhemerval Zanella c8d3220e64 powerpc: Update ulps
From 'Implement C23 cospi' (0ae0af68d8)
and 'Implement C23 sinpi' (776938e8b8).
2024-12-05 13:35:24 -03:00
Joseph Myers 776938e8b8 Implement C23 sinpi
C23 adds various <math.h> function families originally defined in TS
18661-4.  Add the sinpi functions (sin(pi*x)).

Tested for x86_64 and x86, and with build-many-glibcs.py.
2024-12-04 20:04:04 +00:00
Joseph Myers 0ae0af68d8 Implement C23 cospi
C23 adds various <math.h> function families originally defined in TS
18661-4.  Add the cospi functions (cos(pi*x)).

Tested for x86_64 and x86, and with build-many-glibcs.py.
2024-12-04 10:20:44 +00:00
Adhemerval Zanella 17a43505b3 elf: Consolidate stackinfo.h
And use sane default the generic implementation.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
2024-12-02 17:14:58 +00:00
Adhemerval Zanella 3b1c5a539b math: Add internal roundeven_finite
Some CORE-MATH routines uses roundeven and most of ISA do not have
an specific instruction for the operation.  In this case, the call
will be routed to generic implementation.

However, if the ISA does support round() and ctz() there is a better
alternative (as used by CORE-MATH).

This patch adds such optimization and also enables it on powerpc.
On a power10 it shows the following improvement:

expm1f                      master      patched       improvement
latency                     9.8574       7.0139            28.85%
reciprocal-throughput       4.3742       2.6592            39.21%

Checked on powerpc64le-linux-gnu and aarch64-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-26 15:07:57 -03:00
Sachin Monga 2062e02772 powerpc64le: ROP Changes for strncpy/ppc-mount
Add ROP protect instructions to strncpy and ppc-mount functions.
Modify FRAME_MIN_SIZE to 48 bytes for ELFv2 to reserve additional
16 bytes for ROP save slot and padding.

Signed-off-by: Sachin Monga <smonga@linux.ibm.com>
Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
2024-11-25 10:44:20 -05:00
Adhemerval Zanella bccb0648ea math: Use tanf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic tanf.

The code was adapted to glibc style, to use the definition of
math_config.h, to remove errno handling, and to use a generic
128 bit routine for ABIs that do not support it natively.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (neoverse1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master       patched  improvement
x86_64                       82.3961       54.8052       33.49%
x86_64v2                     82.3415       54.8052       33.44%
x86_64v3                     69.3661       50.4864       27.22%
i686                         219.271       45.5396       79.23%
aarch64                      29.2127       19.1951       34.29%
power10                      19.5060       16.2760       16.56%

reciprocal-throughput         master       patched  improvement
x86_64                       28.3976       19.7334       30.51%
x86_64v2                     28.4568       19.7334       30.65%
x86_64v3                     21.1815       16.1811       23.61%
i686                         105.016       15.1426       85.58%
aarch64                      18.1573       10.7681       40.70%
power10                       8.7207        8.7097        0.13%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-22 10:52:27 -03:00
Adhemerval Zanella d846f4c12d math: Use lgammaf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic lgammaf.

The code was adapted to glibc style, to use the definition of
math_config.h, to remove errno handling, to use math_narrow_eval
on overflow usage, and to adapt to make it reentrant.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master       patched  improvement
x86_64                       86.5609       70.3278       18.75%
x86_64v2                     78.3030       69.9709       10.64%
x86_64v3                     74.7470       59.8457       19.94%
i686                         387.355       229.761       40.68%
aarch64                      40.8341       33.7563       17.33%
power10                      26.5520       16.1672       39.11%
powerpc                      28.3145       17.0625       39.74%

reciprocal-throughput         master       patched  improvement
x86_64                       68.0461       48.3098       29.00%
x86_64v2                     55.3256       47.2476       14.60%
x86_64v3                     52.3015       38.9028       25.62%
i686                         340.848       195.707       42.58%
aarch64                      36.8000       30.5234       17.06%
power10                      20.4043       12.6268       38.12%
powerpc                      22.6588       13.8866       38.71%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-22 10:52:27 -03:00
Adhemerval Zanella baa495f231 math: Use erfcf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic erfcf.

The code was adapted to glibc style and to use the definition of
math_config.h.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master       patched  improvement
x86_64                       98.8796       66.2142       33.04%
x86_64v2                     98.9617       67.4221       31.87%
x86_64v3                     87.4161       53.1754       39.17%
aarch64                      33.8336       22.0781       34.75%
power10                      21.1750       13.5864       35.84%
powerpc                      21.4694       13.8149       35.65%

reciprocal-throughput         master       patched  improvement
x86_64                       48.5620       27.6731       43.01%
x86_64v2                     47.9497       28.3804       40.81%
x86_64v3                     42.0255       18.1355       56.85%
aarch64                      24.3938       13.4041       45.05%
power10                      10.4919        6.1881       41.02%
powerpc                       11.763       6.76468       42.49%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-22 10:52:27 -03:00
Adhemerval Zanella 994fec2397 math: Use erff from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic erff.

The code was adapted to glibc style and to use the definition of
math_config.h.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master       patched  improvement
x86_64                       85.7363       45.1372       47.35%
x86_64v2                     86.6337       38.5816       55.47%
x86_64v3                     71.3810       34.0843       52.25%
i686                         190.143       97.5014       48.72%
aarch64                      34.9091       14.9320       57.23%
power10                      38.6160        8.5188       77.94%
powerpc                      39.7446       8.45781       78.72%

reciprocal-throughput         master       patched  improvement
x86_64                       35.1739       14.7603       58.04%
x86_64v2                     34.5976       11.2283       67.55%
x86_64v3                     27.3260        9.8550       63.94%
i686                         91.0282       30.8840       66.07%
aarch64                      22.5831        6.9615       69.17%
power10                      18.0386        3.0918       82.86%
powerpc                      20.7277       3.63396       82.47%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-22 10:52:27 -03:00
Adhemerval Zanella c5d241f06b math: Use cbrtf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic cbrtf.

The code was adapted to glibc style and to use the definition of
math_config.h.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

latency                       master        patched       improvement
x86_64                       68.6348        36.8908            46.25%
x86_64v2                     67.3418        36.6968            45.51%
x86_64v3                     63.4981        32.7859            48.37%
aarch64                      29.3172        12.1496            58.56%
power10                      18.0845         8.8893            50.85%
powerpc                      18.0859        8.79527            51.37%

reciprocal-throughput         master        patched       improvement
x86_64                       36.4369        13.3565            63.34%
x86_64v2                     37.3611        13.1149            64.90%
x86_64v3                     31.6024        11.2102            64.53%
aarch64                      18.6866        7.3474             60.68%
power10                       9.4758        3.6329             61.66%
powerpc                      9.58896        3.90439            59.28%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-11-22 10:01:03 -03:00
Sachin Monga 3051f3495c powerpc64le: _init/_fini file changes for ROP
The ROP instructions were added in ISA 3.1 (ie, Power10), however they
were defined so that if executed on older cpus, they would behave as
nops.  This allows us to emit them on older cpus and they'd just be
ignored, but if run on a Power10, then the binary would be ROP protected.

Hash instructions use negative offsets so the default position
of ROP pointer is FRAME_ROP_SAVE from caller's SP.

Modified FRAME_MIN_SIZE_PARM to 112 for ELFv2 to reserve
additional 16 bytes for ROP save slot and padding.

Signed-off-by: Sachin Monga <smonga@linux.ibm.com>
Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
2024-11-20 16:50:34 -05:00
Mahesh Bodapati 3ef7e42861 powerpc64le: Optimized strcat for POWER10
This patch adds an optimized strcat which makes use of the default
strcat function which calls the Power10 strcpy and strlen routines.
2024-11-19 15:59:15 -05:00
Adhemerval Zanella f338c7c5f5 math: Use log10p1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic log10p1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      68.5251        32.2627        52.92%
x86_64v2                    68.8912        32.7887        52.41%
x86_64v3                    59.3427        27.0521        54.41%
i686                        162.026        103.383        36.19%
aarch64                     26.8513        14.5695        45.74%
power10                     12.7426         8.4929        33.35%
powerpc                     16.6768        9.29135        44.29%

reciprocal-throughput        master        patched   improvement
x86_64                      26.0969        12.4023        52.48%
x86_64v2                    25.0045        11.0748        55.71%
x86_64v3                    20.5610        10.2995        49.91%
i686                        89.8842        78.5211        12.64%
aarch64                     17.1200         9.4832        44.61%
power10                      6.7814         6.4258         5.24%
powerpc                      15.769         7.6825        51.28%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:40 -03:00
Adhemerval Zanella 8ae9e51376 math: Use log1pf from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows slight better performance to the generic log1pf.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (M1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      71.8142        38.9668        45.74%
x86_64v2                    71.9094        39.1321        45.58%
x86_64v3                    60.1000        32.4016        46.09%
i686                        147.105        104.258        29.13%
aarch64                     26.4439        14.0050        47.04%
power10                     19.4874         9.4146        51.69%
powerpc                     17.6145        8.00736        54.54%

reciprocal-throughput        master        patched   improvement
x86_64                      19.7604        12.7254        35.60%
x86_64v2                    19.0039        11.9455        37.14%
x86_64v3                    16.8559        11.9317        29.21%
i686                        82.3426        73.9718        10.17%
aarch64                     14.4665         7.9614        44.97%
power10                     11.9974         8.4117        29.89%
powerpc                     7.15222         6.0914        14.83%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:39 -03:00
Adhemerval Zanella c369580814 math: Use log2p1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic log2p1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      70.1462        47.0090        32.98%
x86_64v2                    70.2513        47.6160        32.22%
x86_64v3                    60.4840        39.9443        33.96%
i686                        164.068        122.909        25.09%
aarch64                     25.9169        16.9207        34.71%
power10                     18.1261        9.8592         45.61%
powerpc                     17.2683        9.38665        45.64%

reciprocal-throughput        master        patched   improvement
x86_64                      26.2240        16.4082        37.43%
x86_64v2                    25.0911        15.7480        37.24%
x86_64v3                    20.9371        11.7264        43.99%
i686                        90.4209        95.3073        -5.40%
aarch64                     16.8537        8.9561         46.86%
power10                     12.9401        6.5555         49.34%
powerpc                     9.01763        7.54745        16.30%

The performance decrease for i686 is mostly due the use of x87 fpu,
when building with '-msse2 -mfpmath=sse:

                             master        patched   improvement
latency                     164.068        102.982        37.23%
reciprocal-throughput       89.1968        82.5117         7.49%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:39 -03:00
Adhemerval Zanella bbd578b38d math: Use expm1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic expm1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      96.7402        36.4026        62.37%
x86_64v2                    97.5391        33.4625        65.69%
x86_64v3                    82.1778        30.8668        62.44%
i686                         120.58        94.8302        21.35%
aarch64                     32.3558        12.8881        60.17%
power10                     23.5087        9.8574         58.07%
powerpc                     23.4776        9.06325        61.40%

reciprocal-throughput        master        patched   improvement
x86_64                      27.8224        15.9255        42.76%
x86_64v2                    27.8364        9.6438         65.36%
x86_64v3                    20.3227        9.6146         52.69%
i686                        63.5629        59.4718         6.44%
aarch64                     17.4838        7.1082         59.34%
power10                     12.4644        8.7829         29.54%
powerpc                     14.2152        5.94765        58.16%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:35 -03:00
Adhemerval Zanella 5c22fd25c1 math: Use exp2m1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp2m1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).  The
only change is to handle FLT_MAX_EXP for FE_DOWNWARD or FE_TOWARDZERO.

The benchmark inputs are based on exp2f ones.

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      40.6042        48.7104       -19.96%
x86_64v2                    40.7506        35.9032        11.90%
x86_64v3                    35.2301        31.7956        9.75%
i686                        102.094        94.6657        7.28%
aarch64                     18.2704        15.1387        17.14%
power10                     11.9444         8.2402        31.01%

reciprocal-throughput        master        patched   improvement
x86_64                      20.8683        16.1428        22.64%
x86_64v2                    19.5076        10.4474        46.44%
x86_64v3                    19.2106        10.4014        45.86%
i686                        56.4054        59.3004        -5.13%
aarch64                     12.0781         7.3953        38.77%
power10                      6.5306         5.9388         9.06%

The generic implementation calls __ieee754_exp2f and x86_64 provides
an optimized ifunc version (built with -mfma -mavx2, not correctly
rounded).  This explains the performance difference for x86_64.

Same for i686, where the ABI provides an optimized __ieee754_exp2f
version built with '-msse2 -mfpmath=sse'.  When built wth same
flags, the new algorithm shows a better performance:

                            master        patched    improvement
latency                    102.094        91.2823         10.59%
reciprocal-throughput      56.4054        52.7984          6.39%

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:35 -03:00
Adhemerval Zanella 5fa89852fa math: Use exp10m1f from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp10m1f.

The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).  I mostly
fixed some small issues in corner cases (sNaN handling, -INFINITY,
a specific overflow check).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

Latency                      master        patched   improvement
x86_64                      45.4690        49.5845        -9.05%
x86_64v2                    46.1604        36.2665        21.43%
x86_64v3                    37.8442        31.0359        17.99%
i686                        121.367        93.0079        23.37%
aarch64                     21.1126        15.0165        28.87%
power10                     12.7426        8.4929         33.35%

reciprocal-throughput        master        patched   improvement
x86_64                      19.6005        17.4005        11.22%
x86_64v2                    19.6008        11.1977        42.87%
x86_64v3                    17.5427        10.2898        41.34%
i686                        59.4215        60.9675        -2.60%
aarch64                     13.9814        7.9173         43.37%
power10                      6.7814        6.4258          5.24%

The generic implementation calls __ieee754_exp10f which has an
optimized version, although it is not correctly rounded, which is
the main culprit of the the latency difference for x86_64 and
throughp for i686.

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01 11:27:26 -03:00
Sachin Monga f144dae4a1 powerpc64le: Adhere to ABI stack alignment requirement
The ABI requires all stack frames be 16-byte aligned.

Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
2024-10-28 16:12:34 -05:00
Paul Zimmermann 392b3f0971 replace tgammaf by the CORE-MATH implementation
The CORE-MATH implementation is correctly rounded (for any rounding mode).
This can be checked by exhaustive tests in a few minutes since there are
less than 2^32 values to check against for example GNU MPFR.
This patch also adds some bench values for tgammaf.

Tested on x86_64 and x86 (cfarm26).

With the initial GNU libc code it gave on an Intel(R) Core(TM) i7-8700:

      "tgammaf": {
       "": {
        "duration": 3.50188e+09,
        "iterations": 2e+07,
        "max": 602.891,
        "min": 65.1415,
        "mean": 175.094
       }
      }

With the new code:

      "tgammaf": {
       "": {
        "duration": 3.30825e+09,
        "iterations": 5e+07,
        "max": 211.592,
        "min": 32.0325,
        "mean": 66.1649
       }
      }

With the initial GNU libc code it gave on cfarm26 (i686):

  "tgammaf": {
   "": {
    "duration": 3.70505e+09,
    "iterations": 6e+06,
    "max": 2420.23,
    "min": 243.154,
    "mean": 617.509
   }
  }

With the new code:

  "tgammaf": {
   "": {
    "duration": 3.24497e+09,
    "iterations": 1.8e+07,
    "max": 1238.15,
    "min": 101.155,
    "mean": 180.276
   }
  }

Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>

Changes in v2:
    - include <math.h> (fix the linknamespace failures)
    - restored original benchtests/strcoll-inputs/filelist#en_US.UTF-8 file
    - restored original wrapper code (math/w_tgammaf_compat.c),
      except for the dealing with the sign
    - removed the tgammaf/float entries in all libm-test-ulps files
    - address other comments from Joseph Myers
      (https://sourceware.org/pipermail/libc-alpha/2024-July/158736.html)

Changes in v3:
    - pass NULL argument for signgam from w_tgammaf_compat.c
    - use of math_narrow_eval
    - added more comments

Changes in v4:
    - initialize local_signgam to 0 in math/w_tgamma_template.c
    - replace sysdeps/ieee754/dbl-64/gamma_productf.c by dummy file

Changes in v5:
    - do not mention local_signgam any more in math/w_tgammaf_compat.c
    - initialize local_signgam to 1 instead of 0 in w_tgamma_template.c
      and added comment

Changes in v6:
    - pass NULL as 2nd argument of __ieee754_gammaf_r in
      w_tgammaf_compat.c, and check for NULL in e_gammaf_r.c

Changes in v7:
    - added Signed-off-by line for Alexei Sibidanov (author of the code)

Changes in v8:
    - added Signed-off-by line for Paul Zimmermann (submitted of the patch)

Changes in v9:
    - address comments from review by Adhemerval Zanella
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-10-11 11:12:32 +02:00
Carlos O'Donell cae9944a6c Fix whitespace related license issues.
Several copies of the licenses in files contained whitespace related
problems.  Two cases are addressed here, the first is two spaces
after a period which appears between "PURPOSE." and "See". The other
is a space after the last forward slash in the URL. Both issues are
corrected and the licenses now match the official textual description
of the license (and the other license in the sources).

Since these whitespaces changes do not alter the paragraph structure of
the license, nor create new sentences, they do not change the license.
2024-10-07 18:08:16 -04:00
Florian Weimer cc3e743fc0 powerpc64le: Build new strtod tests with long double ABI flags (bug 32145)
This fixes several test failures:

=====FAIL: stdlib/tst-strtod1i.out=====
Locale tests
all OK
Locale tests
all OK
Locale tests
strtold("1,5") returns -6,38643e+367 and not 1,5
strtold("1.5") returns 1,5 and not 1
strtold("1.500") returns 1 and not 1500
strtold("36.893.488.147.419.103.232") returns 1500 and not 3,68935e+19
Locale tests
all OK

=====FAIL: stdlib/tst-strtod3.out=====
0: got wrong results -2.5937e+4826, expected 0

=====FAIL: stdlib/tst-strtod4.out=====
0: got wrong results -6,38643e+367, expected 0
1: got wrong results 0, expected 1e+06
2: got wrong results 1e+06, expected 10

=====FAIL: stdlib/tst-strtod5i.out=====
0: got wrong results -6,38643e+367, expected 0
2: got wrong results 0, expected -0
4: got wrong results -0, expected 0
5: got wrong results 0, expected -0
6: got wrong results -0, expected 0
7: got wrong results 0, expected -0
8: got wrong results -0, expected 0
9: got wrong results 0, expected -0
10: got wrong results -0, expected 0
11: got wrong results 0, expected -0
12: got wrong results -0, expected 0
13: got wrong results 0, expected -0
14: got wrong results -0, expected 0
15: got wrong results 0, expected -0
16: got wrong results -0, expected 0
17: got wrong results 0, expected -0
18: got wrong results -0, expected 0
20: got wrong results 0, expected -0
22: got wrong results -0, expected 0
23: got wrong results 0, expected -0
24: got wrong results -0, expected 0
25: got wrong results 0, expected -0
26: got wrong results -0, expected 0
27: got wrong results 0, expected -0

Fixes commit 3fc063dee0
("Make __strtod_internal tests type-generic").

Suggested-by: Joseph Myers <josmyers@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2024-09-05 22:02:23 +02:00
Jeevitha Palanisamy 29f0db6a2e powerpc64: Fix syscall_cancel build for powerpc64le-linux-gnu [BZ #32125]
In __syscall_cancel_arch, there's a tail call to __syscall_do_cancel.
On P10, since the caller uses the TOC and the callee is using
PC-relative addressing, there's only a branch instruction with no NOPs
to restore the TOC, which causes the build error. The fix involves adding
the NOTOC directive to the branch instruction, informing the linker
not to generate a TOC stub, thus resolving the issue.
2024-08-30 08:50:47 -05:00
Mahesh Bodapati 82b5340ebd powerpc64: Optimize strcpy and stpcpy for Power9/10
This patch modifies the current Power9 implementation of strcpy and
stpcpy to optimize it for Power9 and Power10.

No new Power10 instructions are used, so the original Power9 strcpy
is modified instead of creating a new implementation for Power10.

The changes also affect stpcpy, which uses the same implementation
with some additional code before returning.

Improvements compared to the old Power9 version:

Use simple comparisons for the first ~512 bytes:
  The main loop is good for long strings, but comparing 16B each time is
  better for shorter strings. After aligning the address to 16 bytes, we
  unroll the loop four times, checking 128 bytes each time. There may be
  some overlap with the main loop for unaligned strings, but it is better
  for shorter strings.

Loop with 64 bytes for longer bytes:
  Use 4 consecutive lxv/stxv instructions.

Showed an average improvement of 13%.

Reviewed-by: Paul E. Murphy <murphyp@linux.ibm.com>
Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
2024-08-23 16:48:32 -05:00
Adhemerval Zanella 89b53077d2 nptl: Fix Race conditions in pthread cancellation [BZ#12683]
The current racy approach is to enable asynchronous cancellation
before making the syscall and restore the previous cancellation
type once the syscall returns, and check if cancellation has happen
during the cancellation entrypoint.

As described in BZ#12683, this approach shows 2 problems:

  1. Cancellation can act after the syscall has returned from the
     kernel, but before userspace saves the return value.  It might
     result in a resource leak if the syscall allocated a resource or a
     side effect (partial read/write), and there is no way to program
     handle it with cancellation handlers.

  2. If a signal is handled while the thread is blocked at a cancellable
     syscall, the entire signal handler runs with asynchronous
     cancellation enabled.  This can lead to issues if the signal
     handler call functions which are async-signal-safe but not
     async-cancel-safe.

For the cancellation to work correctly, there are 5 points at which the
cancellation signal could arrive:

	[ ... )[ ... )[ syscall ]( ...
	   1      2        3    4   5

  1. Before initial testcancel, e.g. [*... testcancel)
  2. Between testcancel and syscall start, e.g. [testcancel...syscall start)
  3. While syscall is blocked and no side effects have yet taken
     place, e.g. [ syscall ]
  4. Same as 3 but with side-effects having occurred (e.g. a partial
     read or write).
  5. After syscall end e.g. (syscall end...*]

And libc wants to act on cancellation in cases 1, 2, and 3 but not
in cases 4 or 5.  For the 4 and 5 cases, the cancellation will eventually
happen in the next cancellable entrypoint without any further external
event.

The proposed solution for each case is:

  1. Do a conditional branch based on whether the thread has received
     a cancellation request;

  2. It can be caught by the signal handler determining that the saved
     program counter (from the ucontext_t) is in some address range
     beginning just before the "testcancel" and ending with the
     syscall instruction.

  3. SIGCANCEL can be caught by the signal handler and determine that
     the saved program counter (from the ucontext_t) is in the address
     range beginning just before "testcancel" and ending with the first
     uninterruptable (via a signal) syscall instruction that enters the
      kernel.

  4. In this case, except for certain syscalls that ALWAYS fail with
     EINTR even for non-interrupting signals, the kernel will reset
     the program counter to point at the syscall instruction during
     signal handling, so that the syscall is restarted when the signal
     handler returns.  So, from the signal handler's standpoint, this
     looks the same as case 2, and thus it's taken care of.

  5. For syscalls with side-effects, the kernel cannot restart the
     syscall; when it's interrupted by a signal, the kernel must cause
     the syscall to return with whatever partial result is obtained
     (e.g. partial read or write).

  6. The saved program counter points just after the syscall
     instruction, so the signal handler won't act on cancellation.
     This is similar to 4. since the program counter is past the syscall
     instruction.

So The proposed fixes are:

  1. Remove the enable_asynccancel/disable_asynccancel function usage in
     cancellable syscall definition and instead make them call a common
     symbol that will check if cancellation is enabled (__syscall_cancel
     at nptl/cancellation.c), call the arch-specific cancellable
     entry-point (__syscall_cancel_arch), and cancel the thread when
     required.

  2. Provide an arch-specific generic system call wrapper function
     that contains global markers.  These markers will be used in
     SIGCANCEL signal handler to check if the interruption has been
     called in a valid syscall and if the syscalls has side-effects.

     A reference implementation sysdeps/unix/sysv/linux/syscall_cancel.c
     is provided.  However, the markers may not be set on correct
     expected places depending on how INTERNAL_SYSCALL_NCS is
     implemented by the architecture.  It is expected that all
     architectures add an arch-specific implementation.

  3. Rewrite SIGCANCEL asynchronous handler to check for both canceling
     type and if current IP from signal handler falls between the global
     markers and act accordingly.

  4. Adjust libc code to replace LIBC_CANCEL_ASYNC/LIBC_CANCEL_RESET to
     use the appropriate cancelable syscalls.

  5. Adjust 'lowlevellock-futex.h' arch-specific implementations to
     provide cancelable futex calls.

Some architectures require specific support on syscall handling:

  * On i386 the syscall cancel bridge needs to use the old int80
    instruction because the optimized vDSO symbol the resulting PC value
    for an interrupted syscall points to an address outside the expected
    markers in __syscall_cancel_arch.  It has been discussed in LKML [1]
    on how kernel could help userland to accomplish it, but afaik
    discussion has stalled.

    Also, sysenter should not be used directly by libc since its calling
    convention is set by the kernel depending of the underlying x86 chip
    (check kernel commit 30bfa7b3488bfb1bb75c9f50a5fcac1832970c60).

  * mips o32 is the only kABI that requires 7 argument syscall, and to
    avoid add a requirement on all architectures to support it, mips
    support is added with extra internal defines.

Checked on aarch64-linux-gnu, arm-linux-gnueabihf, powerpc-linux-gnu,
powerpc64-linux-gnu, powerpc64le-linux-gnu, i686-linux-gnu, and
x86_64-linux-gnu.

[1] https://lkml.org/lkml/2016/3/8/1105
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2024-08-23 14:27:43 -03:00
Florian Weimer 9446351dac powerpc64le: Update ulps
Based on results from a POWER8 system with a GCC 8 build.
2024-08-08 13:42:12 +02:00
Adhemerval Zanella 6396e10b20 powerpc: Regenerate ULPs for soft-fp
From new tests added by 0797283910.
2024-08-07 11:02:03 -03:00
Adhemerval Zanella 6411dba836 powerpc: Update soft-fp ulps
From new tests added by 0797283910.
2024-08-07 11:02:03 -03:00
Adhemerval Zanella fa00661082 powerpc: Regenerate ULPs for soft-fp
From new tests added by 4dc22baa84.
2024-07-25 10:33:40 -03:00
jeevitha 4e40c8104f powerpc: Update ulps for fpu
Adjust the ULPs for the log2p1 implementation.
2024-07-25 10:28:47 -03:00
Adhemerval Zanella e0f7da7235
powerpc: Update soft-fp ulps
Results based on regen-ulps using gcc 11.2.1 on a POWER8 machine.
2024-07-19 19:29:35 +02:00
Florian Weimer 71dafdf5f1 powerpc: Update ulps
Results based on POWER8 and POWER9 machines running
powerpc64-linux-gnu, with and without --disable-multi-arch.
2024-06-20 12:15:31 +02:00
Adhemerval Zanella 52b397bafa powerpc: Update ulps
For the exp10m1, exp2m1, and log10p1 implementations.
2024-06-18 17:31:10 -03:00
Stefan Liebler e260ceb4aa elf: Remove HWCAP_IMPORTANT
Remove the definitions of HWCAP_IMPORTANT after removal of
LD_HWCAP_MASK / tunable glibc.cpu.hwcap_mask.  There HWCAP_IMPORTANT
was used as default value.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-18 10:45:36 +02:00
Stefan Liebler 343439a31e elf: Remove _DL_PLATFORMS_COUNT
Remove the definitions of _DL_PLATFORMS_COUNT as those are not used
anymore after removal in elf/dl-cache.c:search_cache().

Note: On x86, we can also get rid of the definitions
HWCAP_PLATFORMS_START and HWCAP_PLATFORMS_COUNT.
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-18 10:45:36 +02:00
Stefan Liebler ed23449dac elf: Remove _DL_HWCAP_PLATFORM
Remove the definitions of _DL_HWCAP_PLATFORM as those are not used
anymore after removal in elf/dl-cache.c:search_cache().
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2024-06-18 10:45:36 +02:00
Andreas K. Hüttel 98ffc1bfeb
Convert to autoconf 2.72 (vanilla release, no distribution patches)
As discussed at the patch review meeting

Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
Reviewed-by: Simon Chopin <simon.chopin@canonical.com>
2024-06-17 21:15:28 +02:00