Commit Graph

40385 Commits

Florian Weimer 51210d6496 nptl: PTHREAD_COND_INITIALIZER compatibility with pre-2.41 versions (bug 32786)
[BZ #25847]

The new initializer and struct layout do not initialize the
__g_signals field in the old struct layout before the change in
commit c36fc50781 ("nptl: Remove
g_refs from condition variables").  Bring back fields at the end
of struct __pthread_cond_s, so that they are again zero-initialized.

(cherry picked from commit dbc5a50d12)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:56:56 -04:00
Malte Skarupke 39a80f4035 nptl: Use all of g1_start and g_signals
[BZ #25847]

The LSB of g_signals was unused. The LSB of g1_start was used to indicate
which group is G2. This was used to always go to sleep in pthread_cond_wait
if a waiter is in G2. A comment earlier in the file says that this is not
correct to do:

 "Waiters cannot determine whether they are currently in G2 or G1 -- but they
  do not have to because all they are interested in is whether there are
  available signals"

I either would have had to update the comment, or get rid of the check. I
chose to get rid of the check. In fact I don't quite know why it was there.
There will never be available signals for group G2, so we didn't need the
special case. Even if there were, this would just be a spurious wake. This
might have caught some cases where the count has wrapped around, but it
wouldn't reliably do that, (and even if it did, why would you want to force a
sleep in that case?) and we don't support that many concurrent waiters
anyway. Getting rid of it allows us to use one more bit, making us more
robust to wraparound.

(cherry picked from commit 91bb902f58)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:56:09 -04:00
Malte Skarupke 8899e89b29 nptl: rename __condvar_quiesce_and_switch_g1
[BZ #25847]

This function no longer waits for threads to leave g1, so rename it to
__condvar_switch_g1.

(cherry picked from commit 4b79e27a50)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:56:03 -04:00
Malte Skarupke 5765653697 nptl: Fix indentation
[BZ #25847]

In my previous change I turned a nested loop into a simple loop. I'm doing
the resulting indentation changes in a separate commit to make the diff on
the previous commit easier to review.

(cherry picked from commit ee6c14ed59)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:55:48 -04:00
Malte Skarupke 6bac834c5a nptl: Use a single loop in pthread_cond_wait instead of a nested loop
[BZ #25847]

The loop was a little more complicated than necessary. There was only one
break statement out of the inner loop, and the outer loop was nearly empty.
So just remove the outer loop, moving its code to the one break statement in
the inner loop. This allows us to replace all gotos with break statements.

(cherry picked from commit 929a4764ac)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:55:40 -04:00
Malte Skarupke 7625579f11 nptl: Remove g_refs from condition variables
[BZ #25847]

This variable used to be needed to wait in group switching until all sleepers
have confirmed that they have woken. This is no longer needed. Nothing waits
on this variable so there is no need to track how many threads are currently
asleep in each group.

(cherry picked from commit c36fc50781)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:55:31 -04:00
Malte Skarupke 44eaf0615d nptl: Remove unnecessary quadruple check in pthread_cond_wait
[BZ #25847]

pthread_cond_wait was checking whether it was in a closed group no less than
four times. Checking once is enough. Here are the four checks:

1. While spin-waiting. This was dead code: maxspin is set to 0 and has been
   for years.
2. Before deciding to go to sleep, and before incrementing grefs: I kept this one.
3. After incrementing grefs. There is no reason to think that the group would
   close while we do an atomic increment. Obviously it could close at any
   point, but that doesn't mean we have to recheck after every step. This
   check was equally good as check 2, except it has to do more work.
4. When we find ourselves in a group that has a signal. We only get here after
   we check that we're not in a closed group. There is no need to check again.
   The check would only have helped in cases where the compare_exchange in the
   next line would also have failed. Relying on the compare_exchange is fine.

Removing the duplicate checks clarifies the code.

(cherry picked from commit 4f7b051f8e)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:55:25 -04:00
Malte Skarupke 1fa5e51897 nptl: Remove unnecessary catch-all-wake in condvar group switch
[BZ #25847]

This wake is unnecessary. We only switch groups after every sleeper in a group
has been woken. Sure, they may take a while to actually wake up and may still
hold a reference, but waking them a second time doesn't speed that up. Instead
this just makes the code more complicated and may hide problems.

In particular this safety wake wouldn't even have helped with the bug that was
fixed by Barrus' patch: The bug there was that pthread_cond_signal would not
switch g1 when it should, so we wouldn't even have entered this code path.

(cherry picked from commit b42cc6af11)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:55:18 -04:00
Malte Skarupke b5c4727e59 nptl: Update comments and indentation for new condvar implementation
[BZ #25847]

Some comments were wrong after the most recent commit. This fixes that.

Also fixing indentation where it was using spaces instead of tabs.

(cherry picked from commit 0cc973160c)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:55:01 -04:00
Frank Barrus 1a0d73a625 pthreads NPTL: lost wakeup fix 2
[BZ #25847]

This fixes the lost wakeup (from a bug in signal stealing) with a change
in the usage of g_signals[] in the condition variable internal state.
It also completely eliminates the concept and handling of signal stealing,
as well as the need for signalers to block to wait for waiters to wake
up every time there is a G1/G2 switch.  This greatly reduces the average
and maximum latency for pthread_cond_signal.

The g_signals[] field now contains a signal count that is relative to
the current g1_start value.  Since it is a 32-bit field, and the LSB is
still reserved (though not currently used anymore), it has a 31-bit value
that corresponds to the low 31 bits of the sequence number in g1_start.
(since g1_start also has an LSB flag, this means bits 31:1 in g_signals
correspond to bits 31:1 in g1_start, plus the current signal count)

By making the signal count relative to g1_start, there is no longer
any ambiguity or A/B/A issue, and thus any checks before blocking,
including the futex call itself, are guaranteed not to block if the G1/G2
switch occurs, even if the signal count remains the same.  This allows
initially safely blocking in G2 until the switch to G1 occurs, and
then transitioning from G1 to a new G1 or G2, and always being able to
distinguish the state change.  This removes the race condition and A/B/A
problems that otherwise occurred if a late (pre-empted) waiter were to
resume just as the futex call attempted to block on g_signal since
otherwise there was no last opportunity to re-check things like whether
the current G1 group was already closed.

By fixing these issues, the signal stealing code can be eliminated,
since there is no concept of signal stealing anymore.  The code to block
for all waiters to exit g_refs can also be removed, since any waiters
that are still in the g_refs region can be guaranteed to safely wake
up and exit.  If there are still any left at this time, they are all
sent one final futex wakeup to ensure that they are not blocked any
longer, but there is no need for the signaller to block and wait for
them to wake up and exit the g_refs region.

The signal count is then effectively "zeroed" but since it is now
relative to g1_start, this is done by advancing it to a new value that
can be observed by any pending blocking waiters.  Any late waiters can
always tell the difference, and can thus just cleanly exit if they are
in a stale G1 or G2.  They can never steal a signal from the current
G1 if they are not in the current G1, since the signal value that has
to match in the cmpxchg has the low 31 bits of the g1_start value
contained in it, and that's first checked, and then it won't match if
there's a G1/G2 change.

Note: the 31-bit sequence number used in g_signals is designed to
handle wrap-around when checking the signal count, but if the entire
31-bit wraparound (2 billion signals) occurs while there is still a
late waiter that has not yet resumed, and it happens to then match
the current g1_start low bits, and the pre-emption occurs after the
normal "closed group" checks (which are 64-bit) but then hits the
futex syscall and signal consuming code, then an A/B/A issue could
still result and cause an incorrect assumption about whether it
should block.  This particular scenario seems unlikely in practice.
Note that once awake from the futex, the waiter would notice the
closed group before consuming the signal (since that's still a 64-bit
check that would not be aliased in the wrap-around in g_signals),
so the biggest impact would be blocking on the futex until the next
full wakeup from a G1/G2 switch.

(cherry picked from commit 1db84775f8)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:54:25 -04:00
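
A minimal sketch of the relative signal count described above, with
invented names and a simplified layout (illustrative only, not glibc's
internal condvar code):

#include <stdint.h>
#include <stdbool.h>

/* g_signals holds the low 31 bits of g1_start plus the signal count,
   with the LSB reserved.  */
static bool
signal_available_sketch (uint32_t g_signals, uint64_t g1_start)
{
  uint32_t base = (uint32_t) g1_start & ~1u;      /* low bits of g1_start */
  uint32_t relative = (g_signals & ~1u) - base;   /* wraps harmlessly */
  /* Non-zero means a signal was posted for the current G1; a stale
     waiter from an earlier G1/G2 sees a mismatching base instead.  */
  return relative != 0;
}
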
Florian Weimer 5a6276d97a Fix error reporting (false negatives) in SGID tests
And simplify the interface of support_capture_subprogram_self_sgid.

Use the existing framework for temporary directories (now with
mode 0700) and directory/file deletion.  Handle all execution
errors within support_capture_subprogram_self_sgid.  In particular,
this includes test failures because the invoked program did not
exit with exit status zero.  Existing tests that expect exit
status 42 are adjusted to use zero instead.

In addition, fix callers not to call exit (0) with test failures
pending (which may mask them, especially when running with --direct).

Fixes commit 35fc356fa3
("elf: Fix subprocess status handling for tst-dlopen-sgid (bug 32987)").

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 3a3fb2ed83)
2025-06-20 11:12:24 +02:00
Florian Weimer 81f58dd9b7 support: Pick group in support_capture_subprogram_self_sgid if UID == 0
When running as root, it is likely that we can run under any group.
Pick a harmless group from /etc/group in this case.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 2f769cec44)
2025-06-20 11:10:41 +02:00
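
A rough sketch of the group selection described above (illustrative
only; the actual support/ helper differs): walk /etc/group with getgrent
and pick the first non-root group.

#include <grp.h>
#include <sys/types.h>
#include <stddef.h>

static gid_t
pick_harmless_group (void)
{
  gid_t gid = (gid_t) -1;
  struct group *gr;
  setgrent ();
  while ((gr = getgrent ()) != NULL)
    if (gr->gr_gid != 0)
      {
        gid = gr->gr_gid;
        break;
      }
  endgrent ();
  return gid;
}
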
Florian Weimer ca7e32d024 elf: Fix subprocess status handling for tst-dlopen-sgid (bug 32987)
This should really move into support_capture_subprogram_self_sgid.

Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit 35fc356fa3)
2025-05-21 08:57:17 +02:00
Sunil K Pandey ca41fe44a5 x86_64: Fix typo in ifunc-impl-list.c.
Fix wcsncpy and wcpncpy typo in ifunc-impl-list.c.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit f2aeb6ff94)
2025-05-20 16:49:58 -07:00
Florian Weimer 31fa0f73e2 elf: Test case for bug 32976 (CVE-2025-4802)
Check that LD_LIBRARY_PATH is ignored for AT_SECURE statically
linked binaries, using support_capture_subprogram_self_sgid.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit d8f7a79335)
2025-05-20 20:26:07 +02:00
Florian Weimer 4335cd9b58 support: Add support_record_failure_barrier
This can be used to stop execution after a TEST_COMPARE_BLOB
failure, for example.

(cherry picked from commit d0b8aa6de4)
2025-05-20 20:26:07 +02:00
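
A hypothetical usage fragment for the new helper, assuming it is
declared in <support/check.h> alongside the TEST_COMPARE_BLOB macro:

#include <support/check.h>
#include <string.h>

static const char expected[] = "hello";

static int
do_test (void)
{
  char actual[6];
  memcpy (actual, "hello", 6);
  TEST_COMPARE_BLOB (actual, sizeof actual, expected, sizeof expected);
  /* Stop here if the comparison recorded a failure, so later checks do
     not run against inconsistent data.  */
  support_record_failure_barrier ();
  /* ... further checks that assume the blob matched ... */
  return 0;
}
#include <support/test-driver.c>
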
Florian Weimer 454f24e981 support: Use const char * argument in support_capture_subprogram_self_sgid
The function does not modify the passed-in string, so make this clear
via the prototype.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit f0c09fe616)
2025-05-20 20:26:07 +02:00
Adhemerval Zanella 3be3728df2 elf: Ignore LD_LIBRARY_PATH and debug env var for setuid for static
It mimics the ld.so behavior.

Checked on x86_64-linux-gnu.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>

(cherry picked from commit 5451fa962c)

Changes:

	git/elf/dl-support.c
	  (missing commit 55f41ef8de
	   ("elf: Remove LD_PROFILE for static binaries"))
2025-05-20 20:25:57 +02:00
Wilco Dijkstra 5a08d049dc math: Improve layout of exp/exp10 data
GCC aligns global data to 16 bytes if their size is >= 16 bytes.  This patch
changes the exp_data struct slightly so that the fields are better aligned
and without gaps.  As a result on targets that support them, more load-pair
instructions are used in exp.

The exp benchmark improves by 2.5%, "144bits" by 7.2%, "768bits" by 12.7% on
Neoverse V2.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 5afaf99edb)
2025-02-28 14:52:00 +00:00
Wilco Dijkstra 097299ffa9 AArch64: Use prefer_sve_ifuncs for SVE memset
Use prefer_sve_ifuncs for SVE memset just like memcpy.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 0f044be1da)
2025-02-28 14:44:25 +00:00
Wilco Dijkstra 52c2b1556f AArch64: Add SVE memset
Add SVE memset based on the generic memset with predicated load for sizes < 16.
Unaligned memsets of 128-1024 are improved by ~20% on average by using aligned
stores for the last 64 bytes.  Performance of random memset benchmark improves
by ~2% on Neoverse V1.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 163b1bbb76)
2025-02-28 14:44:23 +00:00
Wilco Dijkstra 3de5112326 math: Improve layout of expf data
GCC aligns global data to 16 bytes if their size is >= 16 bytes.  This patch
changes the exp2f_data struct slightly so that the fields are better aligned.
As a result on targets that support them, load-pair instructions accessing
poly_scaled and invln2_scaled are now 16-byte aligned.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 44fa9c1080)
2025-02-28 14:42:29 +00:00
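
The idea behind both data-layout changes above, sketched with a
hypothetical struct (field names and contents are illustrative, not the
actual exp2f_data or exp_data definitions):

/* With a total size >= 16 bytes, GCC places the global at a 16-byte
   boundary; keeping fields that are loaded together adjacent and free
   of padding lets targets with load-pair instructions fetch them with
   aligned paired loads.  */
struct expf_consts_sketch
{
  double invln2_scaled;      /* loaded together with the next field */
  double shift;
  double poly_scaled[4];     /* starts 16-byte aligned, no gaps */
};
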
Wilco Dijkstra 5fe151d86a AArch64: Remove zva_128 from memset
Remove ZVA 128 support from memset - the new memset no longer
guarantees count >= 256, which can result in underflow and a
crash if ZVA size is 128 ([1]).  Since only one CPU uses a ZVA
size of 128 and its memcpy implementation was removed in commit
e162ab2bf1, remove this special
case too.

[1] https://sourceware.org/pipermail/libc-alpha/2024-November/161626.html

Reviewed-by: Andrew Pinski <quic_apinski@quicinc.com>
(cherry picked from commit a08d9a52f9)
2025-02-28 14:42:29 +00:00
Wilco Dijkstra 95aa21432c AArch64: Optimize memset
Improve small memsets by avoiding branches and use overlapping stores.
Use DC ZVA for copies over 128 bytes.  Remove unnecessary code for ZVA sizes
other than 64 and 128.  Performance of random memset benchmark improves by 24%
on Neoverse N1.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit cec3aef324)
2025-02-28 14:42:27 +00:00
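
The overlapping-store trick mentioned above, sketched in C for a single
size class (the real implementation is AArch64 assembly; names here are
illustrative):

#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Assumes 8 <= n <= 16: two possibly overlapping 8-byte stores cover
   the whole range without branching on the exact length.  */
static void
memset_small_sketch (void *dst, int c, size_t n)
{
  uint64_t v = 0x0101010101010101ULL * (uint8_t) c;
  memcpy (dst, &v, 8);                      /* first 8 bytes */
  memcpy ((char *) dst + n - 8, &v, 8);     /* last 8 bytes, may overlap */
}
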
Wilco Dijkstra 9ca74b8ad1 AArch64: Improve generic strlen
Improve performance by handling another 16 bytes before entering the loop.
Use ADDHN in the loop to avoid SHRN+FMOV when it terminates.  Change final
size computation to avoid increasing latency.  On Neoverse V1 performance
of the random strlen benchmark improves by 4.6%.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 3dc426b642)
2025-02-28 14:41:56 +00:00
Siddhesh Poyarekar f984e2d7e8 assert: Add test for CVE-2025-0395
Use the __progname symbol to override the program name to induce the
failure that CVE-2025-0395 describes.

This is related to BZ #32582

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit cdb9ba8419)
2025-02-13 13:02:23 -05:00
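
A hedged sketch of the approach described above (the exact string length
and the harness details of the real test differ):

#include <assert.h>
#include <string.h>

extern char *__progname;

static char long_name[1024];

static int
do_test (void)
{
  memset (long_name, 'A', sizeof long_name - 1);
  __progname = long_name;    /* the assert message embeds __progname */
  assert (1 == 2);           /* triggers the affected message path */
  return 1;                  /* not reached */
}
#include <support/test-driver.c>
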
H.J. Lu 650a0aaaff stdlib: Test using setenv with updated environ [BZ #32588]
Add a test for setenv with updated environ.  Verify that BZ #32588 is
fixed.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 8ab34497de)
2025-01-25 10:30:58 +08:00
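
One plausible shape of the scenario being tested (a simplified sketch,
not the actual test in the tree):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main (void)
{
  /* Replace environ with an application-provided array, then call
     setenv, which must cope with the updated pointer.  */
  static char *new_environ[] = { NULL };
  environ = new_environ;
  if (setenv ("TEST_VAR", "value", 1) != 0)
    {
      perror ("setenv");
      return 1;
    }
  printf ("TEST_VAR=%s\n", getenv ("TEST_VAR"));
  return 0;
}
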
Siddhesh Poyarekar c32fd59314 Fix underallocation of abort_msg_s struct (CVE-2025-0395)
Include the space needed to store the length of the message itself, in
addition to the message string.  This resolves BZ #32582.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 68ee0f704c)
2025-01-22 17:10:37 +01:00
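
The sizing issue in a nutshell, sketched with the layout the fix
describes (illustrative; abort_msg_s is internal to glibc):

#include <string.h>
#include <stddef.h>

struct abort_msg_s
{
  unsigned int size;   /* length bookkeeping stored with the message */
  char msg[];
};

/* Correct allocation: header (including the size field) plus the
   message text and its terminating NUL, not just the string length.  */
static size_t
abort_msg_alloc_size (const char *str)
{
  return sizeof (struct abort_msg_s) + strlen (str) + 1;
}
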
Florian Weimer 549e7f7c5a elf: Support recursive use of dynamic TLS in interposed malloc
It turns out that quite a few applications use bundled mallocs that
have been built to use global-dynamic TLS (instead of the recommended
initial-exec TLS).  The previous workaround from
commit afe42e935b ("elf: Avoid some
free (NULL) calls in _dl_update_slotinfo") does not fix all
encountered cases, unfortunately.

This change avoids the TLS generation update for recursive use
of TLS from a malloc that was called during a TLS update.  This
is possible because an interposed malloc has a fixed module ID and
TLS slot.  (It cannot be unloaded.)  If an initially-loaded module ID
is encountered in __tls_get_addr and the dynamic linker is already
in the middle of a TLS update, use the outdated DTV, thus avoiding
another call into malloc.  It's still necessary to update the
DTV to the most recent generation, to get out of the slow path,
which is why the check for recursion is needed.

The bookkeeping is done using a global counter instead of a per-thread
flag because TLS access in the dynamic linker is tricky.

All this will go away once the dynamic linker stops using malloc
for TLS, likely as part of a change that pre-allocates all TLS
during pthread_create/dlopen.

Fixes commit d2123d6827 ("elf: Fix slow
tls access after dlopen [BZ #19924]").

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 018f0fc3b8)
2025-01-10 16:49:38 -08:00
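
A conceptual sketch of the recursion guard described above; the helper
functions and the counter name are hypothetical stand-ins for the
dynamic linker's internals:

#include <stddef.h>

extern void *lookup_in_current_dtv (size_t module_id, size_t offset);
extern void update_dtv_to_current_generation (void);   /* may call malloc */

static unsigned int tls_update_in_progress;   /* global, not per-thread */

void *
tls_get_addr_slow_sketch (size_t module_id, size_t offset)
{
  if (tls_update_in_progress > 0)
    /* Re-entered from an interposed malloc during a DTV update: the
       malloc's module ID and slot are fixed, so the outdated DTV is
       still valid for it; skip the update to avoid calling malloc.  */
    return lookup_in_current_dtv (module_id, offset);

  ++tls_update_in_progress;
  update_dtv_to_current_generation ();
  --tls_update_in_progress;
  return lookup_in_current_dtv (module_id, offset);
}
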
Florian Weimer 48642ef1a5 elf: Avoid some free (NULL) calls in _dl_update_slotinfo
This has been confirmed to work around some interposed mallocs.  Here
is a discussion of the impact test ust/libc-wrapper/test_libc-wrapper
in lttng-tools:

  New TLS usage in libgcc_s.so.1, compatibility impact
  <https://inbox.sourceware.org/libc-alpha/8734v1ieke.fsf@oldenburg.str.redhat.com/>

Reportedly, this patch also papers over a similar issue when tcmalloc
2.9.1 is not compiled with -ftls-model=initial-exec.  Of course the
goal really should be to compile mallocs with the initial-exec TLS
model, but this commit appears to be a useful interim workaround.

Fixes commit d2123d6827 ("elf: Fix slow
tls access after dlopen [BZ #19924]").

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit afe42e935b)
2025-01-10 16:48:26 -08:00
Noah Goldstein 12fec8aae5 x86/string: Fixup alignment of main loop in str{n}cmp-evex [BZ #32212]
The loop should be aligned to 32 bytes so that it can ideally run out
of the DSB. This is particularly important on Skylake-Server, where
deficiencies in its DSB implementation make it prone to not being
able to run loops out of the DSB.

For example running strcmp-evex on 200Mb string:

32-byte aligned loop:
    - 43,399,578,766      idq.dsb_uops
not 32-byte aligned loop:
    - 6,060,139,704       idq.dsb_uops

This results in a 25% performance degradation for the non-aligned
version.

The fix is to just ensure the code layout is such that the loop is
aligned. (Which was previously the case but was accidentally dropped
in 84e7c46df).

NB: The fix was actually 64-byte alignment. This is because 64-byte
alignment generally produces more stable performance than 32-byte
aligned code (cache line crosses can affect perf), so if we are going
past 16-byte alignment, might as well go to 64. 64-byte alignment
also matches most other functions we over-align, so it creates a
common point of optimization.

Times are reported as ratio of Time_With_Patch /
Time_Without_Patch. Lower is better.

The values being reported is the geometric mean of the ratio across
all tests in bench-strcmp and bench-strncmp.

Note this patch is only attempting to improve the Skylake-Server
strcmp for long strings. The rest of the numbers are only to test for
regressions.

Tigerlake Results Strings <= 512:
    strcmp : 1.026
    strncmp: 0.949

Tigerlake Results Strings > 512:
    strcmp : 0.994
    strncmp: 0.998

Skylake-Server Results Strings <= 512:
    strcmp : 0.945
    strncmp: 0.943

Skylake-Server Results Strings > 512:
    strcmp : 0.778
    strncmp: 1.000

The 2.6% regression on TGL-strcmp is due to slowdowns caused by
changes in alignment of code handling small sizes (most on the
page-cross logic). These should be safe to ignore because 1) We
previously only 16-byte aligned the function so this behavior is not
new and was essentially up to chance before this patch and 2) this
type of alignment related regression on small sizes really only comes
up in tight micro-benchmark loops and is unlikely to have any effect
on real-world performance.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 483443d321)
2025-01-09 17:23:28 -08:00
Noah Goldstein 04b8d48432 x86: Improve large memset perf with non-temporal stores [RHEL-29312]
Previously we used `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
for data using different strategies for large memset on ICX and SKX.

Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But, the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to only rely on `rep stosb` for large memsets. On the flip
side, non-temporal writes can avoid the data read in their RFO requests,
saving memory bandwidth.

All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.

The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:

Geometric Mean across the suite New / Old EVEX256: 0.979
Geometric Mean across the suite New / Old EVEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979

Most of the cases are essentially unchanged; this is mostly to show
that adding the non-temporal case didn't add any regressions to the
other cases.

The results on the memset-large benchmark suite on TGL-client for N=20
runs:

Geometric Mean across the suite New / Old EVEX256: 0.926
Geometric Mean across the suite New / Old EVEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924

So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth so
saving bandwidth on RFOs is less impactful), but still advantageous.

Full test-suite passes on x86_64 w/ and w/o multiarch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 5bf0ab8057)
2025-01-09 17:23:28 -08:00
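
The strategy in plain C, using SSE2 non-temporal store intrinsics
(illustrative only; the real routine is hand-written assembly with
different dispatch thresholds):

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Assumes dst is 16-byte aligned and n is a multiple of 64.  */
static void
memset_large_nt_sketch (void *dst, int c, size_t n)
{
  __m128i v = _mm_set1_epi8 ((char) c);
  char *p = dst;
  for (size_t i = 0; i < n; i += 64)
    {
      _mm_stream_si128 ((__m128i *) (p + i), v);
      _mm_stream_si128 ((__m128i *) (p + i + 16), v);
      _mm_stream_si128 ((__m128i *) (p + i + 32), v);
      _mm_stream_si128 ((__m128i *) (p + i + 48), v);
    }
  _mm_sfence ();   /* order the non-temporal stores before later accesses */
}
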
Gabi Falk dc1762113d x86_64: Fix missing wcsncat function definition without multiarch (x86-64-v4)
This code expects the WCSCAT preprocessor macro to be predefined in case
the evex implementation of the function should be defined with a name
different from __wcsncat_evex.  However, when glibc is built for
x86-64-v4 without multiarch support, sysdeps/x86_64/wcsncat.S defines
the WCSNCAT variable instead of WCSCAT to build it as wcsncat.  Rename the
variable to WCSNCAT, as it is actually a better naming choice for the
variable in this case.

Reported-by: Kenton Groombridge
Link: https://bugs.gentoo.org/921945
Fixes: 64b8b6516b ("x86: Add evex optimized functions for the wchar_t strcpy family")
Signed-off-by: Gabi Falk <gabifalk@gmx.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit dd5f891c1a)
2025-01-09 17:23:28 -08:00
H.J. Lu 0d14bf0754 sysdeps/x86/Makefile: Split and sort tests
Put each test on a separate line and sort tests.

(cherry picked from commit 7e03e0de7e)
2025-01-09 17:23:28 -08:00
Noah Goldstein 5a64f93365 x86: Only align destination to 1x VEC_SIZE in memset 4x loop
Current code aligns to 2x VEC_SIZE. Aligning to 2x has no effect on
performance other than potentially resulting in an additional
iteration of the loop.
1x maintains aligned stores (the only reason to align in this case)
and doesn't incur any unnecessary loop iterations.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>

(cherry picked from commit 9469261cf1)
2025-01-09 17:23:28 -08:00
Szabolcs Nagy 7772f9358c elf: Fix slow tls access after dlopen [BZ #19924]
In short: __tls_get_addr checks the global generation counter and if
the current dtv is older then _dl_update_slotinfo updates dtv up to the
generation of the accessed module. So if the global generation is newer
than generation of the module then __tls_get_addr keeps hitting the
slow dtv update path. The dtv update path includes a number of checks
to see if any update is needed and this already causes measurable tls
access slow down after dlopen.

It may be possible to detect up-to-date dtv faster.  But if there are
many modules loaded (> TLS_SLOTINFO_SURPLUS) then this requires at
least walking the slotinfo list.

This patch tries to update the dtv to the global generation instead, so
after a dlopen the tls access slow path is only hit once.  The modules
with larger generation than the accessed one were not necessarily
synchronized before, so additional synchronization is needed.

This patch uses acquire/release synchronization when accessing the
generation counter.

Note: in the x86_64 version of dl-tls.c the generation is only loaded
once, since relaxed mo is not faster than acquire mo load.

I have not benchmarked this. Tested by Adhemerval Zanella on aarch64,
powerpc, sparc, x86 who reported that it fixes the performance issue
of bug 19924.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit d2123d6827)
2025-01-09 07:31:25 -08:00
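
The fast-path check described above, sketched with hypothetical names
(the per-thread DTV bookkeeping and the actual update routine are more
involved):

#include <stdatomic.h>
#include <stddef.h>

extern _Atomic size_t global_tls_generation;   /* hypothetical stand-in */
extern size_t dtv_generation;                  /* per-thread in reality */
extern void *lookup_in_dtv (void);             /* hypothetical */
extern void update_dtv_to (size_t generation); /* hypothetical */

void *
tls_access_sketch (void)
{
  size_t gen = atomic_load_explicit (&global_tls_generation,
                                     memory_order_acquire);
  if (dtv_generation != gen)
    /* Update all the way to the global generation so that later
       accesses after this dlopen take the fast path again.  */
    update_dtv_to (gen);
  return lookup_in_dtv ();
}
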
H.J. Lu 58822f954f x86: Check the lower byte of EAX of CPUID leaf 2 [BZ #30643]
The old Intel software developer manual specified that the low byte of
EAX of CPUID leaf 2 returned 1, which indicated the number of rounds of
CPUID leaf 2 needed to retrieve the complete cache information.  The
newer Intel manual has been changed to say that it should always return 1
and be ignored.  If the lower byte isn't 1, CPUID leaf 2 can't be used.
In this case, we ignore CPUID leaf 2 and use CPUID leaf 4 instead.  If
CPUID leaf 4 doesn't contain the cache information, cache information
isn't available at all.  This addresses BZ #30643.

(cherry picked from commit 1493622f4f)
2025-01-09 07:31:17 -08:00
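
The check in isolation, using GCC's <cpuid.h> (a sketch of the idea, not
the code in sysdeps/x86):

#include <cpuid.h>
#include <stdbool.h>

/* CPUID leaf 2 is trusted only when the low byte of EAX is 1;
   otherwise cache information must come from leaf 4.  */
static bool
cpuid_leaf2_usable (void)
{
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid (2, &eax, &ebx, &ecx, &edx))
    return false;
  return (eax & 0xff) == 1;
}
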
H.J. Lu c92946d9b2 x86_64: Add log1p with FMA
On Skylake, it changes log1p bench performance by:

        Before       After     Improvement
max     63.349       58.347       8%
min     4.448        5.651        -30%
mean    12.0674      10.336       14%

The minimum code path is

 if (hx < 0x3FDA827A)                          /* x < 0.41422  */
    {
      if (__glibc_unlikely (ax >= 0x3ff00000))           /* x <= -1.0 */
        {
	   ...
        }
      if (__glibc_unlikely (ax < 0x3e200000))           /* |x| < 2**-29 */
        {
          math_force_eval (two54 + x);          /* raise inexact */
          if (ax < 0x3c900000)                  /* |x| < 2**-54 */
            {
	      ...
            }
          else
            return x - x * x * 0.5;

FMA and non-FMA code sequences look similar.  Non-FMA version is slightly
faster.  Since log1p is called by asinh and atanh, it improves asinh
performance by:

        Before       After     Improvement
max     75.645       63.135       16%
min     10.074       10.071       0%
mean    15.9483      14.9089      6%

and improves atanh performance by:

        Before       After     Improvement
max     91.768       75.081       18%
min     15.548       13.883       10%
mean    18.3713      16.8011      8%

(cherry picked from commit a8ecb126d4)
2025-01-09 07:31:04 -08:00
H.J. Lu b2a45f1eee x86_64: Add expm1 with FMA
On Skylake, it improves expm1 bench performance by:

        Before       After     Improvement
max     70.204       68.054       3%
min     20.709       16.2         22%
mean    22.1221      16.7367      24%

NB: Add

extern long double __expm1l (long double);
extern long double __expm1f128 (long double);

for __typeof (__expm1l) and __typeof (__expm1f128) when __expm1 is
defined since __expm1 may be expanded in their declarations which
causes the build failure.

(cherry picked from commit 1b214630ce)
2025-01-09 07:30:51 -08:00
H.J. Lu 49016f2190 x86_64: Add log2 with FMA
On Skylake, it improves log2 bench performance by:

        Before       After     Improvement
max     208.779      63.827       69%
min     9.977        6.55         34%
mean    10.366       6.8191       34%

(cherry picked from commit f6b10ed8e9)
2025-01-09 07:30:41 -08:00
H.J. Lu 5c9be512ee x86_64: Sort fpu/multiarch/Makefile
Sort Makefile variables using scripts/sort-makefile-lines.py.

No code generation changes observed in libm.  No regressions on x86_64.

(cherry picked from commit 881546979d)
2025-01-09 07:30:32 -08:00
Florian Weimer cf06772360 x86: Avoid integer truncation with large cache sizes (bug 32470)
Some hypervisors report 1 TiB L3 cache size.  This results
in some variables incorrectly getting zeroed, causing crashes
in memcpy/memmove because invariants are violated.

(cherry picked from commit 61c3450db9)
2024-12-17 18:53:16 +01:00
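
The truncation in miniature: a 1 TiB size does not fit in 32 bits, so
narrowing it silently yields zero and breaks downstream size
calculations.

#include <stdio.h>

int
main (void)
{
  unsigned long long l3_size = 1ULL << 40;              /* 1 TiB */
  unsigned int truncated = (unsigned int) l3_size;      /* becomes 0 */
  unsigned long long kept = l3_size;                    /* stays correct */
  printf ("truncated=%u kept=%llu\n", truncated, kept);
  return 0;
}
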
Michael Jeanson 37ded328c4 nptl: initialize cpu_id_start prior to rseq registration
When adding explicit initialization of rseq fields prior to
registration, I glossed over the fact that 'cpu_id_start' is also
documented as initialized by user-space.

While current kernels don't validate the content of this field on
registration, future ones could.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
(cherry picked from commit d9f40387d3)
2024-12-06 16:01:45 +00:00
Michael Jeanson 9423cc5387 nptl: initialize rseq area prior to registration
Per the rseq syscall documentation, 3 fields are required to be
initialized by userspace prior to registration: 'cpu_id', 'rseq_cs'
and 'flags'.  Since we have no guarantee that 'struct pthread'
is cleared on all architectures, explicitly set those 3 fields prior to
registration.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 97f60abd25)
2024-12-06 16:01:45 +00:00
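
A sketch of the initialization described in the two rseq commits above
(illustrative; the nptl code initializes the area embedded in 'struct
pthread' rather than a caller-supplied one):

#include <linux/rseq.h>
#include <string.h>

static void
init_rseq_area (struct rseq *rs)
{
  memset (rs, 0, sizeof *rs);               /* rseq_cs and flags become 0 */
  rs->cpu_id = RSEQ_CPU_ID_UNINITIALIZED;   /* (__u32) -1 per the docs */
  rs->cpu_id_start = 0;                     /* also user-initialized */
}
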
Florian Weimer fa4ad10406 elf: Change ldconfig auxcache magic number (bug 32231)
In commit c628c22963 (elf: Remove
ldconfig kernel version check), the layout of auxcache entries
changed because the osversion field was removed from
struct aux_cache_file_entry.  However, AUX_CACHEMAGIC was not
changed, so existing files are still used, potentially leading
to unintended ldconfig behavior.  This commit changes AUX_CACHEMAGIC,
so that the file is regenerated.

Reported-by: DJ Delorie <dj@redhat.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 0a536f6e2f)
2024-10-29 10:00:32 +01:00
H.J. Lu 4dd8641461 Add crt1-2.0.o for glibc 2.0 compatibility tests
Starting from glibc 2.1, crt1.o contains _IO_stdin_used which is checked
by _IO_check_libio to provide binary compatibility for glibc 2.0.  Add
crt1-2.0.o for tests against glibc 2.0.  Define tests-2.0 for glibc 2.0
compatibility tests.  Add and update glibc 2.0 compatibility tests for
stderr, matherr and pthread_kill.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

(cherry picked from commit 5f245f3bfb)
2024-10-01 10:33:51 +08:00
Siddhesh Poyarekar 370be85892 libio: Attempt wide backup free only for non-legacy code
_wide_data and _mode are not available in legacy code, so do not attempt
to free the wide backup buffer in legacy code.

Resolves: BZ #32137 and BZ #27821

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit ae4d44b1d5)
2024-09-11 08:51:55 +02:00
Maciej W. Rozycki f30501ca75 nptl: Use <support/check.h> facilities in tst-setuid3
Remove local FAIL macro in favor of FAIL_EXIT1 from <support/check.h>,
which provides equivalent reporting, with the name of the file and the
line number of the failure site additionally included.  Remove
FAIL_ERR altogether and include ": %m" explicitly with the format string
supplied to FAIL_EXIT1 as there seems little value to have a separate
macro just for this.

Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 8c98195af6)
2024-08-30 15:28:42 -04:00
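
A hypothetical fragment showing the style of check the conversion moves
to; FAIL_EXIT1 reports the file and line of the failing call and "%m"
appends errno's description:

#include <support/check.h>
#include <unistd.h>
#include <sys/types.h>

static void
check_setuid (uid_t uid)
{
  if (setuid (uid) != 0)
    FAIL_EXIT1 ("setuid (%ld): %m", (long) uid);
}
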
Maciej W. Rozycki 15ca66303f posix: Use <support/check.h> facilities in tst-truncate and tst-truncate64
Remove local FAIL macro in favor of FAIL_RET from <support/check.h>,
which provides equivalent reporting, with the name of the file of the
failure site additionally included, for the tst-truncate-common core
shared between the tst-truncate and tst-truncate64 tests.

Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit fe47595504)
2024-08-30 15:28:38 -04:00
Siddhesh Poyarekar b9f72bd5de ungetc: Fix backup buffer leak on program exit [BZ #27821]
If a file descriptor is left unclosed and is cleaned up by _IO_cleanup
on exit, its backup buffer remains unfreed, registering as a leak in
valgrind.  This is not strictly an issue since (1) the program should
ideally be closing the stream once it's not in use and (2) the program
is about to exit anyway, so keeping the backup buffer around a wee bit
longer isn't a real problem.  Free it anyway to keep valgrind happy
when the streams in question are the standard ones, i.e. stdout, stdin
or stderr.

Also, the _IO_have_backup macro checks for _IO_save_base,
which is a roundabout way to check for a backup buffer instead of
directly looking for _IO_backup_base.  The roundabout check breaks when
the main get area has not been used and user pushes a char into the
backup buffer with ungetc.  Fix this to use the _IO_backup_base
directly.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 3e1d8d1d1d)
2024-08-28 16:45:25 -04:00
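
A minimal reproduction sketch for the leak scenario described above:
push a byte back before the get area has been used, then exit without
closing the stream so _IO_cleanup tears it down.

#include <stdio.h>

int
main (void)
{
  ungetc ('x', stdin);   /* lands in the backup buffer */
  return 0;              /* valgrind previously reported the buffer leaked */
}
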