Commit Graph

40385 Commits

Florian Weimer 51210d6496 nptl: PTHREAD_COND_INITIALIZER compatibility with pre-2.41 versions (bug 32786)
[BZ #25847]

The new initializer and struct layout do not initialize the
__g_signals field in the old struct layout before the change in
commit c36fc50781 ("nptl: Remove
g_refs from condition variables").  Bring back fields at the end
of struct __pthread_cond_s, so that they are again zero-initialized.

(cherry picked from commit dbc5a50d12)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:56:56 -04:00
Malte Skarupke 39a80f4035 nptl: Use all of g1_start and g_signals
[BZ #25847]

The LSB of g_signals was unused. The LSB of g1_start was used to indicate
which group is G2. This was used to always go to sleep in pthread_cond_wait
if a waiter is in G2. A comment earlier in the file says that this is not
correct to do:

 "Waiters cannot determine whether they are currently in G2 or G1 -- but they
  do not have to because all they are interested in is whether there are
  available signals"

I either would have had to update the comment, or get rid of the check. I
chose to get rid of the check. In fact I don't quite know why it was there.
There will never be available signals for group G2, so we didn't need the
special case. Even if there were, this would just be a spurious wake. This
might have caught some cases where the count has wrapped around, but it
wouldn't reliably do that, (and even if it did, why would you want to force a
sleep in that case?) and we don't support that many concurrent waiters
anyway. Getting rid of it allows us to use one more bit, making us more
robust to wraparound.

(cherry picked from commit 91bb902f58)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:56:09 -04:00
Malte Skarupke 8899e89b29 nptl: rename __condvar_quiesce_and_switch_g1
[BZ #25847]

This function no longer waits for threads to leave g1, so rename it to
__condvar_switch_g1.

(cherry picked from commit 4b79e27a50)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:56:03 -04:00
Malte Skarupke 5765653697 nptl: Fix indentation
[BZ #25847]

In my previous change I turned a nested loop into a simple loop. I'm doing
the resulting indentation changes in a separate commit to make the diff on
the previous commit easier to review.

(cherry picked from commit ee6c14ed59)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:55:48 -04:00
Malte Skarupke 6bac834c5a nptl: Use a single loop in pthread_cond_wait instead of a nested loop
[BZ #25847]

The loop was a little more complicated than necessary. There was only one
break statement out of the inner loop, and the outer loop was nearly empty.
So just remove the outer loop, moving its code to the one break statement in
the inner loop. This allows us to replace all gotos with break statements.

(cherry picked from commit 929a4764ac)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:55:40 -04:00
Malte Skarupke 7625579f11 nptl: Remove g_refs from condition variables
[BZ #25847]

This variable used to be needed to wait in group switching until all sleepers
have confirmed that they have woken. This is no longer needed. Nothing waits
on this variable so there is no need to track how many threads are currently
asleep in each group.

(cherry picked from commit c36fc50781)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:55:31 -04:00
Malte Skarupke 44eaf0615d nptl: Remove unnecessary quadruple check in pthread_cond_wait
[BZ #25847]

pthread_cond_wait was checking whether it was in a closed group no less than
four times. Checking once is enough. Here are the four checks:

1. While spin-waiting. This was dead code: maxspin is set to 0 and has been
   for years.
2. Before deciding to go to sleep, and before incrementing grefs: I kept this one.
3. After incrementing grefs. There is no reason to think that the group would
   close while we do an atomic increment. Obviously it could close at any
   point, but that doesn't mean we have to recheck after every step. This
   check was equally good as check 2, except it has to do more work.
4. When we find ourselves in a group that has a signal. We only get here after
   we check that we're not in a closed group. There is no need to check again.
   The check would only have helped in cases where the compare_exchange in the
   next line would also have failed. Relying on the compare_exchange is fine.

Removing the duplicate checks clarifies the code.

(cherry picked from commit 4f7b051f8e)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:55:25 -04:00
Malte Skarupke 1fa5e51897 nptl: Remove unnecessary catch-all-wake in condvar group switch
[BZ #25847]

This wake is unnecessary. We only switch groups after every sleeper in a group
has been woken. Sure, they may take a while to actually wake up and may still
hold a reference, but waking them a second time doesn't speed that up. Instead
this just makes the code more complicated and may hide problems.

In particular this safety wake wouldn't even have helped with the bug that was
fixed by Barrus' patch: The bug there was that pthread_cond_signal would not
switch g1 when it should, so we wouldn't even have entered this code path.

(cherry picked from commit b42cc6af11)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:55:18 -04:00
Malte Skarupke b5c4727e59 nptl: Update comments and indentation for new condvar implementation
[BZ #25847]

Some comments were wrong after the most recent commit. This fixes that.

Also fixing indentation where it was using spaces instead of tabs.

(cherry picked from commit 0cc973160c)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:55:01 -04:00
Frank Barrus 1a0d73a625 pthreads NPTL: lost wakeup fix 2
[BZ #25847]

This fixes the lost wakeup (from a bug in signal stealing) with a change
in the usage of g_signals[] in the condition variable internal state.
It also completely eliminates the concept and handling of signal stealing,
as well as the need for signalers to block to wait for waiters to wake
up every time there is a G1/G2 switch.  This greatly reduces the average
and maximum latency for pthread_cond_signal.

The g_signals[] field now contains a signal count that is relative to
the current g1_start value.  Since it is a 32-bit field, and the LSB is
still reserved (though not currently used anymore), it has a 31-bit value
that corresponds to the low 31 bits of the sequence number in g1_start.
(since g1_start also has an LSB flag, this means bits 31:1 in g_signals
correspond to bits 31:1 in g1_start, plus the current signal count)

By making the signal count relative to g1_start, there is no longer
any ambiguity or A/B/A issue, and thus any checks before blocking,
including the futex call itself, are guaranteed not to block if the G1/G2
switch occurs, even if the signal count remains the same.  This allows
initially safely blocking in G2 until the switch to G1 occurs, and
then transitioning from G1 to a new G1 or G2, and always being able to
distinguish the state change.  This removes the race condition and A/B/A
problems that otherwise occurred if a late (pre-empted) waiter were to
resume just as the futex call attempted to block on g_signal since
otherwise there was no last opportunity to re-check things like whether
the current G1 group was already closed.

By fixing these issues, the signal stealing code can be eliminated,
since there is no concept of signal stealing anymore.  The code to block
for all waiters to exit g_refs can also be removed, since any waiters
that are still in the g_refs region can be guaranteed to safely wake
up and exit.  If there are still any left at this time, they are all
sent one final futex wakeup to ensure that they are not blocked any
longer, but there is no need for the signaller to block and wait for
them to wake up and exit the g_refs region.

The signal count is then effectively "zeroed" but since it is now
relative to g1_start, this is done by advancing it to a new value that
can be observed by any pending blocking waiters.  Any late waiters can
always tell the difference, and can thus just cleanly exit if they are
in a stale G1 or G2.  They can never steal a signal from the current
G1 if they are not in the current G1, since the signal value that has
to match in the cmpxchg has the low 31 bits of the g1_start value
contained in it, and that's first checked, and then it won't match if
there's a G1/G2 change.

Note: the 31-bit sequence number used in g_signals is designed to
handle wrap-around when checking the signal count, but if the entire
31-bit wraparound (2 billion signals) occurs while there is still a
late waiter that has not yet resumed, and it happens to then match
the current g1_start low bits, and the pre-emption occurs after the
normal "closed group" checks (which are 64-bit) but then hits the
futex syscall and signal consuming code, then an A/B/A issue could
still result and cause an incorrect assumption about whether it
should block.  This particular scenario seems unlikely in practice.
Note that once awake from the futex, the waiter would notice the
closed group before consuming the signal (since that's still a 64-bit
check that would not be aliased in the wrap-around in g_signals),
so the biggest impact would be blocking on the futex until the next
full wakeup from a G1/G2 switch.

(cherry picked from commit 1db84775f8)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-07-11 13:54:25 -04:00
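
A minimal sketch of the relative signal count described above, with
invented names and a simplified layout (illustrative only, not glibc's
internal condvar code):

#include <stdint.h>
#include <stdbool.h>

/* g_signals holds the low 31 bits of g1_start plus the signal count,
   with the LSB reserved.  */
static bool
signal_available_sketch (uint32_t g_signals, uint64_t g1_start)
{
  uint32_t base = (uint32_t) g1_start & ~1u;      /* low bits of g1_start */
  uint32_t relative = (g_signals & ~1u) - base;   /* wraps harmlessly */
  /* Non-zero means a signal was posted for the current G1; a stale
     waiter from an earlier G1/G2 sees a mismatching base instead.  */
  return relative != 0;
}
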
Florian Weimer 5a6276d97a Fix error reporting (false negatives) in SGID tests
And simplify the interface of support_capture_subprogram_self_sgid.

Use the existing framework for temporary directories (now with
mode 0700) and directory/file deletion.  Handle all execution
errors within support_capture_subprogram_self_sgid.  In particular,
this includes test failures because the invoked program did not
exit with exit status zero.  Existing tests that expect exit
status 42 are adjusted to use zero instead.

In addition, fix callers not to call exit (0) with test failures
pending (which may mask them, especially when running with --direct).

Fixes commit 35fc356fa3
("elf: Fix subprocess status handling for tst-dlopen-sgid (bug 32987)").

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 3a3fb2ed83)
2025-06-20 11:12:24 +02:00
Florian Weimer 81f58dd9b7 support: Pick group in support_capture_subprogram_self_sgid if UID == 0
When running as root, it is likely that we can run under any group.
Pick a harmless group from /etc/group in this case.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 2f769cec44)
2025-06-20 11:10:41 +02:00
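
A rough sketch of the group selection described above (illustrative
only; the actual support/ helper differs): walk /etc/group with getgrent
and pick the first non-root group.

#include <grp.h>
#include <sys/types.h>
#include <stddef.h>

static gid_t
pick_harmless_group (void)
{
  gid_t gid = (gid_t) -1;
  struct group *gr;
  setgrent ();
  while ((gr = getgrent ()) != NULL)
    if (gr->gr_gid != 0)
      {
        gid = gr->gr_gid;
        break;
      }
  endgrent ();
  return gid;
}
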
Florian Weimer ca7e32d024 elf: Fix subprocess status handling for tst-dlopen-sgid (bug 32987)
This should really move into support_capture_subprogram_self_sgid.

Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit 35fc356fa3)
2025-05-21 08:57:17 +02:00
Sunil K Pandey ca41fe44a5 x86_64: Fix typo in ifunc-impl-list.c.
Fix wcsncpy and wcpncpy typo in ifunc-impl-list.c.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit f2aeb6ff94)
2025-05-20 16:49:58 -07:00
Florian Weimer 31fa0f73e2 elf: Test case for bug 32976 (CVE-2025-4802)
Check that LD_LIBRARY_PATH is ignored for AT_SECURE statically
linked binaries, using support_capture_subprogram_self_sgid.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit d8f7a79335)
2025-05-20 20:26:07 +02:00
Florian Weimer 4335cd9b58 support: Add support_record_failure_barrier
This can be used to stop execution after a TEST_COMPARE_BLOB
failure, for example.

(cherry picked from commit d0b8aa6de4)
2025-05-20 20:26:07 +02:00
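
A hypothetical usage fragment for the new helper, assuming it is
declared in <support/check.h> alongside the TEST_COMPARE_BLOB macro:

#include <support/check.h>
#include <string.h>

static const char expected[] = "hello";

static int
do_test (void)
{
  char actual[6];
  memcpy (actual, "hello", 6);
  TEST_COMPARE_BLOB (actual, sizeof actual, expected, sizeof expected);
  /* Stop here if the comparison recorded a failure, so later checks do
     not run against inconsistent data.  */
  support_record_failure_barrier ();
  /* ... further checks that assume the blob matched ... */
  return 0;
}
#include <support/test-driver.c>
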
Florian Weimer 454f24e981 support: Use const char * argument in support_capture_subprogram_self_sgid
The function does not modify the passed-in string, so make this clear
via the prototype.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit f0c09fe616)
2025-05-20 20:26:07 +02:00
Adhemerval Zanella 3be3728df2 elf: Ignore LD_LIBRARY_PATH and debug env var for setuid for static
It mimics the ld.so behavior.

Checked on x86_64-linux-gnu.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>

(cherry picked from commit 5451fa962c)

Changes:

	git/elf/dl-support.c
	  (missing commit 55f41ef8de
	   ("elf: Remove LD_PROFILE for static binaries"))
2025-05-20 20:25:57 +02:00
Wilco Dijkstra 5a08d049dc math: Improve layout of exp/exp10 data
GCC aligns global data to 16 bytes if their size is >= 16 bytes.  This patch
changes the exp_data struct slightly so that the fields are better aligned
and without gaps.  As a result on targets that support them, more load-pair
instructions are used in exp.

The exp benchmark improves by 2.5%, "144bits" by 7.2%, "768bits" by 12.7% on
Neoverse V2.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 5afaf99edb)
2025-02-28 14:52:00 +00:00
Wilco Dijkstra 097299ffa9 AArch64: Use prefer_sve_ifuncs for SVE memset
Use prefer_sve_ifuncs for SVE memset just like memcpy.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 0f044be1da)
2025-02-28 14:44:25 +00:00
Wilco Dijkstra 52c2b1556f AArch64: Add SVE memset
Add SVE memset based on the generic memset with predicated load for sizes < 16.
Unaligned memsets of 128-1024 are improved by ~20% on average by using aligned
stores for the last 64 bytes.  Performance of random memset benchmark improves
by ~2% on Neoverse V1.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 163b1bbb76)
2025-02-28 14:44:23 +00:00
Wilco Dijkstra 3de5112326 math: Improve layout of expf data
GCC aligns global data to 16 bytes if their size is >= 16 bytes.  This patch
changes the exp2f_data struct slightly so that the fields are better aligned.
As a result on targets that support them, load-pair instructions accessing
poly_scaled and invln2_scaled are now 16-byte aligned.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 44fa9c1080)
2025-02-28 14:42:29 +00:00
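
The idea behind both data-layout changes above, sketched with a
hypothetical struct (field names and contents are illustrative, not the
actual exp2f_data or exp_data definitions):

/* With a total size >= 16 bytes, GCC places the global at a 16-byte
   boundary; keeping fields that are loaded together adjacent and free
   of padding lets targets with load-pair instructions fetch them with
   aligned paired loads.  */
struct expf_consts_sketch
{
  double invln2_scaled;      /* loaded together with the next field */
  double shift;
  double poly_scaled[4];     /* starts 16-byte aligned, no gaps */
};
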
Wilco Dijkstra 5fe151d86a AArch64: Remove zva_128 from memset
Remove ZVA 128 support from memset - the new memset no longer
guarantees count >= 256, which can result in underflow and a
crash if ZVA size is 128 ([1]).  Since only one CPU uses a ZVA
size of 128 and its memcpy implementation was removed in commit
e162ab2bf1, remove this special
case too.

[1] https://sourceware.org/pipermail/libc-alpha/2024-November/161626.html

Reviewed-by: Andrew Pinski <quic_apinski@quicinc.com>
(cherry picked from commit a08d9a52f9)
2025-02-28 14:42:29 +00:00
Wilco Dijkstra 95aa21432c AArch64: Optimize memset
Improve small memsets by avoiding branches and use overlapping stores.
Use DC ZVA for copies over 128 bytes.  Remove unnecessary code for ZVA sizes
other than 64 and 128.  Performance of random memset benchmark improves by 24%
on Neoverse N1.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit cec3aef324)
2025-02-28 14:42:27 +00:00
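
The overlapping-store trick mentioned above, sketched in C for a single
size class (the real implementation is AArch64 assembly; names here are
illustrative):

#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Assumes 8 <= n <= 16: two possibly overlapping 8-byte stores cover
   the whole range without branching on the exact length.  */
static void
memset_small_sketch (void *dst, int c, size_t n)
{
  uint64_t v = 0x0101010101010101ULL * (uint8_t) c;
  memcpy (dst, &v, 8);                      /* first 8 bytes */
  memcpy ((char *) dst + n - 8, &v, 8);     /* last 8 bytes, may overlap */
}
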
Wilco Dijkstra 9ca74b8ad1 AArch64: Improve generic strlen
Improve performance by handling another 16 bytes before entering the loop.
Use ADDHN in the loop to avoid SHRN+FMOV when it terminates.  Change final
size computation to avoid increasing latency.  On Neoverse V1 performance
of the random strlen benchmark improves by 4.6%.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 3dc426b642)
2025-02-28 14:41:56 +00:00
Siddhesh Poyarekar f984e2d7e8 assert: Add test for CVE-2025-0395
Use the __progname symbol to override the program name to induce the
failure that CVE-2025-0395 describes.

This is related to BZ #32582

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit cdb9ba8419)
2025-02-13 13:02:23 -05:00
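
A hedged sketch of the approach described above (the exact string length
and the harness details of the real test differ):

#include <assert.h>
#include <string.h>

extern char *__progname;

static char long_name[1024];

static int
do_test (void)
{
  memset (long_name, 'A', sizeof long_name - 1);
  __progname = long_name;    /* the assert message embeds __progname */
  assert (1 == 2);           /* triggers the affected message path */
  return 1;                  /* not reached */
}
#include <support/test-driver.c>
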
H.J. Lu 650a0aaaff stdlib: Test using setenv with updated environ [BZ #32588]
Add a test for setenv with updated environ.  Verify that BZ #32588 is
fixed.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 8ab34497de)
2025-01-25 10:30:58 +08:00
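
One plausible shape of the scenario being tested (a simplified sketch,
not the actual test in the tree):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main (void)
{
  /* Replace environ with an application-provided array, then call
     setenv, which must cope with the updated pointer.  */
  static char *new_environ[] = { NULL };
  environ = new_environ;
  if (setenv ("TEST_VAR", "value", 1) != 0)
    {
      perror ("setenv");
      return 1;
    }
  printf ("TEST_VAR=%s\n", getenv ("TEST_VAR"));
  return 0;
}
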
Siddhesh Poyarekar c32fd59314 Fix underallocation of abort_msg_s struct (CVE-2025-0395)
Include the space needed to store the length of the message itself, in
addition to the message string.  This resolves BZ #32582.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 68ee0f704c)
2025-01-22 17:10:37 +01:00
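
The sizing issue in a nutshell, sketched with the layout the fix
describes (illustrative; abort_msg_s is internal to glibc):

#include <string.h>
#include <stddef.h>

struct abort_msg_s
{
  unsigned int size;   /* length bookkeeping stored with the message */
  char msg[];
};

/* Correct allocation: header (including the size field) plus the
   message text and its terminating NUL, not just the string length.  */
static size_t
abort_msg_alloc_size (const char *str)
{
  return sizeof (struct abort_msg_s) + strlen (str) + 1;
}
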
Florian Weimer 549e7f7c5a elf: Support recursive use of dynamic TLS in interposed malloc
It turns out that quite a few applications use bundled mallocs that
have been built to use global-dynamic TLS (instead of the recommended
initial-exec TLS).  The previous workaround from
commit afe42e935b ("elf: Avoid some
free (NULL) calls in _dl_update_slotinfo") does not fix all
encountered cases, unfortunately.

This change avoids the TLS generation update for recursive use
of TLS from a malloc that was called during a TLS update.  This
is possible because an interposed malloc has a fixed module ID and
TLS slot.  (It cannot be unloaded.)  If an initially-loaded module ID
is encountered in __tls_get_addr and the dynamic linker is already
in the middle of a TLS update, use the outdated DTV, thus avoiding
another call into malloc.  It's still necessary to update the
DTV to the most recent generation, to get out of the slow path,
which is why the check for recursion is needed.

The bookkeeping is done using a global counter instead of a per-thread
flag because TLS access in the dynamic linker is tricky.

All this will go away once the dynamic linker stops using malloc
for TLS, likely as part of a change that pre-allocates all TLS
during pthread_create/dlopen.

Fixes commit d2123d6827 ("elf: Fix slow
tls access after dlopen [BZ #19924]").

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 018f0fc3b8)
2025-01-10 16:49:38 -08:00
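
A conceptual sketch of the recursion guard described above; the helper
functions and the counter name are hypothetical stand-ins for the
dynamic linker's internals:

#include <stddef.h>

extern void *lookup_in_current_dtv (size_t module_id, size_t offset);
extern void update_dtv_to_current_generation (void);   /* may call malloc */

static unsigned int tls_update_in_progress;   /* global, not per-thread */

void *
tls_get_addr_slow_sketch (size_t module_id, size_t offset)
{
  if (tls_update_in_progress > 0)
    /* Re-entered from an interposed malloc during a DTV update: the
       malloc's module ID and slot are fixed, so the outdated DTV is
       still valid for it; skip the update to avoid calling malloc.  */
    return lookup_in_current_dtv (module_id, offset);

  ++tls_update_in_progress;
  update_dtv_to_current_generation ();
  --tls_update_in_progress;
  return lookup_in_current_dtv (module_id, offset);
}
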
Florian Weimer 48642ef1a5 elf: Avoid some free (NULL) calls in _dl_update_slotinfo
This has been confirmed to work around some interposed mallocs.  Here
is a discussion of the impact test ust/libc-wrapper/test_libc-wrapper
in lttng-tools:

  New TLS usage in libgcc_s.so.1, compatibility impact
  <https://inbox.sourceware.org/libc-alpha/8734v1ieke.fsf@oldenburg.str.redhat.com/>

Reportedly, this patch also papers over a similar issue when tcmalloc
2.9.1 is not compiled with -ftls-model=initial-exec.  Of course the
goal really should be to compile mallocs with the initial-exec TLS
model, but this commit appears to be a useful interim workaround.

Fixes commit d2123d6827 ("elf: Fix slow
tls access after dlopen [BZ #19924]").

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit afe42e935b)
2025-01-10 16:48:26 -08:00
Noah Goldstein 12fec8aae5 x86/string: Fixup alignment of main loop in str{n}cmp-evex [BZ #32212]
The loop should be aligned to 32 bytes so that it can ideally run out
of the DSB. This is particularly important on Skylake-Server, where
deficiencies in its DSB implementation make it prone to not being
able to run loops out of the DSB.

For example running strcmp-evex on 200Mb string:

32-byte aligned loop:
    - 43,399,578,766      idq.dsb_uops
not 32-byte aligned loop:
    - 6,060,139,704       idq.dsb_uops

This results in a 25% performance degradation for the non-aligned
version.

The fix is to just ensure the code layout is such that the loop is
aligned. (Which was previously the case but was accidentally dropped
in 84e7c46df).

NB: The fix was actually 64-byte alignment. This is because 64-byte
alignment generally produces more stable performance than 32-byte
aligned code (cache line crosses can affect perf), so if we are going
past 16-byte alignment, might as well go to 64. 64-byte alignment
also matches most other functions we over-align, so it creates a
common point of optimization.

Times are reported as ratio of Time_With_Patch /
Time_Without_Patch. Lower is better.

The values being reported is the geometric mean of the ratio across
all tests in bench-strcmp and bench-strncmp.

Note this patch is only attempting to improve the Skylake-Server
strcmp for long strings. The rest of the numbers are only to test for
regressions.

Tigerlake Results Strings <= 512:
    strcmp : 1.026
    strncmp: 0.949

Tigerlake Results Strings > 512:
    strcmp : 0.994
    strncmp: 0.998

Skylake-Server Results Strings <= 512:
    strcmp : 0.945
    strncmp: 0.943

Skylake-Server Results Strings > 512:
    strcmp : 0.778
    strncmp: 1.000

The 2.6% regression on TGL-strcmp is due to slowdowns caused by
changes in alignment of code handling small sizes (most on the
page-cross logic). These should be safe to ignore because 1) We
previously only 16-byte aligned the function so this behavior is not
new and was essentially up to chance before this patch and 2) this
type of alignment related regression on small sizes really only comes
up in tight micro-benchmark loops and is unlikely to have any effect
on real-world performance.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 483443d321)
2025-01-09 17:23:28 -08:00
Noah Goldstein 04b8d48432 x86: Improve large memset perf with non-temporal stores [RHEL-29312]
Previously we used `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
for data using different strategies for large memset on ICX and SKX.

Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But, the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to only rely on `rep stosb` for large memsets. On the flip
side, non-temporal writes can avoid the data read in their RFO requests,
saving memory bandwidth.

All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.

The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:

Geometric Mean across the suite New / Old EVEX256: 0.979
Geometric Mean across the suite New / Old EVEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979

Most of the cases are essentially unchanged; this is mostly to show
that adding the non-temporal case didn't add any regressions to the
other cases.

The results on the memset-large benchmark suite on TGL-client for N=20
runs:

Geometric Mean across the suite New / Old EVEX256: 0.926
Geometric Mean across the suite New / Old EVEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924

So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth so
saving bandwidth on RFOs is less impactful), but still advantageous.

Full test-suite passes on x86_64 w/ and w/o multiarch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 5bf0ab8057)
2025-01-09 17:23:28 -08:00
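
The strategy in plain C, using SSE2 non-temporal store intrinsics
(illustrative only; the real routine is hand-written assembly with
different dispatch thresholds):

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Assumes dst is 16-byte aligned and n is a multiple of 64.  */
static void
memset_large_nt_sketch (void *dst, int c, size_t n)
{
  __m128i v = _mm_set1_epi8 ((char) c);
  char *p = dst;
  for (size_t i = 0; i < n; i += 64)
    {
      _mm_stream_si128 ((__m128i *) (p + i), v);
      _mm_stream_si128 ((__m128i *) (p + i + 16), v);
      _mm_stream_si128 ((__m128i *) (p + i + 32), v);
      _mm_stream_si128 ((__m128i *) (p + i + 48), v);
    }
  _mm_sfence ();   /* order the non-temporal stores before later accesses */
}
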
Gabi Falk dc1762113d x86_64: Fix missing wcsncat function definition without multiarch (x86-64-v4)
This code expects the WCSCAT preprocessor macro to be predefined in case
the evex implementation of the function should be defined with a name
different from __wcsncat_evex.  However, when glibc is built for
x86-64-v4 without multiarch support, sysdeps/x86_64/wcsncat.S defines
the WCSNCAT variable instead of WCSCAT to build it as wcsncat.  Rename the
variable to WCSNCAT, as it is actually a better naming choice for the
variable in this case.

Reported-by: Kenton Groombridge
Link: https://bugs.gentoo.org/921945
Fixes: 64b8b6516b ("x86: Add evex optimized functions for the wchar_t strcpy family")
Signed-off-by: Gabi Falk <gabifalk@gmx.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit dd5f891c1a)
2025-01-09 17:23:28 -08:00
H.J. Lu 0d14bf0754 sysdeps/x86/Makefile: Split and sort tests
Put each test on a separate line and sort tests.

(cherry picked from commit 7e03e0de7e)
2025-01-09 17:23:28 -08:00
Noah Goldstein 5a64f93365 x86: Only align destination to 1x VEC_SIZE in memset 4x loop
Current code aligns to 2x VEC_SIZE. Aligning to 2x has no effect on
performance other than potentially resulting in an additional
iteration of the loop.
1x maintains aligned stores (the only reason to align in this case)
and doesn't incur any unnecessary loop iterations.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>

(cherry picked from commit 9469261cf1)
2025-01-09 17:23:28 -08:00
Szabolcs Nagy 7772f9358c elf: Fix slow tls access after dlopen [BZ #19924]
In short: __tls_get_addr checks the global generation counter and if
the current dtv is older then _dl_update_slotinfo updates dtv up to the
generation of the accessed module. So if the global generation is newer
than generation of the module then __tls_get_addr keeps hitting the
slow dtv update path. The dtv update path includes a number of checks
to see if any update is needed and this already causes measurable tls
access slow down after dlopen.

It may be possible to detect up-to-date dtv faster.  But if there are
many modules loaded (> TLS_SLOTINFO_SURPLUS) then this requires at
least walking the slotinfo list.

This patch tries to update the dtv to the global generation instead, so
after a dlopen the tls access slow path is only hit once.  The modules
with larger generation than the accessed one were not necessarily
synchronized before, so additional synchronization is needed.

This patch uses acquire/release synchronization when accessing the
generation counter.

Note: in the x86_64 version of dl-tls.c the generation is only loaded
once, since relaxed mo is not faster than acquire mo load.

I have not benchmarked this. Tested by Adhemerval Zanella on aarch64,
powerpc, sparc, x86 who reported that it fixes the performance issue
of bug 19924.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit d2123d6827)
2025-01-09 07:31:25 -08:00
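
The fast-path check described above, sketched with hypothetical names
(the per-thread DTV bookkeeping and the actual update routine are more
involved):

#include <stdatomic.h>
#include <stddef.h>

extern _Atomic size_t global_tls_generation;   /* hypothetical stand-in */
extern size_t dtv_generation;                  /* per-thread in reality */
extern void *lookup_in_dtv (void);             /* hypothetical */
extern void update_dtv_to (size_t generation); /* hypothetical */

void *
tls_access_sketch (void)
{
  size_t gen = atomic_load_explicit (&global_tls_generation,
                                     memory_order_acquire);
  if (dtv_generation != gen)
    /* Update all the way to the global generation so that later
       accesses after this dlopen take the fast path again.  */
    update_dtv_to (gen);
  return lookup_in_dtv ();
}
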
H.J. Lu 58822f954f x86: Check the lower byte of EAX of CPUID leaf 2 [BZ #30643]
The old Intel software developer manual specified that the low byte of
EAX of CPUID leaf 2 returned 1, which indicated the number of rounds of
CPUID leaf 2 needed to retrieve the complete cache information.  The
newer Intel manual has been changed to say that it should always return 1
and be ignored.  If the lower byte isn't 1, CPUID leaf 2 can't be used.
In this case, we ignore CPUID leaf 2 and use CPUID leaf 4 instead.  If
CPUID leaf 4 doesn't contain the cache information, cache information
isn't available at all.  This addresses BZ #30643.

(cherry picked from commit 1493622f4f)
2025-01-09 07:31:17 -08:00
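
The check in isolation, using GCC's <cpuid.h> (a sketch of the idea, not
the code in sysdeps/x86):

#include <cpuid.h>
#include <stdbool.h>

/* CPUID leaf 2 is trusted only when the low byte of EAX is 1;
   otherwise cache information must come from leaf 4.  */
static bool
cpuid_leaf2_usable (void)
{
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid (2, &eax, &ebx, &ecx, &edx))
    return false;
  return (eax & 0xff) == 1;
}
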
H.J. Lu c92946d9b2 x86_64: Add log1p with FMA
On Skylake, it changes log1p bench performance by:

        Before       After     Improvement
max     63.349       58.347       8%
min     4.448        5.651        -30%
mean    12.0674      10.336       14%

The minimum code path is

 if (hx < 0x3FDA827A)                          /* x < 0.41422  */
    {
      if (__glibc_unlikely (ax >= 0x3ff00000))           /* x <= -1.0 */
        {
	   ...
        }
      if (__glibc_unlikely (ax < 0x3e200000))           /* |x| < 2**-29 */
        {
          math_force_eval (two54 + x);          /* raise inexact */
          if (ax < 0x3c900000)                  /* |x| < 2**-54 */
            {
	      ...
            }
          else
            return x - x * x * 0.5;

FMA and non-FMA code sequences look similar.  Non-FMA version is slightly
faster.  Since log1p is called by asinh and atanh, it improves asinh
performance by:

        Before       After     Improvement
max     75.645       63.135       16%
min     10.074       10.071       0%
mean    15.9483      14.9089      6%

and improves atanh performance by:

        Before       After     Improvement
max     91.768       75.081       18%
min     15.548       13.883       10%
mean    18.3713      16.8011      8%

(cherry picked from commit a8ecb126d4)
2025-01-09 07:31:04 -08:00
H.J. Lu b2a45f1eee x86_64: Add expm1 with FMA
On Skylake, it improves expm1 bench performance by:

        Before       After     Improvement
max     70.204       68.054       3%
min     20.709       16.2         22%
mean    22.1221      16.7367      24%

NB: Add

extern long double __expm1l (long double);
extern long double __expm1f128 (long double);

for __typeof (__expm1l) and __typeof (__expm1f128) when __expm1 is
defined since __expm1 may be expanded in their declarations which
causes the build failure.

(cherry picked from commit 1b214630ce)
2025-01-09 07:30:51 -08:00
H.J. Lu 49016f2190 x86_64: Add log2 with FMA
On Skylake, it improves log2 bench performance by:

        Before       After     Improvement
max     208.779      63.827       69%
min     9.977        6.55         34%
mean    10.366       6.8191       34%

(cherry picked from commit f6b10ed8e9)
2025-01-09 07:30:41 -08:00
H.J. Lu 5c9be512ee x86_64: Sort fpu/multiarch/Makefile
Sort Makefile variables using scripts/sort-makefile-lines.py.

No code generation changes observed in libm.  No regressions on x86_64.

(cherry picked from commit 881546979d)
2025-01-09 07:30:32 -08:00
Florian Weimer cf06772360 x86: Avoid integer truncation with large cache sizes (bug 32470)
Some hypervisors report 1 TiB L3 cache size.  This results
in some variables incorrectly getting zeroed, causing crashes
in memcpy/memmove because invariants are violated.

(cherry picked from commit 61c3450db9)
2024-12-17 18:53:16 +01:00
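
The truncation in miniature: a 1 TiB size does not fit in 32 bits, so
narrowing it silently yields zero and breaks downstream size
calculations.

#include <stdio.h>

int
main (void)
{
  unsigned long long l3_size = 1ULL << 40;              /* 1 TiB */
  unsigned int truncated = (unsigned int) l3_size;      /* becomes 0 */
  unsigned long long kept = l3_size;                    /* stays correct */
  printf ("truncated=%u kept=%llu\n", truncated, kept);
  return 0;
}
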
Michael Jeanson 37ded328c4 nptl: initialize cpu_id_start prior to rseq registration
When adding explicit initialization of rseq fields prior to
registration, I glossed over the fact that 'cpu_id_start' is also
documented as initialized by user-space.

While current kernels don't validate the content of this field on
registration, future ones could.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
(cherry picked from commit d9f40387d3)
2024-12-06 16:01:45 +00:00
Michael Jeanson 9423cc5387 nptl: initialize rseq area prior to registration
Per the rseq syscall documentation, 3 fields are required to be
initialized by userspace prior to registration: 'cpu_id', 'rseq_cs'
and 'flags'.  Since we have no guarantee that 'struct pthread'
is cleared on all architectures, explicitly set those 3 fields prior to
registration.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 97f60abd25)
2024-12-06 16:01:45 +00:00
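
A sketch of the initialization described in the two rseq commits above
(illustrative; the nptl code initializes the area embedded in 'struct
pthread' rather than a caller-supplied one):

#include <linux/rseq.h>
#include <string.h>

static void
init_rseq_area (struct rseq *rs)
{
  memset (rs, 0, sizeof *rs);               /* rseq_cs and flags become 0 */
  rs->cpu_id = RSEQ_CPU_ID_UNINITIALIZED;   /* (__u32) -1 per the docs */
  rs->cpu_id_start = 0;                     /* also user-initialized */
}
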
Florian Weimer fa4ad10406 elf: Change ldconfig auxcache magic number (bug 32231)
In commit c628c22963 (elf: Remove
ldconfig kernel version check), the layout of auxcache entries
changed because the osversion field was removed from
struct aux_cache_file_entry.  However, AUX_CACHEMAGIC was not
changed, so existing files are still used, potentially leading
to unintended ldconfig behavior.  This commit changes AUX_CACHEMAGIC,
so that the file is regenerated.

Reported-by: DJ Delorie <dj@redhat.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
(cherry picked from commit 0a536f6e2f)
2024-10-29 10:00:32 +01:00
H.J. Lu 4dd8641461 Add crt1-2.0.o for glibc 2.0 compatibility tests
Starting from glibc 2.1, crt1.o contains _IO_stdin_used which is checked
by _IO_check_libio to provide binary compatibility for glibc 2.0.  Add
crt1-2.0.o for tests against glibc 2.0.  Define tests-2.0 for glibc 2.0
compatibility tests.  Add and update glibc 2.0 compatibility tests for
stderr, matherr and pthread_kill.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

(cherry picked from commit 5f245f3bfb)
2024-10-01 10:33:51 +08:00
Siddhesh Poyarekar 370be85892 libio: Attempt wide backup free only for non-legacy code
_wide_data and _mode are not available in legacy code, so do not attempt
to free the wide backup buffer in legacy code.

Resolves: BZ #32137 and BZ #27821

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit ae4d44b1d5)
2024-09-11 08:51:55 +02:00
Maciej W. Rozycki f30501ca75 nptl: Use <support/check.h> facilities in tst-setuid3
Remove local FAIL macro in favor of FAIL_EXIT1 from <support/check.h>,
which provides equivalent reporting, with the name of the file and the
line number of the failure site additionally included.  Remove
FAIL_ERR altogether and include ": %m" explicitly with the format string
supplied to FAIL_EXIT1 as there seems little value to have a separate
macro just for this.

Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 8c98195af6)
2024-08-30 15:28:42 -04:00
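
A hypothetical fragment showing the style of check the conversion moves
to; FAIL_EXIT1 reports the file and line of the failing call and "%m"
appends errno's description:

#include <support/check.h>
#include <unistd.h>
#include <sys/types.h>

static void
check_setuid (uid_t uid)
{
  if (setuid (uid) != 0)
    FAIL_EXIT1 ("setuid (%ld): %m", (long) uid);
}
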
Maciej W. Rozycki 15ca66303f posix: Use <support/check.h> facilities in tst-truncate and tst-truncate64
Remove local FAIL macro in favor of FAIL_RET from <support/check.h>,
which provides equivalent reporting, with the name of the file of the
failure site additionally included, for the tst-truncate-common core
shared between the tst-truncate and tst-truncate64 tests.

Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit fe47595504)
2024-08-30 15:28:38 -04:00
Siddhesh Poyarekar b9f72bd5de ungetc: Fix backup buffer leak on program exit [BZ #27821]
If a file descriptor is left unclosed and is cleaned up by _IO_cleanup
on exit, its backup buffer remains unfreed, registering as a leak in
valgrind.  This is not strictly an issue since (1) the program should
ideally be closing the stream once it's not in use and (2) the program
is about to exit anyway, so keeping the backup buffer around a wee bit
longer isn't a real problem.  Free it anyway to keep valgrind happy
when the streams in question are the standard ones, i.e. stdout, stdin
or stderr.

Also, the _IO_have_backup macro checks for _IO_save_base,
which is a roundabout way to check for a backup buffer instead of
directly looking for _IO_backup_base.  The roundabout check breaks when
the main get area has not been used and user pushes a char into the
backup buffer with ungetc.  Fix this to use the _IO_backup_base
directly.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 3e1d8d1d1d)
2024-08-28 16:45:25 -04:00
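
A minimal reproduction sketch for the leak scenario described above:
push a byte back before the get area has been used, then exit without
closing the stream so _IO_cleanup tears it down.

#include <stdio.h>

int
main (void)
{
  ungetc ('x', stdin);   /* lands in the backup buffer */
  return 0;              /* valgrind previously reported the buffer leaked */
}
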