glibc

Commit Graph

Author	SHA1	Message	Date
Florian Weimer	e4bd849a5d	debug: Fix tst-longjmp_chk3 build failure on Hurd Explicitly include <unistd.h> for _exit and getpid. (cherry picked from commit `4836a9af89`)	2025-08-19 09:01:21 -07:00
Florian Weimer	4db5358c9c	debug: Wire up tst-longjmp_chk3 The test was added in commit `ac8cc9e300` without all the required Makefile scaffolding. Tweak the test so that it actually builds (including with dynamic SIGSTKSZ). Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `4b7cfcc3fb`)	2025-08-19 08:54:28 -07:00
Adhemerval Zanella	1fc2002d60	debug: Adapt fortify tests to libsupport Checked on aarch64, armhf, x86_64, and i686. Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org> (cherry picked from commit `9556acd249`)	2025-08-19 08:28:19 -07:00
H.J. Lu	9bb20abcaa	x86-64: Save APX registers in ld.so trampoline Add APX registers to STATE_SAVE_MASK so that APX registers are saved in ld.so trampoline. This fixes BZ #31371. Also update STATE_SAVE_OFFSET and STATE_SAVE_MASK for i386 which will be used by i386 _dl_tlsdesc_dynamic. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com> (cherry picked from commit `dfb05f8e70`)	2025-08-19 07:37:54 -07:00
Adhemerval Zanella	768103fa49	i386: Remove CET support CET is only support for x86_64, this patch reverts: - `faaee1f07e` x86: Support shadow stack pointer in setjmp/longjmp. - `be9ccd27c0` i386: Add _CET_ENDBR to indirect jump targets in add_n.S/sub_n.S - `c02695d776` x86/CET: Update vfork to prevent child return - `5d844e1b72` i386: Enable CET support in ucontext functions - `124bcde683` x86: Add _CET_ENDBR to functions in crti.S - `562837c002` x86: Add _CET_ENDBR to functions in dl-tlsdesc.S - `f753fa7dea` x86: Support IBT and SHSTK in Intel CET [BZ #21598] - `825b58f3fb` i386-mcount.S: Add _CET_ENDBR to _mcount and __fentry__ - `7e119cd582` i386: Use _CET_NOTRACK in i686/memcmp.S - `177824e232` i386: Use _CET_NOTRACK in memcmp-sse4.S - `0a899af097` i386: Use _CET_NOTRACK in memcpy-ssse3-rep.S - `7fb613361c` i386: Use _CET_NOTRACK in memcpy-ssse3.S - `77a8ae0948` i386: Use _CET_NOTRACK in memset-sse2-rep.S - `00e7b76a8f` i386: Use _CET_NOTRACK in memset-sse2.S - `90d15dc577` i386: Use _CET_NOTRACK in strcat-sse2.S - `f1574581c7` i386: Use _CET_NOTRACK in strcpy-sse2.S - `4031d7484a` i386/sub_n.S: Add a missing _CET_ENDBR to indirect jump - target - Checked on i686-linux-gnu. (cherry picked from commit `25f1e16ef0`)	2025-08-19 06:29:04 -07:00
H.J. Lu	c69b88fc71	x86-64: Add GLIBC_ABI_DT_X86_64_PLT [BZ #33212 ] When the linker -z mark-plt option is used to add DT_X86_64_PLT, DT_X86_64_PLTSZ and DT_X86_64_PLTENT, the r_addend field of the R_X86_64_JUMP_SLOT relocation stores the offset of the indirect branch instruction. However, glibc versions without the commit: commit `f8587a6189` Author: H.J. Lu <hjl.tools@gmail.com> Date: Fri May 20 19:21:48 2022 -0700 x86-64: Ignore r_addend for R_X86_64_GLOB_DAT/R_X86_64_JUMP_SLOT According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT and R_X86_64_JUMP_SLOT. Since linkers always set their r_addends to 0, we can ignore their r_addends. Reviewed-by: Fangrui Song <maskray@google.com> won't ignore the r_addend value in the R_X86_64_JUMP_SLOT relocation. Such programs and shared libraries will fail at run-time randomly. Add GLIBC_ABI_DT_X86_64_PLT version to indicate that glibc is compatible with DT_X86_64_PLT. The linker can add the glibc GLIBC_ABI_DT_X86_64_PLT version dependency whenever -z mark-plt is passed to the linker. The resulting programs and shared libraries will fail to load at run-time against libc.so without the GLIBC_ABI_DT_X86_64_PLT version, instead of fail randomly. This fixes BZ #33212. Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: Sam James <sam@gentoo.org> (cherry picked from commit `399384e0c8`)	2025-08-15 22:42:29 +01:00
Florian Weimer	1bdce25455	elf: Handle ld.so with LOAD segment gaps in _dl_find_object (bug 31943) Detect if ld.so not contiguous and handle that case in _dl_find_object. Set l_find_object_processed even for initially loaded link maps, otherwise dlopen of an initially loaded object adds it to _dlfo_loaded_mappings (where maps are expected to be contiguous), in addition to _dlfo_nodelete_mappings. Test elf/tst-link-map-contiguous-ldso iterates over the loader image, reading every word to make sure memory is actually mapped. It only does that if the l_contiguous flag is set for the link map. Otherwise, it finds gaps with mmap and checks that _dl_find_object does not return the ld.so mapping for them. The test elf/tst-link-map-contiguous-main does the same thing for the libc.so shared object. This only works if the kernel loaded the main program because the glibc dynamic loader may fill the gaps with PROT_NONE mappings in some cases, making it contiguous, but accesses to individual words may still fault. Test elf/tst-link-map-contiguous-libc is again slightly different because the dynamic loader always fills the gaps with PROT_NONE mappings, so a different form of probing has to be used. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `20681be149`)	2025-08-14 09:08:15 +02:00
Florian Weimer	c35196339c	elf: Extract rtld_setup_phdr function from dl_main Remove historic binutils reference from comment and update how this data is used by applications. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `2cac9559e0`)	2025-08-14 09:07:44 +02:00
Florian Weimer	93fa28a752	elf: Do not add a copy of _dl_find_object to libc.so This reduces code size and dependencies on ld.so internals from libc.so. Fixes commit `f4c142bb9f` ("arm: Use _dl_find_object on __gnu_Unwind_Find_exidx (BZ 31405)"). Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `96429bcc91`)	2025-08-14 09:07:44 +02:00
Adhemerval Zanella	40aa3ade66	arm: Use _dl_find_object on __gnu_Unwind_Find_exidx (BZ 31405) Instead of __dl_iterate_phdr. On ARM dlfo_eh_frame/dlfo_eh_count maps to PT_ARM_EXIDX vaddr start / length. On a Neoverse N1 machine with 160 cores, the following program: $ cat test.c #include <stdlib.h> #include <pthread.h> #include <assert.h> enum { niter = 1024, ntimes = 128, }; static void * tf (void arg) { int a = (int) arg; for (int i = 0; i < niter; i++) { void p[ntimes]; for (int j = 0; j < ntimes; j++) p[j] = malloc (a * 128); for (int j = 0; j < ntimes; j++) free (p[j]); } return NULL; } int main (int argc, char argv[]) { enum { nthreads = 16 }; pthread_t t[nthreads]; for (int i = 0; i < nthreads; i ++) assert (pthread_create (&t[i], NULL, tf, (void ) i) == 0); for (int i = 0; i < nthreads; i++) { void *r; assert (pthread_join (t[i], &r) == 0); assert (r == NULL); } return 0; } $ arm-linux-gnueabihf-gcc -fsanitize=address test.c -o test Improves from ~15s to 0.5s. Checked on arm-linux-gnueabihf. (cherry picked from commit `f4c142bb9f`)	2025-08-14 09:07:33 +02:00
Florian Weimer	6a52d5cab0	posix: Fix double-free after allocation failure in regcomp (bug 33185) If a memory allocation failure occurs during bracket expression parsing in regcomp, a double-free error may result. Reported-by: Anastasia Belova <abelova@astralinux.ru> Co-authored-by: Paul Eggert <eggert@cs.ucla.edu> Reviewed-by: Andreas K. Huettel <dilfridge@gentoo.org> (cherry picked from commit `7ea06e9940`)	2025-07-24 09:50:02 +02:00
Sam James	21019afe65	malloc: cleanup casts in tst-calloc Followup to `c3d1dac96b`. As pointed out by Maciej W. Rozycki, the casts are obviously useless now. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `fc8f253d80`)	2025-07-11 13:57:07 -04:00
Sam James	a637f2c42f	malloc: obscure calloc use in tst-calloc Similar to `a9944a52c9` and `f9493a15ea`, we need to hide calloc use from the compiler to accommodate GCC's r15-6566-g804e9d55d9e54c change. First, include tst-malloc-aux.h, but then use `volatile` variables for size. The test passes without the tst-malloc-aux.h change but IMO we want it there for consistency and to avoid future problems (possibly silent). Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `c3d1dac96b`)	2025-07-11 13:57:07 -04:00
Sam James	879f0ee122	malloc: add indirection for malloc(-like) functions in tests [BZ #32366 ] GCC 15 introduces allocation dead code removal (DCE) for PR117370 in r15-5255-g7828dc070510f8. This breaks various glibc tests which want to assert various properties of the allocator without doing anything obviously useful with the allocated memory. Alexander Monakov rightly pointed out that we can and should do better than passing -fno-malloc-dce to paper over the problem. Not least because GCC 14 already does such DCE where there's no testing of malloc's return value against NULL, and LLVM has such optimisations too. Handle this by providing malloc (and friends) wrappers with a volatile function pointer to obscure that we're calling malloc (et. al) from the compiler. Reviewed-by: Paul Eggert <eggert@cs.ucla.edu> (cherry picked from commit `a9944a52c9`)	2025-07-11 13:57:07 -04:00
Florian Weimer	51210d6496	nptl: PTHREAD_COND_INITIALIZER compatibility with pre-2.41 versions (bug 32786) [BZ #25847] The new initializer and struct layout does not initialize the __g_signals field in the old struct layout before the change in commit `c36fc50781` ("nptl: Remove g_refs from condition variables"). Bring back fields at the end of struct __pthread_cond_s, so that they are again zero-initialized. (cherry picked from commit `dbc5a50d12`) Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2025-07-11 13:56:56 -04:00
Malte Skarupke	39a80f4035	nptl: Use all of g1_start and g_signals [BZ #25847] The LSB of g_signals was unused. The LSB of g1_start was used to indicate which group is G2. This was used to always go to sleep in pthread_cond_wait if a waiter is in G2. A comment earlier in the file says that this is not correct to do: "Waiters cannot determine whether they are currently in G2 or G1 -- but they do not have to because all they are interested in is whether there are available signals" I either would have had to update the comment, or get rid of the check. I chose to get rid of the check. In fact I don't quite know why it was there. There will never be available signals for group G2, so we didn't need the special case. Even if there were, this would just be a spurious wake. This might have caught some cases where the count has wrapped around, but it wouldn't reliably do that, (and even if it did, why would you want to force a sleep in that case?) and we don't support that many concurrent waiters anyway. Getting rid of it allows us to use one more bit, making us more robust to wraparound. (cherry picked from commit `91bb902f58`) Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2025-07-11 13:56:09 -04:00
Malte Skarupke	8899e89b29	nptl: rename __condvar_quiesce_and_switch_g1 [BZ #25847] This function no longer waits for threads to leave g1, so rename it to __condvar_switch_g1 (cherry picked from commit `4b79e27a50`) Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2025-07-11 13:56:03 -04:00
Malte Skarupke	5765653697	nptl: Fix indentation [BZ #25847] In my previous change I turned a nested loop into a simple loop. I'm doing the resulting indentation changes in a separate commit to make the diff on the previous commit easier to review. (cherry picked from commit `ee6c14ed59`) Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2025-07-11 13:55:48 -04:00
Malte Skarupke	6bac834c5a	nptl: Use a single loop in pthread_cond_wait instaed of a nested loop [BZ #25847] The loop was a little more complicated than necessary. There was only one break statement out of the inner loop, and the outer loop was nearly empty. So just remove the outer loop, moving its code to the one break statement in the inner loop. This allows us to replace all gotos with break statements. (cherry picked from commit `929a4764ac`) Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2025-07-11 13:55:40 -04:00
Malte Skarupke	7625579f11	nptl: Remove g_refs from condition variables [BZ #25847] This variable used to be needed to wait in group switching until all sleepers have confirmed that they have woken. This is no longer needed. Nothing waits on this variable so there is no need to track how many threads are currently asleep in each group. (cherry picked from commit `c36fc50781`) Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2025-07-11 13:55:31 -04:00
Malte Skarupke	44eaf0615d	nptl: Remove unnecessary quadruple check in pthread_cond_wait [BZ #25847] pthread_cond_wait was checking whether it was in a closed group no less than four times. Checking once is enough. Here are the four checks: 1. While spin-waiting. This was dead code: maxspin is set to 0 and has been for years. 2. Before deciding to go to sleep, and before incrementing grefs: I kept this 3. After incrementing grefs. There is no reason to think that the group would close while we do an atomic increment. Obviously it could close at any point, but that doesn't mean we have to recheck after every step. This check was equally good as check 2, except it has to do more work. 4. When we find ourselves in a group that has a signal. We only get here after we check that we're not in a closed group. There is no need to check again. The check would only have helped in cases where the compare_exchange in the next line would also have failed. Relying on the compare_exchange is fine. Removing the duplicate checks clarifies the code. (cherry picked from commit `4f7b051f8e`) Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2025-07-11 13:55:25 -04:00
Malte Skarupke	1fa5e51897	nptl: Remove unnecessary catch-all-wake in condvar group switch [BZ #25847] This wake is unnecessary. We only switch groups after every sleeper in a group has been woken. Sure, they may take a while to actually wake up and may still hold a reference, but waking them a second time doesn't speed that up. Instead this just makes the code more complicated and may hide problems. In particular this safety wake wouldn't even have helped with the bug that was fixed by Barrus' patch: The bug there was that pthread_cond_signal would not switch g1 when it should, so we wouldn't even have entered this code path. (cherry picked from commit `b42cc6af11`) Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2025-07-11 13:55:18 -04:00
Malte Skarupke	b5c4727e59	nptl: Update comments and indentation for new condvar implementation [BZ #25847] Some comments were wrong after the most recent commit. This fixes that. Also fixing indentation where it was using spaces instead of tabs. (cherry picked from commit `0cc973160c`) Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2025-07-11 13:55:01 -04:00
Frank Barrus	1a0d73a625	pthreads NPTL: lost wakeup fix 2 [BZ #25847] This fixes the lost wakeup (from a bug in signal stealing) with a change in the usage of g_signals[] in the condition variable internal state. It also completely eliminates the concept and handling of signal stealing, as well as the need for signalers to block to wait for waiters to wake up every time there is a G1/G2 switch. This greatly reduces the average and maximum latency for pthread_cond_signal. The g_signals[] field now contains a signal count that is relative to the current g1_start value. Since it is a 32-bit field, and the LSB is still reserved (though not currently used anymore), it has a 31-bit value that corresponds to the low 31 bits of the sequence number in g1_start. (since g1_start also has an LSB flag, this means bits 31:1 in g_signals correspond to bits 31:1 in g1_start, plus the current signal count) By making the signal count relative to g1_start, there is no longer any ambiguity or A/B/A issue, and thus any checks before blocking, including the futex call itself, are guaranteed not to block if the G1/G2 switch occurs, even if the signal count remains the same. This allows initially safely blocking in G2 until the switch to G1 occurs, and then transitioning from G1 to a new G1 or G2, and always being able to distinguish the state change. This removes the race condition and A/B/A problems that otherwise ocurred if a late (pre-empted) waiter were to resume just as the futex call attempted to block on g_signal since otherwise there was no last opportunity to re-check things like whether the current G1 group was already closed. By fixing these issues, the signal stealing code can be eliminated, since there is no concept of signal stealing anymore. The code to block for all waiters to exit g_refs can also be removed, since any waiters that are still in the g_refs region can be guaranteed to safely wake up and exit. If there are still any left at this time, they are all sent one final futex wakeup to ensure that they are not blocked any longer, but there is no need for the signaller to block and wait for them to wake up and exit the g_refs region. The signal count is then effectively "zeroed" but since it is now relative to g1_start, this is done by advancing it to a new value that can be observed by any pending blocking waiters. Any late waiters can always tell the difference, and can thus just cleanly exit if they are in a stale G1 or G2. They can never steal a signal from the current G1 if they are not in the current G1, since the signal value that has to match in the cmpxchg has the low 31 bits of the g1_start value contained in it, and that's first checked, and then it won't match if there's a G1/G2 change. Note: the 31-bit sequence number used in g_signals is designed to handle wrap-around when checking the signal count, but if the entire 31-bit wraparound (2 billion signals) occurs while there is still a late waiter that has not yet resumed, and it happens to then match the current g1_start low bits, and the pre-emption occurs after the normal "closed group" checks (which are 64-bit) but then hits the futex syscall and signal consuming code, then an A/B/A issue could still result and cause an incorrect assumption about whether it should block. This particular scenario seems unlikely in practice. Note that once awake from the futex, the waiter would notice the closed group before consuming the signal (since that's still a 64-bit check that would not be aliased in the wrap-around in g_signals), so the biggest impact would be blocking on the futex until the next full wakeup from a G1/G2 switch. (cherry picked from commit `1db84775f8`) Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>	2025-07-11 13:54:25 -04:00
Florian Weimer	5a6276d97a	Fix error reporting (false negatives) in SGID tests And simplify the interface of support_capture_subprogram_self_sgid. Use the existing framework for temporary directories (now with mode 0700) and directory/file deletion. Handle all execution errors within support_capture_subprogram_self_sgid. In particular, this includes test failures because the invoked program did not exit with exit status zero. Existing tests that expect exit status 42 are adjusted to use zero instead. In addition, fix callers not to call exit (0) with test failures pending (which may mask them, especially when running with --direct). Fixes commit `35fc356fa3` ("elf: Fix subprocess status handling for tst-dlopen-sgid (bug 32987)"). Reviewed-by: Carlos O'Donell <carlos@redhat.com> (cherry picked from commit `3a3fb2ed83`)	2025-06-20 11:12:24 +02:00
Florian Weimer	81f58dd9b7	support: Pick group in support_capture_subprogram_self_sgid if UID == 0 When running as root, it is likely that we can run under any group. Pick a harmless group from /etc/group in this case. Reviewed-by: Carlos O'Donell <carlos@redhat.com> (cherry picked from commit `2f769cec44`)	2025-06-20 11:10:41 +02:00
Florian Weimer	ca7e32d024	elf: Fix subprocess status handling for tst-dlopen-sgid (bug 32987) This should really move into support_capture_subprogram_self_sgid. Reviewed-by: Sam James <sam@gentoo.org> (cherry picked from commit `35fc356fa3`)	2025-05-21 08:57:17 +02:00
Sunil K Pandey	ca41fe44a5	x86_64: Fix typo in ifunc-impl-list.c. Fix wcsncpy and wcpncpy typo in ifunc-impl-list.c. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `f2aeb6ff94`)	2025-05-20 16:49:58 -07:00
Florian Weimer	31fa0f73e2	elf: Test case for bug 32976 (CVE-2025-4802) Check that LD_LIBRARY_PATH is ignored for AT_SECURE statically linked binaries, using support_capture_subprogram_self_sgid. Reviewed-by: Carlos O'Donell <carlos@redhat.com> (cherry picked from commit `d8f7a79335`)	2025-05-20 20:26:07 +02:00
Florian Weimer	4335cd9b58	support: Add support_record_failure_barrier This can be used to stop execution after a TEST_COMPARE_BLOB failure, for example. (cherry picked from commit `d0b8aa6de4`)	2025-05-20 20:26:07 +02:00
Florian Weimer	454f24e981	support: Use const char * argument in support_capture_subprogram_self_sgid The function does not modify the passed-in string, so make this clear via the prototype. Reviewed-by: Carlos O'Donell <carlos@redhat.com> (cherry picked from commit `f0c09fe616`)	2025-05-20 20:26:07 +02:00
Adhemerval Zanella	3be3728df2	elf: Ignore LD_LIBRARY_PATH and debug env var for setuid for static It mimics the ld.so behavior. Checked on x86_64-linux-gnu. Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org> (cherry picked from commit `5451fa962c`) Changes: git/elf/dl-support.c (missing commit `55f41ef8de` ("elf: Remove LD_PROFILE for static binaries"))	2025-05-20 20:25:57 +02:00
Wilco Dijkstra	5a08d049dc	math: Improve layout of exp/exp10 data GCC aligns global data to 16 bytes if their size is >= 16 bytes. This patch changes the exp_data struct slightly so that the fields are better aligned and without gaps. As a result on targets that support them, more load-pair instructions are used in exp. The exp benchmark improves 2.5%, "144bits" by 7.2%, "768bits" by 12.7% on Neoverse V2. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `5afaf99edb`)	2025-02-28 14:52:00 +00:00
Wilco Dijkstra	097299ffa9	AArch64: Use prefer_sve_ifuncs for SVE memset Use prefer_sve_ifuncs for SVE memset just like memcpy. Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com> (cherry picked from commit `0f044be1da`)	2025-02-28 14:44:25 +00:00
Wilco Dijkstra	52c2b1556f	AArch64: Add SVE memset Add SVE memset based on the generic memset with predicated load for sizes < 16. Unaligned memsets of 128-1024 are improved by ~20% on average by using aligned stores for the last 64 bytes. Performance of random memset benchmark improves by ~2% on Neoverse V1. Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com> (cherry picked from commit `163b1bbb76`)	2025-02-28 14:44:23 +00:00
Wilco Dijkstra	3de5112326	math: Improve layout of expf data GCC aligns global data to 16 bytes if their size is >= 16 bytes. This patch changes the exp2f_data struct slightly so that the fields are better aligned. As a result on targets that support them, load-pair instructions accessing poly_scaled and invln2_scaled are now 16-byte aligned. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `44fa9c1080`)	2025-02-28 14:42:29 +00:00
Wilco Dijkstra	5fe151d86a	AArch64: Remove zva_128 from memset Remove ZVA 128 support from memset - the new memset no longer guarantees count >= 256, which can result in underflow and a crash if ZVA size is 128 ([1]). Since only one CPU uses a ZVA size of 128 and its memcpy implementation was removed in commit `e162ab2bf1`, remove this special case too. [1] https://sourceware.org/pipermail/libc-alpha/2024-November/161626.html Reviewed-by: Andrew Pinski <quic_apinski@quicinc.com> (cherry picked from commit `a08d9a52f9`)	2025-02-28 14:42:29 +00:00
Wilco Dijkstra	95aa21432c	AArch64: Optimize memset Improve small memsets by avoiding branches and use overlapping stores. Use DC ZVA for copies over 128 bytes. Remove unnecessary code for ZVA sizes other than 64 and 128. Performance of random memset benchmark improves by 24% on Neoverse N1. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `cec3aef324`)	2025-02-28 14:42:27 +00:00
Wilco Dijkstra	9ca74b8ad1	AArch64: Improve generic strlen Improve performance by handling another 16 bytes before entering the loop. Use ADDHN in the loop to avoid SHRN+FMOV when it terminates. Change final size computation to avoid increasing latency. On Neoverse V1 performance of the random strlen benchmark improves by 4.6%. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `3dc426b642`)	2025-02-28 14:41:56 +00:00
Siddhesh Poyarekar	f984e2d7e8	assert: Add test for CVE-2025-0395 Use the __progname symbol to override the program name to induce the failure that CVE-2025-0395 describes. This is related to BZ #32582 Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `cdb9ba8419`)	2025-02-13 13:02:23 -05:00
H.J. Lu	650a0aaaff	stdlib: Test using setenv with updated environ [BZ #32588 ] Add a test for setenv with updated environ. Verify that BZ #32588 is fixed. Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: Florian Weimer <fweimer@redhat.com> (cherry picked from commit `8ab34497de`)	2025-01-25 10:30:58 +08:00
Siddhesh Poyarekar	c32fd59314	Fix underallocation of abort_msg_s struct (CVE-2025-0395) Include the space needed to store the length of the message itself, in addition to the message string. This resolves BZ #32582. Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org> Reviewed: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `68ee0f704c`)	2025-01-22 17:10:37 +01:00
Florian Weimer	549e7f7c5a	elf: Support recursive use of dynamic TLS in interposed malloc It turns out that quite a few applications use bundled mallocs that have been built to use global-dynamic TLS (instead of the recommended initial-exec TLS). The previous workaround from commit `afe42e935b` ("elf: Avoid some free (NULL) calls in _dl_update_slotinfo") does not fix all encountered cases unfortunatelly. This change avoids the TLS generation update for recursive use of TLS from a malloc that was called during a TLS update. This is possible because an interposed malloc has a fixed module ID and TLS slot. (It cannot be unloaded.) If an initially-loaded module ID is encountered in __tls_get_addr and the dynamic linker is already in the middle of a TLS update, use the outdated DTV, thus avoiding another call into malloc. It's still necessary to update the DTV to the most recent generation, to get out of the slow path, which is why the check for recursion is needed. The bookkeeping is done using a global counter instead of per-thread flag because TLS access in the dynamic linker is tricky. All this will go away once the dynamic linker stops using malloc for TLS, likely as part of a change that pre-allocates all TLS during pthread_create/dlopen. Fixes commit `d2123d6827` ("elf: Fix slow tls access after dlopen [BZ #19924]"). Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com> (cherry picked from commit `018f0fc3b8`)	2025-01-10 16:49:38 -08:00
Florian Weimer	48642ef1a5	elf: Avoid some free (NULL) calls in _dl_update_slotinfo This has been confirmed to work around some interposed mallocs. Here is a discussion of the impact test ust/libc-wrapper/test_libc-wrapper in lttng-tools: New TLS usage in libgcc_s.so.1, compatibility impact <https://inbox.sourceware.org/libc-alpha/8734v1ieke.fsf@oldenburg.str.redhat.com/> Reportedly, this patch also papers over a similar issue when tcmalloc 2.9.1 is not compiled with -ftls-model=initial-exec. Of course the goal really should be to compile mallocs with the initial-exec TLS model, but this commit appears to be a useful interim workaround. Fixes commit `d2123d6827` ("elf: Fix slow tls access after dlopen [BZ #19924]"). Reviewed-by: Carlos O'Donell <carlos@redhat.com> (cherry picked from commit `afe42e935b`)	2025-01-10 16:48:26 -08:00
Noah Goldstein	12fec8aae5	x86/string: Fixup alignment of main loop in str{n}cmp-evex [BZ #32212 ] The loop should be aligned to 32-bytes so that it can ideally run out the DSB. This is particularly important on Skylake-Server where deficiencies in it's DSB implementation make it prone to not being able to run loops out of the DSB. For example running strcmp-evex on 200Mb string: 32-byte aligned loop: - 43,399,578,766 idq.dsb_uops not 32-byte aligned loop: - 6,060,139,704 idq.dsb_uops This results in a 25% performance degradation for the non-aligned version. The fix is to just ensure the code layout is such that the loop is aligned. (Which was previously the case but was accidentally dropped in `84e7c46df`). NB: The fix was actually 64-byte alignment. This is because 64-byte alignment generally produces more stable performance than 32-byte aligned code (cache line crosses can affect perf), so if we are going past 16-byte alignmnent, might as well go to 64. 64-byte alignment also matches most other functions we over-align, so it creates a common point of optimization. Times are reported as ratio of Time_With_Patch / Time_Without_Patch. Lower is better. The values being reported is the geometric mean of the ratio across all tests in bench-strcmp and bench-strncmp. Note this patch is only attempting to improve the Skylake-Server strcmp for long strings. The rest of the numbers are only to test for regressions. Tigerlake Results Strings <= 512: strcmp : 1.026 strncmp: 0.949 Tigerlake Results Strings > 512: strcmp : 0.994 strncmp: 0.998 Skylake-Server Results Strings <= 512: strcmp : 0.945 strncmp: 0.943 Skylake-Server Results Strings > 512: strcmp : 0.778 strncmp: 1.000 The 2.6% regression on TGL-strcmp is due to slowdowns caused by changes in alignment of code handling small sizes (most on the page-cross logic). These should be safe to ignore because 1) We previously only 16-byte aligned the function so this behavior is not new and was essentially up to chance before this patch and 2) this type of alignment related regression on small sizes really only comes up in tight micro-benchmark loops and is unlikely to have any affect on realworld performance. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `483443d321`)	2025-01-09 17:23:28 -08:00
Noah Goldstein	04b8d48432	x86: Improve large memset perf with non-temporal stores [RHEL-29312] Previously we use `rep stosb` for all medium/large memsets. This is notably worse than non-temporal stores for large (above a few MBs) memsets. See: https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing For data using different stategies for large memset on ICX and SKX. Using non-temporal stores can be up to 3x faster on ICX and 2x faster on SKX. Historically, these numbers would not have been so good because of the zero-over-zero writeback optimization that `rep stosb` is able to do. But, the zero-over-zero writeback optimization has been removed as a potential side-channel attack, so there is no longer any good reason to only rely on `rep stosb` for large memsets. On the flip size, non-temporal writes can avoid data in their RFO requests saving memory bandwidth. All of the other changes to the file are to re-organize the code-blocks to maintain "good" alignment given the new code added in the `L(stosb_local)` case. The results from running the GLIBC memset benchmarks on TGL-client for N=20 runs: Geometric Mean across the suite New / Old EXEX256: 0.979 Geometric Mean across the suite New / Old EXEX512: 0.979 Geometric Mean across the suite New / Old AVX2 : 0.986 Geometric Mean across the suite New / Old SSE2 : 0.979 Most of the cases are essentially unchanged, this is mostly to show that adding the non-temporal case didn't add any regressions to the other cases. The results on the memset-large benchmark suite on TGL-client for N=20 runs: Geometric Mean across the suite New / Old EXEX256: 0.926 Geometric Mean across the suite New / Old EXEX512: 0.925 Geometric Mean across the suite New / Old AVX2 : 0.928 Geometric Mean across the suite New / Old SSE2 : 0.924 So roughly a 7.5% speedup. This is lower than what we see on servers (likely because clients typically have faster single-core bandwidth so saving bandwidth on RFOs is less impactful), but still advantageous. Full test-suite passes on x86_64 w/ and w/o multiarch. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> (cherry picked from commit `5bf0ab8057`)	2025-01-09 17:23:28 -08:00
Gabi Falk	dc1762113d	x86_64: Fix missing wcsncat function definition without multiarch (x86-64-v4) This code expects the WCSCAT preprocessor macro to be predefined in case the evex implementation of the function should be defined with a name different from __wcsncat_evex. However, when glibc is built for x86-64-v4 without multiarch support, sysdeps/x86_64/wcsncat.S defines WCSNCAT variable instead of WCSCAT to build it as wcsncat. Rename the variable to WCSNCAT, as it is actually a better naming choice for the variable in this case. Reported-by: Kenton Groombridge Link: https://bugs.gentoo.org/921945 Fixes: `64b8b6516b` ("x86: Add evex optimized functions for the wchar_t strcpy family") Signed-off-by: Gabi Falk <gabifalk@gmx.com> Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com> (cherry picked from commit `dd5f891c1a`)	2025-01-09 17:23:28 -08:00
H.J. Lu	0d14bf0754	sysdeps/x86/Makefile: Split and sort tests Put each test on a separate line and sort tests. (cherry picked from commit `7e03e0de7e`)	2025-01-09 17:23:28 -08:00
Noah Goldstein	5a64f93365	x86: Only align destination to 1x VEC_SIZE in memset 4x loop Current code aligns to 2x VEC_SIZE. Aligning to 2x has no affect on performance other than potentially resulting in an additional iteration of the loop. 1x maintains aligned stores (the only reason to align in this case) and doesn't incur any unnecessary loop iterations. Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com> (cherry picked from commit `9469261cf1`)	2025-01-09 17:23:28 -08:00
Szabolcs Nagy	7772f9358c	elf: Fix slow tls access after dlopen [BZ #19924 ] In short: __tls_get_addr checks the global generation counter and if the current dtv is older then _dl_update_slotinfo updates dtv up to the generation of the accessed module. So if the global generation is newer than generation of the module then __tls_get_addr keeps hitting the slow dtv update path. The dtv update path includes a number of checks to see if any update is needed and this already causes measurable tls access slow down after dlopen. It may be possible to detect up-to-date dtv faster. But if there are many modules loaded (> TLS_SLOTINFO_SURPLUS) then this requires at least walking the slotinfo list. This patch tries to update the dtv to the global generation instead, so after a dlopen the tls access slow path is only hit once. The modules with larger generation than the accessed one were not necessarily synchronized before, so additional synchronization is needed. This patch uses acquire/release synchronization when accessing the generation counter. Note: in the x86_64 version of dl-tls.c the generation is only loaded once, since relaxed mo is not faster than acquire mo load. I have not benchmarked this. Tested by Adhemerval Zanella on aarch64, powerpc, sparc, x86 who reported that it fixes the performance issue of bug 19924. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> (cherry picked from commit `d2123d6827`)	2025-01-09 07:31:25 -08:00

1 2 3 4 5 ...

40399 Commits All Branches Search

40399 Commits

All Branches