Commit Graph

332 Commits

Author SHA1 Message Date
Rafael Aquini 36fba2318e ELF: fix kernel.randomize_va_space double read
JIRA: https://issues.redhat.com/browse/RHEL-60757
CVE: CVE-2024-46826

This patch is a backport of the following upstream commit:
commit 2a97388a807b6ab5538aa8f8537b2463c6988bd2
Author: Alexey Dobriyan <adobriyan@gmail.com>
Date:   Fri Jun 21 21:54:50 2024 +0300

    ELF: fix kernel.randomize_va_space double read

    ELF loader uses "randomize_va_space" twice. It is sysctl and can change
    at any moment, so 2 loads could see 2 different values in theory with
    unpredictable consequences.

    Issue exactly one load for consistent value across one exec.

    Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
    Link: https://lore.kernel.org/r/3329905c-7eb8-400a-8f0a-d87cff979b5b@p183
    Signed-off-by: Kees Cook <kees@kernel.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-10-03 09:00:25 -04:00
Rafael Aquini 410830503d mm: always expand the stack with the mmap write lock held
JIRA: https://issues.redhat.com/browse/RHEL-27742
Conflicts:
  * arch/parisc/mm/fault.c: hunks dropped as there were merge conflicts not
       worth of fixing for this unsupported hardware arch;
  * drivers/iommu/amd/iommu_v2.c: hunk dropped given out-of-order backport
       of upstream commit 5a0b11a180a9 ("iommu/amd: Remove iommu_v2 module")
  * mm/memory.c: differences on the 2nd hunk due to upstream conflict with
       commit ca5e863233e8 ("mm/gup: remove vmas parameter from
       get_user_pages_remote()") that ended up solved by merge commit
       9471f1f2f502 ("Merge branch 'expand-stack'").

This patch is a backport of the following upstream commit:
commit 8d7071af890768438c14db6172cc8f9f4d04e184
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sat Jun 24 13:45:51 2023 -0700

    mm: always expand the stack with the mmap write lock held

    This finishes the job of always holding the mmap write lock when
    extending the user stack vma, and removes the 'write_locked' argument
    from the vm helper functions again.

    For some cases, we just avoid expanding the stack at all: drivers and
    page pinning really shouldn't be extending any stacks.  Let's see if any
    strange users really wanted that.

    It's worth noting that architectures that weren't converted to the new
    lock_mm_and_find_vma() helper function are left using the legacy
    "expand_stack()" function, but it has been changed to drop the mmap_lock
    and take it for writing while expanding the vma.  This makes it fairly
    straightforward to convert the remaining architectures.

    As a result of dropping and re-taking the lock, the calling conventions
    for this function have also changed, since the old vma may no longer be
    valid.  So it will now return the new vma if successful, and NULL - and
    the lock dropped - if the area could not be extended.

    Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
    Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64
    Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:19 -04:00
Rafael Aquini 0ce393dc54 mm: make find_extend_vma() fail if write lock not held
JIRA: https://issues.redhat.com/browse/RHEL-27742

This patch is a backport of the following upstream commit:
commit f440fa1ac955e2898893f9301568435eb5cdfc4b
Author: Liam R. Howlett <Liam.Howlett@oracle.com>
Date:   Fri Jun 16 15:58:54 2023 -0700

    mm: make find_extend_vma() fail if write lock not held

    Make calls to extend_vma() and find_extend_vma() fail if the write lock
    is required.

    To avoid making this a flag-day event, this still allows the old
    read-locking case for the trivial situations, and passes in a flag to
    say "is it write-locked".  That way write-lockers can say "yes, I'm
    being careful", and legacy users will continue to work in all the common
    cases until they have been fully converted to the new world order.

    Co-Developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Rafael Aquini <raquini@redhat.com>
2024-09-05 20:37:17 -04:00
Baoquan He 805ddef1b4 coredump, vmcore: Set p_align to 4 for PT_NOTE
JIRA: https://issues.redhat.com/browse/RHEL-32199

Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Conflict: There's fuzz in fs/binfmt_elf.c which need be handled
          manually.

commit 60592fb6b67c653beaa2e7acad9a9d7aa0b71dff
Author: Fangrui Song <maskray@google.com>
Date:   Fri May 12 02:25:28 2023 +0000

    coredump, vmcore: Set p_align to 4 for PT_NOTE

    Tools like readelf/llvm-readelf use p_align to parse a PT_NOTE program
    header as an array of 4-byte entries or 8-byte entries. Currently, there
    are workarounds[1] in place for Linux to treat p_align==0 as 4. However,
    it would be more appropriate to set the correct alignment so that tools
    do not have to rely on guesswork. FreeBSD coredumps set p_align to 4 as
    well.

    [1]: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=82ed9683ec099d8205dc499ac84febc975235af6
    [2]: https://reviews.llvm.org/D150022

    Signed-off-by: Fangrui Song <maskray@google.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20230512022528.3430327-1-maskray@google.com

Signed-off-by: Baoquan He <bhe@redhat.com>
2024-05-15 10:32:32 +08:00
Prarit Bhargava f681a58e1d elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size}
JIRA: https://issues.redhat.com/browse/RHEL-25415

commit 19e183b54528f11fafeca60fc6d0821e29ff281e
Author: Catalin Marinas <catalin.marinas@arm.com>
Date:   Thu Dec 22 18:12:50 2022 +0000

    elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size}

    A subsequent fix for arm64 will use this parameter to parse the vma
    information from the snapshot created by dump_vma_snapshot() rather than
    traversing the vma list without the mmap_lock.

    Fixes: 6dd8b1a0b6cb ("arm64: mte: Dump the MTE tags in the core file")
    Cc: <stable@vger.kernel.org> # 5.18.x
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    Reported-by: Seth Jenkins <sethjenkins@google.com>
    Suggested-by: Seth Jenkins <sethjenkins@google.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Eric Biederman <ebiederm@xmission.com>
    Cc: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20221222181251.1345752-3-catalin.marinas@arm.com
    Signed-off-by: Will Deacon <will@kernel.org>

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:42:53 -04:00
Prarit Bhargava 64c1109ba6 uninline elf_core_copy_task_fpregs() (and lose pt_regs argument)
JIRA: https://issues.redhat.com/browse/RHEL-25415

Conflicts: Minor drift issues.

commit bdbadfcc37c5c8f9f2a401a18eae71b0c28799ee
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Sat Sep 3 20:23:56 2022 -0400

    [elf][non-regset] uninline elf_core_copy_task_fpregs() (and lose pt_regs argument)

    Don't bother with pointless macros - we are not sharing it with aout coredumps
    anymore.  Just convert the underlying functions to the same arguments (nobody
    uses regs, actually) and call them elf_core_copy_task_fpregs().  And unexport
    the entire bunch, while we are at it.

    [added missing includes in arch/{csky,m68k,um}/kernel/process.c to avoid extra
    warnings about the lack of externs getting added to huge piles for those
    files.  Pointless, but...]

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2024-03-20 09:42:49 -04:00
Jaroslav Kysela c2ecc3ca46 ELF: fix all "Elf" typos
Bugzilla: https://bugzilla.redhat.com/2179848

commit 70e79866ab36feaaed8ef26dacfbcbac6a0631c9
Author: Alexey Dobriyan <adobriyan@gmail.com>
Date: Tue Feb 28 15:14:17 2023 +0300

    ELF: fix all "Elf" typos

    ELF is acronym and therefore should be spelled in all caps.

    I left one exception at Documentation/arm/nwfpe/nwfpe.rst which looks like
    being written in the first person.

    Link: https://lkml.kernel.org/r/Y/3wGWQviIOkyLJW@p183
    Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Jaroslav Kysela <jkysela@redhat.com>
2023-06-21 16:22:19 +02:00
Jan Stancek 046843b8c8 Merge: Update KUNIT for el9.3
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/2163

```
This rebase brings in new kunit tests that we are already running for
ARK. Enable these here to increase our downstream testing efforts.

Specifically disable the CLK test. We have this enabled in ARK and have
been seeing unstable results since it has been added upstream, as well
as issues downstream here.

```
Reviewers Note:

My previous kunit rebases have avoided bringing in as many as the kunit tools patches as they are unused in our environment; however, due to the increase in changes that touch both kunit core and kunit tool files, these backports become more tedious. Because we dont utilize the kunit tools folder there is no harm in pulling in these patches.

Tested: https://beaker.engineering.redhat.com/jobs/7604155

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168378

Signed-off-by: Nico Pache <npache@redhat.com>

Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Mika Penttilä <mpenttil@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>

Signed-off-by: Jan Stancek <jstancek@redhat.com>
2023-04-29 10:45:38 +02:00
Nico Pache de55268d86 binfmt_elf: Introduce KUnit test
commit 9e1a3ce0a952450a1163cc93ab1df6d4fa8c8155
Author: Kees Cook <keescook@chromium.org>
Date:   Wed Feb 23 19:32:10 2022 -0800

    binfmt_elf: Introduce KUnit test

    Adds simple KUnit test for some binfmt_elf internals: specifically a
    regression test for the problem fixed by commit 8904d9cd90ee ("ELF:
    fix overflow in total mapping size calculation").

    $ ./tools/testing/kunit/kunit.py run --arch x86_64 \
        --kconfig_add CONFIG_IA32_EMULATION=y '*binfmt_elf'
    ...
    [19:41:08] ================== binfmt_elf (1 subtest) ==================
    [19:41:08] [PASSED] total_mapping_size_test
    [19:41:08] =================== [PASSED] binfmt_elf ====================
    [19:41:08] ============== compat_binfmt_elf (1 subtest) ===============
    [19:41:08] [PASSED] total_mapping_size_test
    [19:41:08] ================ [PASSED] compat_binfmt_elf ================
    [19:41:08] ============================================================
    [19:41:08] Testing complete. Passed: 2, Failed: 0, Crashed: 0, Skipped: 0, Errors: 0

    Cc: Eric Biederman <ebiederm@xmission.com>
    Cc: David Gow <davidgow@google.com>
    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: "Magnus Groß" <magnus.gross@rwth-aachen.de>
    Cc: kunit-dev@googlegroups.com
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Kees Cook <keescook@chromium.org>
    ---
    v1: https://lore.kernel.org/lkml/20220224054332.1852813-1-keescook@chromium.org
    v2:
     - improve commit log
     - fix comment URL (Daniel)
     - drop redundant KUnit Kconfig help info (Daniel)
     - note in Kconfig help that COMPAT builds add a compat test (David)

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168378
Signed-off-by: Nico Pache <npache@redhat.com>
2023-04-17 11:47:33 -06:00
Ricardo Robaina 379206e55a coredump: Use the vma snapshot in fill_files_note
Bugzilla: https://bugzilla.redhat.com/2169741
CVE: CVE-2023-1249

commit 390031c942116d4733310f0684beb8db19885fe6
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Tue Mar 8 13:04:19 2022 -0600

    coredump: Use the vma snapshot in fill_files_note

    Matthew Wilcox reported that there is a missing mmap_lock in
    file_files_note that could possibly lead to a user after free.

    Solve this by using the existing vma snapshot for consistency
    and to avoid the need to take the mmap_lock anywhere in the
    coredump code except for dump_vma_snapshot.

    Update the dump_vma_snapshot to capture vm_pgoff and vm_file
    that are neeeded by fill_files_note.

    Add free_vma_snapshot to free the captured values of vm_file.

    Reported-by: Matthew Wilcox <willy@infradead.org>
    Link: https://lkml.kernel.org/r/20220131153740.2396974-1-willy@infradead.org
    Cc: stable@vger.kernel.org
    Fixes: a07279c9a8 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
    Fixes: 2aa362c49c ("coredump: extend core dump note section to contain file names of mapped files")
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
2023-03-29 21:09:19 -03:00
Chris von Recklinghausen 7c80076e63 coredump: Snapshot the vmas in do_coredump
Bugzilla: https://bugzilla.redhat.com/2160210

commit 95c5436a4883841588dae86fb0b9325f47ba5ad3
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Tue Mar 8 12:55:29 2022 -0600

    coredump: Snapshot the vmas in do_coredump

    Move the call of dump_vma_snapshot and kvfree(vma_meta) out of the
    individual coredump routines into do_coredump itself.  This makes
    the code less error prone and easier to maintain.

    Make the vma snapshot available to the coredump routines
    in struct coredump_params.  This makes it easier to
    change and update what is captures in the vma snapshot
    and will be needed for fixing fill_file_notes.

    Reviewed-by: Jann Horn <jannh@google.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-03-24 11:18:33 -04:00
Mark Salter a3fc407147 binfmt_elf: Don't write past end of notes for regset gap
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122232

commit dd664099002db909912a23215f8775c97f7f4f10
Author: Rick Edgecombe <rick.p.edgecombe@intel.com>
Date: Thu, 17 Mar 2022 12:20:13 -0700

    In fill_thread_core_info() the ptrace accessible registers are collected
    to be written out as notes in a core file. The note array is allocated
    from a size calculated by iterating the user regset view, and counting the
    regsets that have a non-zero core_note_type. However, this only allows for
    there to be non-zero core_note_type at the end of the regset view. If
    there are any gaps in the middle, fill_thread_core_info() will overflow the
    note allocation, as it iterates over the size of the view and the
    allocation would be smaller than that.

    There doesn't appear to be any arch that has gaps such that they exceed
    the notes allocation, but the code is brittle and tries to support
    something it doesn't. It could be fixed by increasing the allocation size,
    but instead just have the note collecting code utilize the array better.
    This way the allocation can stay smaller.

    Even in the case of no arch's that have gaps in their regset views, this
    introduces a change in the resulting indicies of t->notes. It does not
    introduce any changes to the core file itself, because any blank notes are
    skipped in write_note_info().

    In case, the allocation logic between fill_note_info() and
    fill_thread_core_info() ever diverges from the usage logic, warn and skip
    writing any notes that would overflow the array.

    This fix is derrived from an earlier one[0] by Yu-cheng Yu.

    [0] https://lore.kernel.org/lkml/20180717162502.32274-1-yu-cheng.yu@intel.com/

    Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
    Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
    Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Link: https://lore.kernel.org/r/20220317192013.13655-4-rick.p.edgecombe@intel.com

Signed-off-by: Mark Salter <msalter@redhat.com>
2023-01-28 11:34:25 -05:00
Mark Salter df653a61ef coredump/elf: Pass coredump_params into fill_note_info
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122232

commit 9ec7d3230717b4fe9b6c7afeb4811909c23fa1d7
Author: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Mon, 31 Jan 2022 12:17:38 -0600

    Instead of individually passing cprm->siginfo and cprm->regs
    into fill_note_info pass all of struct coredump_params.

    This is preparation to allow fill_files_note to use the existing
    vma snapshot.

    Reviewed-by: Jann Horn <jannh@google.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Mark Salter <msalter@redhat.com>
2023-01-28 11:34:25 -05:00
Chris von Recklinghausen 2ef8f32db2 coredump: Limit coredumps to a single thread group
Bugzilla: https://bugzilla.redhat.com/2120352

commit 0258b5fd7c7124b87e185a1a9322d2c66b1876b7
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Wed Sep 22 11:24:02 2021 -0500

    coredump: Limit coredumps to a single thread group

    Today when a signal is delivered with a handler of SIG_DFL whose
    default behavior is to generate a core dump not only that process but
    every process that shares the mm is killed.

    In the case of vfork this looks like a real world problem.  Consider
    the following well defined sequence.

            if (vfork() == 0) {
                    execve(...);
                    _exit(EXIT_FAILURE);
            }

    If a signal that generates a core dump is received after vfork but
    before the execve changes the mm the process that called vfork will
    also be killed (as the mm is shared).

    Similarly if the execve fails after the point of no return the kernel
    delivers SIGSEGV which will kill both the exec'ing process and because
    the mm is shared the process that called vfork as well.

    As far as I can tell this behavior is a violation of people's
    reasonable expectations, POSIX, and is unnecessarily fragile when the
    system is low on memory.

    Solve this by making a userspace visible change to only kill a single
    process/thread group.  This is possible because Jann Horn recently
    modified[1] the coredump code so that the mm can safely be modified
    while the coredump is happening.  With LinuxThreads long gone I don't
    expect anyone to have a notice this behavior change in practice.

    To accomplish this move the core_state pointer from mm_struct to
    signal_struct, which allows different thread groups to coredump
    simultatenously.

    In zap_threads remove the work to kill anything except for the current
    thread group.

    v2: Remove core_state from the VM_BUG_ON_MM print to fix
        compile failure when CONFIG_DEBUG_VM is enabled.
        Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>

    [1] a07279c9a8 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
    Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
    History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
    Link: https://lkml.kernel.org/r/87y27mvnke.fsf@disp2133
    Link: https://lkml.kernel.org/r/20211007144701.67592574@canb.auug.org.au
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:25 -04:00
Rafael Aquini 2a286bf7a6 binfmt: remove in-tree usage of MAP_DENYWRITE
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 4589ff7ca81516381393649dec8dd4948884b2b2
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Apr 23 09:42:41 2021 +0200

    binfmt: remove in-tree usage of MAP_DENYWRITE

    At exec time when we mmap the new executable via MAP_DENYWRITE we have it
    opened via do_open_execat() and already deny_write_access()'ed the file
    successfully. Once exec completes, we allow_write_acces(); however,
    we set mm->exe_file in begin_new_exec() via set_mm_exe_file() and
    also deny_write_access() as long as mm->exe_file remains set. We'll
    effectively deny write access to our executable via mm->exe_file
    until mm->exe_file is changed -- when the process is removed, on new
    exec, or via sys_prctl(PR_SET_MM_MAP/EXE_FILE).

    Let's remove all usage of MAP_DENYWRITE, it's no longer necessary for
    mm->exe_file.

    In case of an elf interpreter, we'll now only deny write access to the file
    during exec. This is somewhat okay, because the interpreter behaves
    (and sometime is) a shared library; all shared libraries, especially the
    ones loaded directly in user space like via dlopen() won't ever be mapped
    via MAP_DENYWRITE, because we ignore that from user space completely;
    these shared libraries can always be modified while mapped and executed.
    Let's only special-case the main executable, denying write access while
    being executed by a process. This can be considered a minor user space
    visible change.

    While this is a cleanup, it also fixes part of a problem reported with
    VM_DENYWRITE on overlayfs, as VM_DENYWRITE is effectively unused with
    this patch and will be removed next:
      "Overlayfs did not honor positive i_writecount on realfile for
       VM_DENYWRITE mappings." [1]

    [1] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/

    Reported-by: Chengguang Xu <cgxu519@mykernel.net>
    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Acked-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: David Hildenbrand <david@redhat.com>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:31 -05:00
Rafael Aquini f1f64a1731 binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 42be8b42535183f84df99acbaf799e38724348f3
Author: David Hildenbrand <david@redhat.com>
Date:   Thu Apr 22 12:53:00 2021 +0200

    binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()

    uselib() is the legacy systemcall for loading shared libraries.
    Nowadays, applications use dlopen() to load shared libraries, completely
    implemented in user space via mmap().

    For example, glibc uses MAP_COPY to mmap shared libraries. While this
    maps to MAP_PRIVATE | MAP_DENYWRITE on Linux, Linux ignores any
    MAP_DENYWRITE specification from user space in mmap.

    With this change, all remaining in-tree users of MAP_DENYWRITE use it
    to map an executable. We will be able to open shared libraries loaded
    via uselib() writable, just as we already can via dlopen() from user
    space.

    This is one step into the direction of removing MAP_DENYWRITE from the
    kernel. This can be considered a minor user space visible change.

    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Acked-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: David Hildenbrand <david@redhat.com>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:29 -05:00
Linus Torvalds 65090f30ab Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "191 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
  slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
  mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
  pagealloc, and memory-failure)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
  mm,hwpoison: make get_hwpoison_page() call get_any_page()
  mm,hwpoison: send SIGBUS with error virutal address
  mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
  mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
  mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
  mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
  docs: remove description of DISCONTIGMEM
  arch, mm: remove stale mentions of DISCONIGMEM
  mm: remove CONFIG_DISCONTIGMEM
  m68k: remove support for DISCONTIGMEM
  arc: remove support for DISCONTIGMEM
  arc: update comment about HIGHMEM implementation
  alpha: remove DISCONTIGMEM and NUMA
  mm/page_alloc: move free_the_page
  mm/page_alloc: fix counting of managed_pages
  mm/page_alloc: improve memmap_pages dbg msg
  mm: drop SECTION_SHIFT in code comments
  mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
  mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
  mm/page_alloc: scale the number of pages that are batch freed
  ...
2021-06-29 17:29:11 -07:00
David Hildenbrand a4eec6a3df binfmt: remove in-tree usage of MAP_EXECUTABLE
Ever since commit e9714acf8c ("mm: kill vma flag VM_EXECUTABLE and
mm->num_exe_file_vmas"), VM_EXECUTABLE is gone and MAP_EXECUTABLE is
essentially completely ignored.  Let's remove all usage of MAP_EXECUTABLE.

[akpm@linux-foundation.org: fix blooper in fs/binfmt_aout.c. per David]

Link: https://lkml.kernel.org/r/20210421093453.6904-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:50 -07:00
Peter Zijlstra 2f064a59a1 sched: Change task_struct::state
Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
2021-06-18 11:43:09 +02:00
Al Viro d0f1088b31 coredump: don't bother with do_truncate()
have dump_skip() just remember how much needs to be skipped,
leave actual seeks/writing zeroes to the next dump_emit()
or the end of coredump output, whichever comes first.
And instead of playing with do_truncate() in the end, just
write one NUL at the end of the last gap (if any).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-03-08 10:21:11 -05:00
Linus Torvalds 08179b47e1 Merge branch 'parisc-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:

 - Optimize parisc page table locks by using the existing
   page_table_lock

 - Export argv0-preserve flag in binfmt_misc for usage in qemu-user

 - Fix interrupt table (IVT) checksum so firmware will call crash
   handler (HPMC)

 - Increase IRQ stack to 64kb on 64-bit kernel

 - Switch to common devmem_is_allowed() implementation

 - Minor fix to get_whan()

* 'parisc-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  binfmt_misc: pass binfmt_misc flags to the interpreter
  parisc: Optimize per-pagetable spinlocks
  parisc: Replace test_ti_thread_flag() with test_tsk_thread_flag()
  parisc: Bump 64-bit IRQ stack size to 64 KB
  parisc: Fix IVT checksum calculation wrt HPMC
  parisc: Use the generic devmem_is_allowed()
  parisc: Drop out of get_whan() if task is running again
2021-02-21 13:20:41 -08:00
Laurent Vivier 2347961b11 binfmt_misc: pass binfmt_misc flags to the interpreter
It can be useful to the interpreter to know which flags are in use.

For instance, knowing if the preserve-argv[0] is in use would
allow to skip the pathname argument.

This patch uses an unused auxiliary vector, AT_FLAGS, to add a
flag to inform interpreter if the preserve-argv[0] is enabled.

Note by Helge Deller:
The real-world user of this patch is qemu-user, which needs to know
if it has to preserve the argv[0]. See Debian bug #970460.

Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Reviewed-by: YunQiang Su <ysu@wavecomp.com>
URL: http://bugs.debian.org/970460
Signed-off-by: Helge Deller <deller@gmx.de>
2021-02-15 18:28:30 +01:00
Al Viro f2485a2dc9 elf_prstatus: collect the common part (everything before pr_reg) into a struct
Preparations to doing i386 compat elf_prstatus sanely - rather than duplicating
the beginning of compat_elf_prstatus, take these fields into a separate
structure (compat_elf_prstatus_common), so that it could be reused.  Due to
the incestous relationship between binfmt_elf.c and compat_binfmt_elf.c we
need the same shape change done to native struct elf_prstatus, gathering the
fields prior to pr_reg into a new structure (struct elf_prstatus_common).

Fortunately, offset of pr_reg is always a multiple of 16 with no padding
right before it, so it's possible to turn all the stuff prior to it into
a single member without disturbing the layout.

[build fix from Geert Uytterhoeven folded in]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-06 08:38:29 -05:00
Al Viro 8a00dd0012 binfmt_elf: partially sanitize PRSTATUS_SIZE and SET_PR_FPVALID
On 64bit architectures that support 32bit processes there are
two possible layouts for NT_PRSTATUS note in ELF coredumps.
For one thing, several fields are 64bit for native processes
and 32bit for compat ones (pr_sigpend, etc.).  For another,
the register dump is obviously different - the size and number
of registers are not going to be the same for 32bit and 64bit
variants of processor.

Usually that's handled by having two structures - elf_prstatus
for native layout and compat_elf_prstatus for 32bit one.
32bit processes are handled by fs/compat_binfmt_elf.c, which
defines a macro called 'elf_prstatus' that expands to compat_elf_prstatus.
Then it includes fs/binfmt_elf.c, which makes all references to
struct elf_prstatus to be textually replaced with struct
compat_elf_prstatus.  Ugly and somewhat brittle, but it works.

However, amd64 is worse - there are _three_ possible layouts.
One for native 64bit processes, another for i386 (32bit) processes
and yet another for x32 (32bit address space with full 64bit
registers).

Both i386 and x32 processes are handled by fs/compat_binfmt_elf.c,
with usual compat_binfmt_elf.c trickery.  However, the layouts
for i386 and x32 are not identical - they have the common beginning,
but the register dump part (pr_reg) is bigger on x32.  Worse, pr_reg
is not the last field - it's followed by int pr_fpvalid, so that
field ends up at different offsets for i386 and x32 layouts.

Fortunately, there's not much code that cares about any of that -
it's all encapsulated in fill_thread_core_info().  Since x32
variant is bigger, we define compat_elf_prstatus to match that
layout.  That way i386 processes have enough space to fit
their layout into.

Moreover, since these layouts are identical prior to pr_reg,
we don't need to distinguish x32 and i386 cases when we are
setting the fields prior to pr_reg.

Filling pr_reg itself is done by calling ->get() method of
appropriate regset, and that method knows what layout (and size)
to use.

We do need to distinguish x32 and i386 cases only for two
things: setting ->pr_fpvalid (offset differs for x32 and
i386) and choosing the right size for our note.

The way it's done is Not Nice, for the lack of more accurate
printable description.  There are two macros (PRSTATUS_SIZE and
SET_PR_FPVALID), that default essentially to sizeof(struct elf_prstatus)
and (S)->pr_fpvalid = 1.  On x86 asm/compat.h provides its own
variants.

Unfortunately, quite a few things go wrong there:
	* PRSTATUS_SIZE doesn't use the normal test for process
being an x32 one; it compares the size reported by regset with
the size of pr_reg.
	* it hardcodes the sizes of x32 and i386 variants (296 and 144
resp.), so if some change in includes leads to asm/compat.h pulled
in by fs/binfmt_elf.c we are in trouble - it will end up using
the size of x32 variant for 64bit processes.
	* it's in the wrong place; asm/compat.h couldn't define
the structure for i386 layout, since it lacks quite a few types
needed for it.  Hardcoded sizes are largely due to that.

The proper fix would be to have an explicitly defined i386 variant
of structure and have PRSTATUS_SIZE/SET_PR_FPVALID check for
TIF_X32 to choose the variant that should be used.  Unfortunately,
that requires some manipulations of headers; we'll do that later
in the series, but for now let's go with the minimal variant -
rename PRSTATUS_SIZE in asm/compat.h to COMPAT_PRSTATUS_SIZE,
have fs/compat_binfmt_elf.c define PRSTATUS_SIZE to COMPAT_PRSTATUS_SIZE
and use the normal TIF_X32 check in that macro.  The size of i386 variant
is kept hardcoded for now.  Similar story for SET_PR_FPVALID.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-04 18:27:37 -05:00
Linus Torvalds faf145d6f3 Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
 "This set of changes ultimately fixes the interaction of posix file
  lock and exec. Fundamentally most of the change is just moving where
  unshare_files is called during exec, and tweaking the users of
  files_struct so that the count of files_struct is not unnecessarily
  played with.

  Along the way fcheck and related helpers were renamed to more
  accurately reflect what they do.

  There were also many other small changes that fell out, as this is the
  first time in a long time much of this code has been touched.

  Benchmarks haven't turned up any practical issues but Al Viro has
  observed a possibility for a lot of pounding on task_lock. So I have
  some changes in progress to convert put_files_struct to always rcu
  free files_struct. That wasn't ready for the merge window so that will
  have to wait until next time"

* 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  exec: Move io_uring_task_cancel after the point of no return
  coredump: Document coredump code exclusively used by cell spufs
  file: Remove get_files_struct
  file: Rename __close_fd_get_file close_fd_get_file
  file: Replace ksys_close with close_fd
  file: Rename __close_fd to close_fd and remove the files parameter
  file: Merge __alloc_fd into alloc_fd
  file: In f_dupfd read RLIMIT_NOFILE once.
  file: Merge __fd_install into fd_install
  proc/fd: In fdinfo seq_show don't use get_files_struct
  bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
  proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
  file: Implement task_lookup_next_fd_rcu
  kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
  proc/fd: In tid_fd_mode use task_lookup_fd_rcu
  file: Implement task_lookup_fd_rcu
  file: Rename fcheck lookup_fd_rcu
  file: Replace fcheck_files with files_lookup_fd_rcu
  file: Factor files_lookup_fd_locked out of fcheck_files
  file: Rename __fcheck_files to files_lookup_fd_raw
  ...
2020-12-15 19:29:43 -08:00
Linus Torvalds 405f868f13 - Remove all uses of TIF_IA32 and TIF_X32 and reclaim the two bits in the end
(Gabriel Krisman Bertazi)
 
 - All kinds of minor cleanups all over the tree.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XgtoACgkQEsHwGGHe
 VUqGuA/9GqN2zNQdhgRvAQ+FLZiOYK9MfXcoayfMq8T61VRPDBWaQRfVYKmfmEjS
 0l5OnYgZQ9n6vzqFy6pmgc/ix8Jr553dZp5NCamcOqjCTcuO/LwRRh+ZBeFSBTPi
 r2qFYKKRYvM7nbyUMm4WqvAakxJ18xsjNbIslr9Aqe8WtHBKKX3MOu8SOpFtGyXz
 aEc4rhsS45iZa5gTXhvOn73tr3yHGWU1rzyyAAAmDGTgAxRwsTna8v16C4+v+Bua
 Zg18Wiutj8ZjtFpzKJtGWGZoSBap3Jw2Ys64g42MBQUE56KY/99tQVo/SvbYvvlf
 PHWLH0f3rPNJ6J2qeKwhtNzPlEAH/6e416A1/6TVwsK+8pdfGmkfaQh2iDHLhJ5i
 CSwF61H44ZaE3pc1tHHbC5ALvydPlup7D4MKgztfq0mZ3OoV2Vg7dtyyr+Ybz72b
 G+Kl/tmyacQTXo0FiYbZKETo3/VfTdBXGyVax1rHkx3pt8zvhFg3kxb1TT/l/CoM
 eSTx53PtTdVtbGOq1CjnUm0FKlbh4+kLoNuo9DYKeXUQBs8PWOCZmL3wXmm4cqlZ
 mDZVWvll7CjToY8izzcE/AG279cWkgcL5Tcg7W7CR66+egfDdpuqOZ4tv4TyzoWq
 0J7WeNj+TAo98b7RA0Ux8LOlszRxS2ykuI6uB2MgwCaRMbbaQao=
 =lLiH
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Borislav Petkov:
 "Another branch with a nicely negative diffstat, just the way I
  like 'em:

   - Remove all uses of TIF_IA32 and TIF_X32 and reclaim the two bits in
     the end (Gabriel Krisman Bertazi)

   - All kinds of minor cleanups all over the tree"

* tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/ia32_signal: Propagate __user annotation properly
  x86/alternative: Update text_poke_bp() kernel-doc comment
  x86/PCI: Make a kernel-doc comment a normal one
  x86/asm: Drop unused RDPID macro
  x86/boot/compressed/64: Use TEST %reg,%reg instead of CMP $0,%reg
  x86/head64: Remove duplicate include
  x86/mm: Declare 'start' variable where it is used
  x86/head/64: Remove unused GET_CR2_INTO() macro
  x86/boot: Remove unused finalize_identity_maps()
  x86/uaccess: Document copy_from_user_nmi()
  x86/dumpstack: Make show_trace_log_lvl() static
  x86/mtrr: Fix a kernel-doc markup
  x86/setup: Remove unused MCA variables
  x86, libnvdimm/test: Remove COPY_MC_TEST
  x86: Reclaim TIF_IA32 and TIF_X32
  x86/mm: Convert mmu context ia32_compat into a proper flags field
  x86/elf: Use e_machine to check for x32/ia32 in setup_additional_pages()
  elf: Expose ELF header on arch_setup_additional_pages()
  x86/elf: Use e_machine to select start_thread for x32
  elf: Expose ELF header in compat_start_thread()
  ...
2020-12-14 13:45:26 -08:00
Eric W. Biederman c39ab6de22 coredump: Document coredump code exclusively used by cell spufs
Oleg Nesterov recently asked[1] why is there an unshare_files in
do_coredump.  After digging through all of the callers of lookup_fd it
turns out that it is
arch/powerpc/platforms/cell/spufs/coredump.c:coredump_next_context
that needs the unshare_files in do_coredump.

Looking at the history[2] this code was also the only piece of coredump code
that required the unshare_files when the unshare_files was added.

Looking at that code it turns out that cell is also the only
architecture that implements elf_coredump_extra_notes_size and
elf_coredump_extra_notes_write.

I looked at the gdb repo[3] support for cell has been removed[4] in binutils
2.34.  Geoff Levand reports he is still getting questions on how to
run modern kernels on the PS3, from people using 3rd party firmware so
this code is not dead.  According to Wikipedia the last PS3 shipped in
Japan sometime in 2017.  So it will probably be a little while before
everyone's hardware dies.

Add some comments briefly documenting the coredump code that exists
only to support cell spufs to make it easier to understand the
coredump code.  Eventually the hardware will be dead, or their won't
be userspace tools, or the coredump code will be refactored and it
will be too difficult to update a dead architecture and these comments
make it easy to tell where to pull to remove cell spufs support.

[1] https://lkml.kernel.org/r/20201123175052.GA20279@redhat.com
[2] 179e037fc1 ("do_coredump(): make sure that descriptor table isn't shared")
[3] git://sourceware.org/git/binutils-gdb.git
[4] abf516c6931a ("Remove Cell Broadband Engine debugging support").
Link: https://lkml.kernel.org/r/87h7pdnlzv.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:57:08 -06:00
Gustavo A. R. Silva 5e01fdff04 fs: Replace zero-length array with flexible-array member
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-10-29 17:22:59 -05:00
Gabriel Krisman Bertazi 9a29a67190 elf: Expose ELF header on arch_setup_additional_pages()
Like it is done for SET_PERSONALITY with ARM, which requires the ELF
header to select correct personality parameters, x86 requires the
headers when selecting which VDSO to load, instead of relying on the
going-away TIF_IA32/X32 flags.

Add an indirection macro to arch_setup_additional_pages(), that x86 can
reimplement to receive the extra parameter just for ELF files.  This
requires no changes to other architectures, who can continue to use the
original arch_setup_additional_pages for ELF and non-ELF binaries.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201004032536.1229030-8-krisman@collabora.com
2020-10-26 13:46:47 +01:00
Gabriel Krisman Bertazi bc3d7bf61a elf: Expose ELF header in compat_start_thread()
Like it is done for SET_PERSONALITY with x86, which requires the ELF header
to select correct personality parameters, x86 requires the headers on
compat_start_thread() to choose starting CS for ELF32 binaries, instead of
relying on the going-away TIF_IA32/X32 flags.

Add an indirection macro to ELF invocations of START_THREAD, that x86 can
reimplement to receive the extra parameter just for ELF files.  This
requires no changes to other architectures who don't need the header
information, they can continue to use the original start_thread for ELF and
non-ELF binaries, and it prevents affecting non-ELF code paths for x86.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201004032536.1229030-6-krisman@collabora.com
2020-10-26 13:46:46 +01:00
Jann Horn b2767d97f5 binfmt_elf: take the mmap lock around find_extend_vma()
create_elf_tables() runs after setup_new_exec(), so other tasks can
already access our new mm and do things like process_madvise() on it.  (At
the time I'm writing this commit, process_madvise() is not in mainline
yet, but has been in akpm's tree for some time.)

While I believe that there are currently no APIs that would actually allow
another process to mess up our VMA tree (process_madvise() is limited to
MADV_COLD and MADV_PAGEOUT, and uring and userfaultfd cannot reach an mm
under which no syscalls have been executed yet), this seems like an
accident waiting to happen.

Let's make sure that we always take the mmap lock around GUP paths as long
as another process might be able to see the mm.

(Yes, this diff looks suspicious because we drop the lock before doing
anything with `vma`, but that's because we actually don't do anything with
it apart from the NULL check.)

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michel Lespinasse <walken@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Link: https://lkml.kernel.org/r/CAG48ez1-PBCdv3y8pn-Ty-b+FmBSLwDuVKFSt8h7wARLy0dF-Q@mail.gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18 09:27:09 -07:00
Jann Horn a07279c9a8 binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot
In both binfmt_elf and binfmt_elf_fdpic, use a new helper
dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
VMA, if we have one) while protected by the mmap_lock, and then use that
snapshot instead of walking the VMA list without locking.

An alternative approach would be to keep the mmap_lock held across the
entire core dumping operation; however, keeping the mmap_lock locked while
we may be blocked for an unbounded amount of time (e.g.  because we're
dumping to a FUSE filesystem or so) isn't really optimal; the mmap_lock
blocks things like the ->release handler of userfaultfd, and we don't
really want critical system daemons to grind to a halt just because
someone "gifted" them SCM_RIGHTS to an eternally-locked userfaultfd, or
something like that.

Since both the normal ELF code and the FDPIC ELF code need this
functionality (and if any other binfmt wants to add coredump support in
the future, they'd probably need it, too), implement this with a common
helper in fs/coredump.c.

A downside of this approach is that we now need a bigger amount of kernel
memory per userspace VMA in the normal ELF case, and that we need O(n)
kernel memory in the FDPIC ELF case at all; but 40 bytes per VMA shouldn't
be terribly bad.

There currently is a data race between stack expansion and anything that
reads ->vm_start or ->vm_end under the mmap_lock held in read mode; to
mitigate that for core dumping, take the mmap_lock in write mode when
taking a snapshot of the VMA hierarchy.  (If we only took the mmap_lock in
read mode, we could end up with a corrupted core dump if someone does
get_user_pages_remote() concurrently.  Not really a major problem, but
taking the mmap_lock either way works here, so we might as well avoid the
issue.) (This doesn't do anything about the existing data races with stack
expansion in other mm code.)

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-6-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:21 -07:00
Jann Horn 429a22e776 coredump: rework elf/elf_fdpic vma_dump_size() into common helper
At the moment, the binfmt_elf and binfmt_elf_fdpic code have slightly
different code to figure out which VMAs should be dumped, and if so,
whether the dump should contain the entire VMA or just its first page.

Eliminate duplicate code by reworking the binfmt_elf version into a
generic core dumping helper in coredump.c.

As part of that, change the heuristic for detecting executable/library
header pages to check whether the inode is executable instead of looking
at the file mode.

This is less problematic in terms of locking because it lets us avoid
get_user() under the mmap_sem.  (And arguably it looks nicer and makes
more sense in generic code.)

Adjust a little bit based on the binfmt_elf_fdpic version: ->anon_vma is
only meaningful under CONFIG_MMU, otherwise we have to assume that the VMA
has been written to.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-5-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:21 -07:00
Jann Horn afc63a97b7 coredump: refactor page range dumping into common helper
Both fs/binfmt_elf.c and fs/binfmt_elf_fdpic.c need to dump ranges of
pages into the coredump file.  Extract that logic into a common helper.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-4-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:21 -07:00
Chris Kennelly ce81bb256a fs/binfmt_elf: use PT_LOAD p_align values for suitable start address
Patch series "Selecting Load Addresses According to p_align", v3.

The current ELF loading mechancism provides page-aligned mappings.  This
can lead to the program being loaded in a way unsuitable for file-backed,
transparent huge pages when handling PIE executables.

While specifying -z,max-page-size=0x200000 to the linker will generate
suitably aligned segments for huge pages on x86_64, the executable needs
to be loaded at a suitably aligned address as well.  This alignment
requires the binary's cooperation, as distinct segments need to be
appropriately paddded to be eligible for THP.

For binaries built with increased alignment, this limits the number of
bits usable for ASLR, but provides some randomization over using fixed
load addresses/non-PIE binaries.

This patch (of 2):

The current ELF loading mechancism provides page-aligned mappings.  This
can lead to the program being loaded in a way unsuitable for file-backed,
transparent huge pages when handling PIE executables.

For binaries built with increased alignment, this limits the number of
bits usable for ASLR, but provides some randomization over using fixed
load addresses/non-PIE binaries.

Tested by verifying program with -Wl,-z,max-page-size=0x200000 loading.

[akpm@linux-foundation.org: fix max() warning]
[ckennelly@google.com: augment comment]
  Link: https://lkml.kernel.org/r/20200821233848.3904680-2-ckennelly@google.com

Signed-off-by: Chris Kennelly <ckennelly@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: https://lkml.kernel.org/r/20200820170541.1132271-1-ckennelly@google.com
Link: https://lkml.kernel.org/r/20200820170541.1132271-2-ckennelly@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:21 -07:00
Al Viro 7a896028ad kill elf_fpxregs_t
all uses are conditional upon ELF_CORE_COPY_XFPREGS, which has not
been defined on any architecture since 2010

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-07-27 14:29:23 -04:00
Al Viro b4e9c9549f introduction of regset ->get() wrappers, switching ELF coredumps to those
Two new helpers: given a process and regset, dump into a buffer.
regset_get() takes a buffer and size, regset_get_alloc() takes size
and allocates a buffer.

Return value in both cases is the amount of data actually dumped in
case of success or -E...  on error.

In both cases the size is capped by regset->n * regset->size, so
->get() is called with offset 0 and size no more than what regset
expects.

binfmt_elf.c callers of ->get() are switched to using those; the other
caller (copy_regset_to_user()) will need some preparations to switch.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-07-27 14:24:50 -04:00
Linus Torvalds 4382a79b27 Merge branch 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc uaccess updates from Al Viro:
 "Assorted uaccess patches for this cycle - the stuff that didn't fit
  into thematic series"

* 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  bpf: make bpf_check_uarg_tail_zero() use check_zeroed_user()
  x86: kvm_hv_set_msr(): use __put_user() instead of 32bit __clear_user()
  user_regset_copyout_zero(): use clear_user()
  TEST_ACCESS_OK _never_ had been checked anywhere
  x86: switch cp_stat64() to unsafe_put_user()
  binfmt_flat: don't use __put_user()
  binfmt_elf_fdpic: don't use __... uaccess primitives
  binfmt_elf: don't bother with __{put,copy_to}_user()
  pselect6() and friends: take handling the combined 6th/7th args into helper
2020-06-10 16:02:54 -07:00
Linus Torvalds 886d7de631 Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - More MM work. 100ish more to go. Mike Rapoport's "mm: remove
   __ARCH_HAS_5LEVEL_HACK" series should fix the current ppc issue

 - Various other little subsystems

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits)
  lib/ubsan.c: fix gcc-10 warnings
  tools/testing/selftests/vm: remove duplicate headers
  selftests: vm: pkeys: fix multilib builds for x86
  selftests: vm: pkeys: use the correct page size on powerpc
  selftests/vm/pkeys: override access right definitions on powerpc
  selftests/vm/pkeys: test correct behaviour of pkey-0
  selftests/vm/pkeys: introduce a sub-page allocator
  selftests/vm/pkeys: detect write violation on a mapped access-denied-key page
  selftests/vm/pkeys: associate key on a mapped page and detect write violation
  selftests/vm/pkeys: associate key on a mapped page and detect access violation
  selftests/vm/pkeys: improve checks to determine pkey support
  selftests/vm/pkeys: fix assertion in test_pkey_alloc_exhaust()
  selftests/vm/pkeys: fix number of reserved powerpc pkeys
  selftests/vm/pkeys: introduce powerpc support
  selftests/vm/pkeys: introduce generic pkey abstractions
  selftests: vm: pkeys: use the correct huge page size
  selftests/vm/pkeys: fix alloc_random_pkey() to make it really random
  selftests/vm/pkeys: fix assertion in pkey_disable_set/clear()
  selftests/vm/pkeys: fix pkey_disable_clear()
  selftests: vm: pkeys: add helpers for pkey bits
  ...
2020-06-04 19:18:29 -07:00
Anthony Iliopoulos 852991dd3a fs/binfmt_elf: remove redundant elf_map ifndef
The ifndef was added a long time ago to support archs that would define
their own mapping function.  The last user was the metag arch which was
removed from the tree, and as such there are no users left.  Let's kill
it.

Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200402161543.4119-1-ailiop@suse.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 19:06:25 -07:00
Linus Torvalds 15a2bc4dbb Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
 "Last cycle for the Nth time I ran into bugs and quality of
  implementation issues related to exec that could not be easily be
  fixed because of the way exec is implemented. So I have been digging
  into exec and cleanup up what I can.

  I don't think I have exec sorted out enough to fix the issues I
  started with but I have made some headway this cycle with 4 sets of
  changes.

   - promised cleanups after introducing exec_update_mutex

   - trivial cleanups for exec

   - control flow simplifications

   - remove the recomputation of bprm->cred

  The net result is code that is a bit easier to understand and work
  with and a decrease in the number of lines of code (if you don't count
  the added tests)"

* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (24 commits)
  exec: Compute file based creds only once
  exec: Add a per bprm->file version of per_clear
  binfmt_elf_fdpic: fix execfd build regression
  selftests/exec: Add binfmt_script regression test
  exec: Remove recursion from search_binary_handler
  exec: Generic execfd support
  exec/binfmt_script: Don't modify bprm->buf and then return -ENOEXEC
  exec: Move the call of prepare_binprm into search_binary_handler
  exec: Allow load_misc_binary to call prepare_binprm unconditionally
  exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds
  exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds
  exec: Teach prepare_exec_creds how exec treats uids & gids
  exec: Set the point of no return sooner
  exec: Move handling of the point of no return to the top level
  exec: Run sync_mm_rss before taking exec_update_mutex
  exec: Fix spelling of search_binary_handler in a comment
  exec: Move the comment from above de_thread to above unshare_sighand
  exec: Rename flush_old_exec begin_new_exec
  exec: Move most of setup_new_exec into flush_old_exec
  exec: In setup_new_exec cache current in the local variable me
  ...
2020-06-04 14:07:08 -07:00
Al Viro 646e84deb4 binfmt_elf: don't bother with __{put,copy_to}_user()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-03 16:56:47 -04:00
Linus Torvalds 8b39a57e96 Merge branch 'work.set_fs-exec' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess/coredump updates from Al Viro:
 "set_fs() removal in coredump-related area - mostly Christoph's
  stuff..."

* 'work.set_fs-exec' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  binfmt_elf_fdpic: remove the set_fs(KERNEL_DS) in elf_fdpic_core_dump
  binfmt_elf: remove the set_fs(KERNEL_DS) in elf_core_dump
  binfmt_elf: remove the set_fs in fill_siginfo_note
  signal: refactor copy_siginfo_to_user32
  powerpc/spufs: simplify spufs core dumping
  powerpc/spufs: stop using access_ok
  powerpc/spufs: fix copy_to_user while atomic
2020-06-01 16:21:46 -07:00
Linus Torvalds 533b220f7b arm64 updates for 5.8
- Branch Target Identification (BTI)
 	* Support for ARMv8.5-BTI in both user- and kernel-space. This
 	  allows branch targets to limit the types of branch from which
 	  they can be called and additionally prevents branching to
 	  arbitrary code, although kernel support requires a very recent
 	  toolchain.
 
 	* Function annotation via SYM_FUNC_START() so that assembly
 	  functions are wrapped with the relevant "landing pad"
 	  instructions.
 
 	* BPF and vDSO updates to use the new instructions.
 
 	* Addition of a new HWCAP and exposure of BTI capability to
 	  userspace via ID register emulation, along with ELF loader
 	  support for the BTI feature in .note.gnu.property.
 
 	* Non-critical fixes to CFI unwind annotations in the sigreturn
 	  trampoline.
 
 - Shadow Call Stack (SCS)
 	* Support for Clang's Shadow Call Stack feature, which reserves
 	  platform register x18 to point at a separate stack for each
 	  task that holds only return addresses. This protects function
 	  return control flow from buffer overruns on the main stack.
 
 	* Save/restore of x18 across problematic boundaries (user-mode,
 	  hypervisor, EFI, suspend, etc).
 
 	* Core support for SCS, should other architectures want to use it
 	  too.
 
 	* SCS overflow checking on context-switch as part of the existing
 	  stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.
 
 - CPU feature detection
 	* Removed numerous "SANITY CHECK" errors when running on a system
 	  with mismatched AArch32 support at EL1. This is primarily a
 	  concern for KVM, which disabled support for 32-bit guests on
 	  such a system.
 
 	* Addition of new ID registers and fields as the architecture has
 	  been extended.
 
 - Perf and PMU drivers
 	* Minor fixes and cleanups to system PMU drivers.
 
 - Hardware errata
 	* Unify KVM workarounds for VHE and nVHE configurations.
 
 	* Sort vendor errata entries in Kconfig.
 
 - Secure Monitor Call Calling Convention (SMCCC)
 	* Update to the latest specification from Arm (v1.2).
 
 	* Allow PSCI code to query the SMCCC version.
 
 - Software Delegated Exception Interface (SDEI)
 	* Unexport a bunch of unused symbols.
 
 	* Minor fixes to handling of firmware data.
 
 - Pointer authentication
 	* Add support for dumping the kernel PAC mask in vmcoreinfo so
 	  that the stack can be unwound by tools such as kdump.
 
 	* Simplification of key initialisation during CPU bringup.
 
 - BPF backend
 	* Improve immediate generation for logical and add/sub
 	  instructions.
 
 - vDSO
 	- Minor fixes to the linker flags for consistency with other
 	  architectures and support for LLVM's unwinder.
 
 	- Clean up logic to initialise and map the vDSO into userspace.
 
 - ACPI
 	- Work around for an ambiguity in the IORT specification relating
 	  to the "num_ids" field.
 
 	- Support _DMA method for all named components rather than only
 	  PCIe root complexes.
 
 	- Minor other IORT-related fixes.
 
 - Miscellaneous
 	* Initialise debug traps early for KGDB and fix KDB cacheflushing
 	  deadlock.
 
 	* Minor tweaks to early boot state (documentation update, set
 	  TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).
 
 	* Refactoring and cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl7U9csQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNLBHCACs/YU4SM7Om5f+7QnxIKao5DBr2CnGGvdC
 yTfDghFDTLQVv3MufLlfno3yBe5G8sQpcZfcc+hewfcGoMzVZXu8s7LzH6VSn9T9
 jmT3KjDMrg0RjSHzyumJp2McyelTk0a4FiKArSIIKsJSXUyb1uPSgm7SvKVDwEwU
 JGDzL9IGilmq59GiXfDzGhTZgmC37QdwRoRxDuqtqWQe5CHoRXYexg87HwBKOQxx
 HgU9L7ehri4MRZfpyjaDrr6quJo3TVnAAKXNBh3mZAskVS9ZrfKpEH0kYWYuqybv
 znKyHRecl/rrGePV8RTMtrwnSdU26zMXE/omsVVauDfG9hqzqm+Q
 =w3qi
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "A sizeable pile of arm64 updates for 5.8.

  Summary below, but the big two features are support for Branch Target
  Identification and Clang's Shadow Call stack. The latter is currently
  arm64-only, but the high-level parts are all in core code so it could
  easily be adopted by other architectures pending toolchain support

  Branch Target Identification (BTI):

   - Support for ARMv8.5-BTI in both user- and kernel-space. This allows
     branch targets to limit the types of branch from which they can be
     called and additionally prevents branching to arbitrary code,
     although kernel support requires a very recent toolchain.

   - Function annotation via SYM_FUNC_START() so that assembly functions
     are wrapped with the relevant "landing pad" instructions.

   - BPF and vDSO updates to use the new instructions.

   - Addition of a new HWCAP and exposure of BTI capability to userspace
     via ID register emulation, along with ELF loader support for the
     BTI feature in .note.gnu.property.

   - Non-critical fixes to CFI unwind annotations in the sigreturn
     trampoline.

  Shadow Call Stack (SCS):

   - Support for Clang's Shadow Call Stack feature, which reserves
     platform register x18 to point at a separate stack for each task
     that holds only return addresses. This protects function return
     control flow from buffer overruns on the main stack.

   - Save/restore of x18 across problematic boundaries (user-mode,
     hypervisor, EFI, suspend, etc).

   - Core support for SCS, should other architectures want to use it
     too.

   - SCS overflow checking on context-switch as part of the existing
     stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.

  CPU feature detection:

   - Removed numerous "SANITY CHECK" errors when running on a system
     with mismatched AArch32 support at EL1. This is primarily a concern
     for KVM, which disabled support for 32-bit guests on such a system.

   - Addition of new ID registers and fields as the architecture has
     been extended.

  Perf and PMU drivers:

   - Minor fixes and cleanups to system PMU drivers.

  Hardware errata:

   - Unify KVM workarounds for VHE and nVHE configurations.

   - Sort vendor errata entries in Kconfig.

  Secure Monitor Call Calling Convention (SMCCC):

   - Update to the latest specification from Arm (v1.2).

   - Allow PSCI code to query the SMCCC version.

  Software Delegated Exception Interface (SDEI):

   - Unexport a bunch of unused symbols.

   - Minor fixes to handling of firmware data.

  Pointer authentication:

   - Add support for dumping the kernel PAC mask in vmcoreinfo so that
     the stack can be unwound by tools such as kdump.

   - Simplification of key initialisation during CPU bringup.

  BPF backend:

   - Improve immediate generation for logical and add/sub instructions.

  vDSO:

   - Minor fixes to the linker flags for consistency with other
     architectures and support for LLVM's unwinder.

   - Clean up logic to initialise and map the vDSO into userspace.

  ACPI:

   - Work around for an ambiguity in the IORT specification relating to
     the "num_ids" field.

   - Support _DMA method for all named components rather than only PCIe
     root complexes.

   - Minor other IORT-related fixes.

  Miscellaneous:

   - Initialise debug traps early for KGDB and fix KDB cacheflushing
     deadlock.

   - Minor tweaks to early boot state (documentation update, set
     TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).

   - Refactoring and cleanup"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
  KVM: arm64: Move __load_guest_stage2 to kvm_mmu.h
  KVM: arm64: Check advertised Stage-2 page size capability
  arm64/cpufeature: Add get_arm64_ftr_reg_nowarn()
  ACPI/IORT: Remove the unused __get_pci_rid()
  arm64/cpuinfo: Add ID_MMFR4_EL1 into the cpuinfo_arm64 context
  arm64/cpufeature: Add remaining feature bits in ID_AA64PFR1 register
  arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register
  arm64/cpufeature: Add remaining feature bits in ID_AA64ISAR0 register
  arm64/cpufeature: Add remaining feature bits in ID_MMFR4 register
  arm64/cpufeature: Add remaining feature bits in ID_PFR0 register
  arm64/cpufeature: Introduce ID_MMFR5 CPU register
  arm64/cpufeature: Introduce ID_DFR1 CPU register
  arm64/cpufeature: Introduce ID_PFR2 CPU register
  arm64/cpufeature: Make doublelock a signed feature in ID_AA64DFR0
  arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
  arm64/cpufeature: Add explicit ftr_id_isar0[] for ID_ISAR0 register
  arm64: mm: Add asid_gen_match() helper
  firmware: smccc: Fix missing prototype warning for arm_smccc_version_init
  arm64: vdso: Fix CFI directives in sigreturn trampoline
  arm64: vdso: Don't prefix sigreturn trampoline with a BTI C instruction
  ...
2020-06-01 15:18:27 -07:00
Alexander Potapenko 1d605416fb fs/binfmt_elf.c: allocate initialized memory in fill_thread_core_info()
KMSAN reported uninitialized data being written to disk when dumping
core.  As a result, several kilobytes of kmalloc memory may be written
to the core file and then read by a non-privileged user.

Reported-by: sam <sunhaoyl@outlook.com>
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200419100848.63472-1-glider@google.com
Link: https://github.com/google/kmsan/issues/76
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-28 11:35:40 -07:00
Eric W. Biederman b8a61c9e7b exec: Generic execfd support
Most of the support for passing the file descriptor of an executable
to an interpreter already lives in the generic code and in binfmt_elf.
Rework the fields in binfmt_elf that deal with executable file
descriptor passing to make executable file descriptor passing a first
class concept.

Move the fd_install from binfmt_misc into begin_new_exec after the new
creds have been installed.  This means that accessing the file through
/proc/<pid>/fd/N is able to see the creds for the new executable
before allowing access to the new executables files.

Performing the install of the executables file descriptor after
the point of no return also means that nothing special needs to
be done on error.  The exiting of the process will close all
of it's open files.

Move the would_dump from binfmt_misc into begin_new_exec right
after would_dump is called on the bprm->file.  This makes it
obvious this case exists and that no nesting of bprm->file is
currently supported.

In binfmt_misc the movement of fd_install into generic code means
that it's special error exit path is no longer needed.

Link: https://lkml.kernel.org/r/87y2poyd91.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-21 10:16:57 -05:00
Eric W. Biederman 2388777a0a exec: Rename flush_old_exec begin_new_exec
There is and has been for a very long time been a lot more going on in
flush_old_exec than just flushing the old state.  After the movement
of code from setup_new_exec there is a whole lot more going on than
just flushing the old executables state.

Rename flush_old_exec to begin_new_exec to more accurately reflect
what this function does.

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-07 16:55:47 -05:00
Eric W. Biederman 96ecee29b0 exec: Merge install_exec_creds into setup_new_exec
The two functions are now always called one right after the
other so merge them together to make future maintenance easier.

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-07 16:55:47 -05:00
Christoph Hellwig d2530b436f binfmt_elf: remove the set_fs(KERNEL_DS) in elf_core_dump
There is no logic in elf_core_dump itself or in the various arch helpers
called from it which use uaccess routines on kernel pointers except for
the file writes thate are nicely encapsulated by using __kernel_write in
dump_emit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-05 16:46:10 -04:00
Eric W. Biederman fa4751f454 binfmt_elf: remove the set_fs in fill_siginfo_note
The code in binfmt_elf.c is differnt from the rest of the code that
processes siginfo, as it sends siginfo from a kernel buffer to a file
rather than from kernel memory to userspace buffers.  To remove it's
use of set_fs the code needs some different siginfo helpers.

Add the helper copy_siginfo_to_external to copy from the kernel's
internal siginfo layout to a buffer in the siginfo layout that
userspace expects.

Modify fill_siginfo_note to use copy_siginfo_to_external instead of
set_fs and copy_siginfo_to_user.

Update compat_binfmt_elf.c to use the previously added
copy_siginfo_to_external32 to handle the compat case.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-05 16:46:10 -04:00