Commit Graph

410 Commits

Author SHA1 Message Date
Patrick Talbert 143d5ac2a9 Merge: CVE-2024-50271: ucounts: Split rlimit and ucount values and max values
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/6027

JIRA: https://issues.redhat.com/browse/RHEL-68020

CVE: CVE-2024-50271

- 012f4d5d25e9ef92ee129bd5aa7aa60f692681e1 signal: restore the override_rlimit logic
- de399236e240743ad2dd10d719c37b97ddf31996 ucounts: Split rlimit and ucount values and max values

Signed-off-by: Radostin Stoyanov <radostin@redhat.com>

Approved-by: Rafael Aquini <raquini@redhat.com>
Approved-by: Herton R. Krzesinski <herton@redhat.com>
Approved-by: Waiman Long <longman@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Patrick Talbert <ptalbert@redhat.com>
2025-02-03 10:00:41 -05:00
Radostin Stoyanov 46364cd74c ucounts: Split rlimit and ucount values and max values
JIRA: https://issues.redhat.com/browse/RHEL-68020
CVE: CVE-2024-50271

commit de399236e240743ad2dd10d719c37b97ddf31996
Author: Alexey Gladkov <legion@kernel.org>
Date:   Wed May 18 19:17:30 2022 +0200

    ucounts: Split rlimit and ucount values and max values

    Since the semantics of maximum rlimit values are different, it would be
    better not to mix ucount and rlimit values. This will prevent the error
    of using inc_ucount/dec_ucount for rlimit parameters.

    This patch also renames the functions to emphasize the lack of
    connection between rlimit and ucount.

    v3:
    - Fix BUG:KASAN:use-after-free_in_dec_ucount.

    v2:
    - Fix the array-index-out-of-bounds that was found by the lkp project.

    Reported-by: kernel test robot <oliver.sang@intel.com>
    Signed-off-by: Alexey Gladkov <legion@kernel.org>
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    Link: https://lkml.kernel.org/r/20220518171730.l65lmnnjtnxnftpq@example.org
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

Signed-off-by: Radostin Stoyanov <radostin@redhat.com>
2024-12-20 15:31:08 +00:00
Herton R. Krzesinski b20a7c8f16 prctl: generalize PR_SET_MDWE support check to be per-arch
JIRA: https://issues.redhat.com/browse/RHEL-68912

commit d5aad4c2ca057e760a92a9a7d65bd38d72963f27
Author: Zev Weiss <zev@bewilderbeest.net>
Date:   Mon Feb 26 17:35:41 2024 -0800

    prctl: generalize PR_SET_MDWE support check to be per-arch

    Patch series "ARM: prctl: Reject PR_SET_MDWE where not supported".

    I noticed after a recent kernel update that my ARM926 system started
    segfaulting on any execve() after calling prctl(PR_SET_MDWE).  After some
    investigation it appears that ARMv5 is incapable of providing the
    appropriate protections for MDWE, since any readable memory is also
    implicitly executable.

    The prctl_set_mdwe() function already had some special-case logic added
    disabling it on PARISC (commit 793838138c15, "prctl: Disable
    prctl(PR_SET_MDWE) on parisc"); this patch series (1) generalizes that
    check to use an arch_*() function, and (2) adds a corresponding override
    for ARM to disable MDWE on pre-ARMv6 CPUs.

    With the series applied, prctl(PR_SET_MDWE) is rejected on ARMv5 and
    subsequent execve() calls (as well as mmap(PROT_READ|PROT_WRITE)) can
    succeed instead of unconditionally failing; on ARMv6 the prctl works as it
    did previously.

    [0] https://lore.kernel.org/all/2023112456-linked-nape-bf19@gregkh/

    This patch (of 2):

    There exist systems other than PARISC where MDWE may not be feasible to
    support; rather than cluttering up the generic code with additional
    arch-specific logic let's add a generic function for checking MDWE support
    and allow each arch to override it as needed.
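
A minimal sketch of the override pattern described above, written as plain C rather than kernel code. The names arch_mdwe_supported() and set_mdwe_model() are hypothetical stand-ins chosen for illustration; the upstream helper and its call site may differ in name and detail.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Generic fallback. An architecture that cannot enforce MDWE would provide
 * its own arch_mdwe_supported() plus the matching #define before this point,
 * which makes the preprocessor skip the default below.
 * All names here are hypothetical, not the kernel's.
 */
#ifndef arch_mdwe_supported
static inline bool arch_mdwe_supported(void)
{
	return true;	/* default: assume the architecture can enforce MDWE */
}
#define arch_mdwe_supported arch_mdwe_supported
#endif

/* Model of the generic prctl path: refuse up front when unsupported. */
static int set_mdwe_model(unsigned long bits)
{
	if (!arch_mdwe_supported())
		return -EINVAL;
	(void)bits;	/* a real implementation would record the policy here */
	return 0;
}
```

The point of the pattern is that the generic code carries no per-architecture conditionals; each architecture that needs a different answer supplies its own definition in its own header.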

    Link: https://lkml.kernel.org/r/20240227013546.15769-4-zev@bewilderbeest.net
    Link: https://lkml.kernel.org/r/20240227013546.15769-5-zev@bewilderbeest.net
    Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
    Acked-by: Helge Deller <deller@gmx.de>  [parisc]
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Florent Revest <revest@chromium.org>
    Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
    Cc: Josh Triplett <josh@joshtriplett.org>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Miguel Ojeda <ojeda@kernel.org>
    Cc: Mike Rapoport (IBM) <rppt@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Ondrej Mosnacek <omosnace@redhat.com>
    Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
    Cc: Russell King (Oracle) <linux@armlinux.org.uk>
    Cc: Sam James <sam@gentoo.org>
    Cc: Stefan Roesch <shr@devkernel.io>
    Cc: Yang Shi <yang@os.amperecomputing.com>
    Cc: Yin Fengwei <fengwei.yin@intel.com>
    Cc: <stable@vger.kernel.org>    [6.3+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2024-12-09 16:30:34 -03:00
Herton R. Krzesinski 78adbc699e prctl: Disable prctl(PR_SET_MDWE) on parisc
JIRA: https://issues.redhat.com/browse/RHEL-68912

commit 793838138c157d4c49f4fb744b170747e3dabf58
Author: Helge Deller <deller@gmx.de>
Date:   Sat Nov 18 19:33:35 2023 +0100

    prctl: Disable prctl(PR_SET_MDWE) on parisc

    systemd-254 tries to use prctl(PR_SET_MDWE) for its MemoryDenyWriteExecute
    functionality, but fails on parisc which still needs executable stacks in
    certain combinations of gcc/glibc/kernel.

    Disable prctl(PR_SET_MDWE) by returning -EINVAL for now on parisc, until
    userspace has caught up.

    Signed-off-by: Helge Deller <deller@gmx.de>
    Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reported-by: Sam James <sam@gentoo.org>
    Closes: https://github.com/systemd/systemd/issues/29775
    Tested-by: Sam James <sam@gentoo.org>
    Link: https://lore.kernel.org/all/875y2jro9a.fsf@gentoo.org/
    Cc: <stable@vger.kernel.org> # v6.3+

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2024-12-09 16:30:34 -03:00
Mamatha Inamdar 2bbc1dbdf1 powerpc/dexcr: Add DEXCR prctl interface
JIRA: https://issues.redhat.com/browse/RHEL-52762

commit 628d701f2de5b9a16d1dd82bea68fd895f56f1a1
Author: Benjamin Gray <bgray@linux.ibm.com>
Date:   Wed Apr 17 21:23:20 2024 +1000

    powerpc/dexcr: Add DEXCR prctl interface

    Now that we track a DEXCR on a per-task basis, individual tasks are free
    to configure it as they like.

    The interface is a pair of getter/setter prctl's that work on a single
    aspect at a time (multiple aspects at once is more difficult if there
    are different rules applied for each aspect, now or in future). The
    getter shows the current state of the process config, and the setter
    allows setting/clearing the aspect.

    Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
    [mpe: Account for PR_RISCV_SET_ICACHE_FLUSH_CTX, shrink some long lines]
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    Link: https://msgid.link/20240417112325.728010-5-bgray@linux.ibm.com

Signed-off-by: Mamatha Inamdar <minamdar@redhat.com>
2024-10-04 01:55:31 -04:00
Nico Pache 12c51977be mm: add a NO_INHERIT flag to the PR_SET_MDWE prctl
commit 24e41bf8a6b424c76c5902fb999e9eca61bdf83d
Author: Florent Revest <revest@chromium.org>
Date:   Mon Aug 28 17:08:57 2023 +0200

    mm: add a NO_INHERIT flag to the PR_SET_MDWE prctl

    This extends the current PR_SET_MDWE prctl arg with a bit to indicate that
    the process doesn't want MDWE protection to propagate to children.

    To implement this no-inherit mode, the tag in current->mm->flags must be
    absent from MMF_INIT_MASK.  This means that the encoding for "MDWE but
    without inherit" is different in the prctl than in the mm flags.  This
    leads to a bit of bit-mangling in the prctl implementation.
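
The encoding difference can be modelled in a few lines of plain C. This is a sketch under stated assumptions, not the kernel implementation: every MODEL_* name and bit position is hypothetical, and the rule that the no-inherit bit is only accepted together with the refuse-exec-gain bit is an assumption of this model.

```c
#include <errno.h>

/* Bits as seen by the prctl caller (model values). */
#define MODEL_PR_REFUSE_EXEC_GAIN  (1UL << 0)
#define MODEL_PR_NO_INHERIT        (1UL << 1)

/* Bits as stored in the per-mm flags word (hypothetical positions). */
#define MODEL_MM_HAS_MDWE          (1UL << 0)
#define MODEL_MM_NO_INHERIT        (1UL << 1)

/* Flags that survive fork(): the no-inherit tag is deliberately absent. */
#define MODEL_MM_INIT_MASK         MODEL_MM_HAS_MDWE

/* Translate prctl bits into mm flags. Returns 0 or a negative errno. */
static int model_set_mdwe(unsigned long *mm_flags, unsigned long bits)
{
	if (bits & ~(MODEL_PR_REFUSE_EXEC_GAIN | MODEL_PR_NO_INHERIT))
		return -EINVAL;
	/* Assumption: "no inherit" alone is meaningless, so reject it. */
	if ((bits & MODEL_PR_NO_INHERIT) && !(bits & MODEL_PR_REFUSE_EXEC_GAIN))
		return -EINVAL;
	if (bits & MODEL_PR_REFUSE_EXEC_GAIN)
		*mm_flags |= MODEL_MM_HAS_MDWE;
	if (bits & MODEL_PR_NO_INHERIT)
		*mm_flags |= MODEL_MM_NO_INHERIT;
	return 0;
}

/* Compute the flags a child would start with after fork(). */
static unsigned long model_fork_flags(unsigned long parent_flags)
{
	unsigned long child = parent_flags & MODEL_MM_INIT_MASK;

	/* A parent that asked for no-inherit passes no MDWE state on. */
	if (parent_flags & MODEL_MM_NO_INHERIT)
		child &= ~MODEL_MM_HAS_MDWE;
	return child;
}
```

In this model a parent with plain MDWE hands MODEL_MM_HAS_MDWE to its child, while a parent with MDWE plus no-inherit hands over nothing.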

    Link: https://lkml.kernel.org/r/20230828150858.393570-6-revest@chromium.org
    Signed-off-by: Florent Revest <revest@chromium.org>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Alexey Izbyshev <izbyshev@ispras.ru>
    Cc: Anshuman Khandual <anshuman.khandual@arm.com>
    Cc: Ayush Jain <ayush.jain3@amd.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Joey Gouly <joey.gouly@arm.com>
    Cc: KP Singh <kpsingh@kernel.org>
    Cc: Mark Brown <broonie@kernel.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
    Cc: Topi Miettinen <toiwoton@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
2024-04-30 17:51:24 -06:00
Chris von Recklinghausen 3f9448e50b mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0
JIRA: https://issues.redhat.com/browse/RHEL-27741

commit 24139c07f413ef4b555482c758343d71392a19bc
Author: David Hildenbrand <david@redhat.com>
Date:   Sat Apr 22 22:54:18 2023 +0200

    mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0

    Patch series "mm/ksm: improve PR_SET_MEMORY_MERGE=0 handling and cleanup
    disabling KSM", v2.

    (1) Make PR_SET_MEMORY_MERGE=0 unmerge pages like setting MADV_UNMERGEABLE
    does, (2) add a selftest for it and (3) factor out disabling of KSM from
    s390/gmap code.

    This patch (of 3):

    Let's unmerge any KSM pages when setting PR_SET_MEMORY_MERGE=0, and clear
    the VM_MERGEABLE flag from all VMAs -- just like KSM would.  Of course,
    only do that if we previously set PR_SET_MEMORY_MERGE=1.

    Link: https://lkml.kernel.org/r/20230422205420.30372-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20230422205420.30372-2-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Stefan Roesch <shr@devkernel.io>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Janosch Frank <frankja@linux.ibm.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:08 -04:00
Chris von Recklinghausen 3c00c5a05f mm: add new api to enable ksm per process
Conflicts: include/uapi/linux/prctl.h - fuzz

JIRA: https://issues.redhat.com/browse/RHEL-27741

commit d7597f59d1d33e9efbffa7060deb9ee5bd119e62
Author: Stefan Roesch <shr@devkernel.io>
Date:   Mon Apr 17 22:13:40 2023 -0700

    mm: add new api to enable ksm per process

    Patch series "mm: process/cgroup ksm support", v9.

    So far KSM can only be enabled by calling madvise for memory regions.  To
    be able to use KSM for more workloads, KSM needs to have the ability to be
    enabled / disabled at the process / cgroup level.

    Use case 1:
      The madvise call is not available in the programming language.  An
      example for this are programs with forked workloads using a garbage
      collected language without pointers.  In such a language madvise cannot
      be made available.

      In addition the addresses of objects get moved around as they are
      garbage collected.  KSM sharing needs to be enabled "from the outside"
      for these types of workloads.

    Use case 2:
      The same interpreter can also be used for workloads where KSM brings
      no benefit or even has overhead.  We'd like to be able to enable KSM on
      a workload by workload basis.

    Use case 3:
      With the madvise call sharing opportunities are only enabled for the
      current process: it is a workload-local decision.  A considerable number
      of sharing opportunities may exist across multiple workloads or jobs (if
      they are part of the same security domain).  Only a higher-level entity
      like a job scheduler or container can know for certain if it's running
      one or more instances of a job.  That job scheduler however doesn't have
      the necessary internal workload knowledge to make targeted madvise
      calls.

    Security concerns:

      In previous discussions security concerns have been brought up.  The
      problem is that an individual workload does not have the knowledge about
      what else is running on a machine.  Therefore it has to be very
      conservative in what memory areas can be shared or not.  However, if the
      system is dedicated to running multiple jobs within the same security
      domain, it's the job scheduler that has the knowledge that sharing can be
      safely enabled and is even desirable.

    Performance:

      Experiments with using UKSM have shown a capacity increase of around 20%.

      Here are the metrics from an instagram workload (taken from a machine
      with 64GB main memory):

       full_scans: 445
       general_profit: 20158298048
       max_page_sharing: 256
       merge_across_nodes: 1
       pages_shared: 129547
       pages_sharing: 5119146
       pages_to_scan: 4000
       pages_unshared: 1760924
       pages_volatile: 10761341
       run: 1
       sleep_millisecs: 20
       stable_node_chains: 167
       stable_node_chains_prune_millisecs: 2000
       stable_node_dups: 2751
       use_zero_pages: 0
       zero_pages_sharing: 0

    After the service is running for 30 minutes to an hour, 4 to 5 million
    shared pages are common for this workload when using KSM.

    Detailed changes:

    1. New options for prctl system command
       This patch series adds two new options to the prctl system call.
       The first one enables KSM at the process level and the second
       one queries the setting.

    The setting will be inherited by child processes.

    With the above setting, KSM can be enabled for the seed process of a cgroup
    and all processes in the cgroup will inherit the setting.

    2. Changes to KSM processing
       When KSM is enabled at the process level, the KSM code will iterate
       over all the VMA's and enable KSM for the eligible VMA's.

       When forking a process that has KSM enabled, the setting will be
       inherited by the new child process.

    3. Add general_profit metric
       The general_profit metric of KSM is specified in the documentation,
       but not calculated.  This adds the general profit metric to
       /sys/kernel/debug/mm/ksm.

    4. Add more metrics to ksm_stat
       This adds the process profit metric to /proc/<pid>/ksm_stat.

    5. Add more tests to ksm_tests and ksm_functional_tests
       This adds an option to specify the merge type to the ksm_tests.
       This allows testing of both madvise and prctl KSM.

       It also adds two new tests to ksm_functional_tests: one to test
       the new prctl options and the other a fork test to verify that
       the KSM process setting is inherited by child processes.

    This patch (of 3):

    So far KSM can only be enabled by calling madvise for memory regions.  To
    be able to use KSM for more workloads, KSM needs to have the ability to be
    enabled / disabled at the process / cgroup level.

    1. New options for prctl system command

       This patch series adds two new options to the prctl system call.
       The first one enables KSM at the process level and the second
       one queries the setting.

       The setting will be inherited by child processes.

       With the above setting, KSM can be enabled for the seed process of a
       cgroup and all processes in the cgroup will inherit the setting.

    2. Changes to KSM processing

       When KSM is enabled at the process level, the KSM code will iterate
       over all the VMA's and enable KSM for the eligible VMA's.

       When forking a process that has KSM enabled, the setting will be
       inherited by the new child process.

      1) Introduce new MMF_VM_MERGE_ANY flag

         This introduces the new MMF_VM_MERGE_ANY flag.  When this flag
         is set, kernel samepage merging (ksm) gets enabled for all vma's of a
         process.

      2) Setting VM_MERGEABLE on VMA creation

         When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
         VM_MERGEABLE flag will be set for this VMA.

      3) support disabling of ksm for a process

         This adds the ability to disable ksm for a process if ksm has been
         enabled for the process with prctl.

      4) add new prctl option to get and set ksm for a process

         This adds two new options to the prctl system call
         - enable ksm for all vmas of a process (if the vmas support it).
         - query if ksm has been enabled for a process.

    3. Disabling MMF_VM_MERGE_ANY for storage keys in s390

       In the s390 architecture when storage keys are used, the
       MMF_VM_MERGE_ANY will be disabled.
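
From userspace, the two prctl options described above might be used roughly as follows. This is a sketch, assuming a Linux system; the helper names are hypothetical, and the fallback option numbers are the values I believe linux/prctl.h uses, supplied only for older headers.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older headers (assumed values, see the note above). */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/* Enable (nonzero) or disable (0) KSM for the calling process.
 * Returns 0 on success or a negative errno. */
static int ksm_set_process_merge(int enable)
{
	if (prctl(PR_SET_MEMORY_MERGE, enable ? 1UL : 0UL, 0UL, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}

/* Query the per-process setting.
 * Returns 1 if enabled, 0 if not, or a negative errno. */
static int ksm_get_process_merge(void)
{
	int ret = prctl(PR_GET_MEMORY_MERGE, 0UL, 0UL, 0UL, 0UL);

	return ret == -1 ? -errno : ret;
}
```

A job launcher could call ksm_set_process_merge(1) in the seed process before forking workers, relying on the inheritance described above. On kernels without this feature the helpers are expected to return -EINVAL.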

    Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
    Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
    Signed-off-by: Stefan Roesch <shr@devkernel.io>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Bagas Sanjaya <bagasdotme@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2024-04-30 07:01:04 -04:00
Aristeu Rozanski c7f3247991 mm: implement memory-deny-write-execute as a prctl
JIRA: https://issues.redhat.com/browse/RHEL-27740
Tested: by me

commit b507808ebce23561d4ff8c2aa1fb949fe402bc61
Author: Joey Gouly <joey.gouly@arm.com>
Date:   Thu Jan 19 16:03:43 2023 +0000

    mm: implement memory-deny-write-execute as a prctl

    Patch series "mm: In-kernel support for memory-deny-write-execute (MDWE)",
    v2.

    The background to this is that systemd has a configuration option called
    MemoryDenyWriteExecute [2], implemented as a SECCOMP BPF filter.  Its aim
    is to prevent a user task from inadvertently creating an executable
    mapping that is (or was) writeable.  Since such BPF filter is stateless,
    it cannot detect mappings that were previously writeable but subsequently
    changed to read-only.  Therefore the filter simply rejects any
    mprotect(PROT_EXEC).  The side-effect is that on arm64 with BTI support
    (Branch Target Identification), the dynamic loader cannot change an ELF
    section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().  For
    libraries, it can resort to unmapping and re-mapping but for the main
    executable it does not have a file descriptor.  The original bug report in
    the Red Hat bugzilla - [3] - and subsequent glibc workaround for libraries
    - [4].

    This series adds in-kernel support for this feature as a prctl
    PR_SET_MDWE, that is inherited on fork().  The prctl denies PROT_WRITE |
    PROT_EXEC mappings.  Like the systemd BPF filter it also denies adding
    PROT_EXEC to mappings.  However unlike the BPF filter it only denies it if
    the mapping didn't previously have PROT_EXEC.  This allows a PROT_EXEC ->
    PROT_EXEC | PROT_BTI change with mprotect(), which is a problem with the
    BPF filter.

    This patch (of 2):

    The aim of such policy is to prevent a user task from creating an
    executable mapping that is also writeable.

    An example of mmap() returning -EACCES if the policy is enabled:

            mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);

    Similarly, mprotect() would return -EACCES below:

            addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
            mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);

    The BPF filter that systemd MDWE uses is stateless, and disallows
    mprotect() with PROT_EXEC completely. This new prctl allows PROT_EXEC to
    be enabled if it was already PROT_EXEC, which allows the following case:

            addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
            mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);

    where PROT_BTI enables Branch Target Identification on arm64.

    Link: https://lkml.kernel.org/r/20230119160344.54358-1-joey.gouly@arm.com
    Link: https://lkml.kernel.org/r/20230119160344.54358-2-joey.gouly@arm.com
    Signed-off-by: Joey Gouly <joey.gouly@arm.com>
    Co-developed-by: Catalin Marinas <catalin.marinas@arm.com>
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Jeremy Linton <jeremy.linton@arm.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Lennart Poettering <lennart@poettering.net>
    Cc: Mark Brown <broonie@kernel.org>
    Cc: nd <nd@arm.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
    Cc: Topi Miettinen <toiwoton@gmail.com>
    Cc: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
2024-04-29 14:33:11 -04:00
Alex Gladkov 004ada6a63 prlimit: do_prlimit needs to have a speculation check
Bugzilla: https://bugzilla.redhat.com/2196316
CVE: CVE-2023-0458
Upstream Status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=739790605705ddcf18f21782b9c99ad7d53a8c11

commit 739790605705ddcf18f21782b9c99ad7d53a8c11
Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date:   Fri Jan 20 11:03:20 2023 +0100

    prlimit: do_prlimit needs to have a speculation check

    do_prlimit() adds the user-controlled resource value to a pointer that
    will subsequently be dereferenced.  In order to help prevent this
    codepath from being used as a spectre "gadget" a barrier needs to be
    added after checking the range.
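
The idea of "range check, then clamp" can be sketched in plain C. This models the general technique only; the names are hypothetical and MODEL_NLIMITS is an arbitrary table size. A real implementation must also stop the compiler from optimizing the mask away, which this sketch does not attempt.

```c
#include <errno.h>
#include <limits.h>

#define MODEL_NLIMITS 16UL	/* hypothetical table size */

/*
 * Branch-free mask: all ones when index < size, zero otherwise.
 * Relies on arithmetic right shift of a negative long, which GCC and Clang
 * provide but ISO C leaves implementation-defined. Assumes size <= LONG_MAX.
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (sizeof(long) * CHAR_BIT - 1));
}

/* Look up a table entry using a caller-controlled index. */
static int lookup_limit(const unsigned long *limits, unsigned long resource,
			unsigned long *out)
{
	if (resource >= MODEL_NLIMITS)
		return -EINVAL;
	/*
	 * Clamp after the check: even if the branch above is mispredicted,
	 * the index used for the load stays inside the table.
	 */
	resource &= index_mask_nospec(resource, MODEL_NLIMITS);
	*out = limits[resource];
	return 0;
}
```

The architectural result is unchanged for valid indices; the mask only matters on the speculative path, where an out-of-range index collapses to zero.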

    Reported-by: Jordy Zomer <jordyzomer@google.com>
    Tested-by: Jordy Zomer <jordyzomer@google.com>
    Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Signed-off-by: Alex Gladkov <agladkov@redhat.com>
2023-05-09 13:46:37 +02:00
Mark Salter 36ad7ba002 arm64/sme: Implement vector length configuration prctl()s
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2122232

commit 9e4ab6c89109472082616f8d2f6ada7deaffe161
Author: Mark Brown <broonie@kernel.org>
Date: Tue, 19 Apr 2022 12:22:19 +0100

    As for SVE provide a prctl() interface which allows processes to
    configure their SME vector length.
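
A userspace sketch of how such an interface might be called, assuming it mirrors the SVE one as the message says. The helper names are hypothetical, and the fallback constants are the values I believe linux/prctl.h uses; on hardware or kernels without SME the calls are expected to fail with EINVAL.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older headers (assumed values, see the note above). */
#ifndef PR_SME_SET_VL
#define PR_SME_SET_VL 63
#endif
#ifndef PR_SME_GET_VL
#define PR_SME_GET_VL 64
#endif
#ifndef PR_SME_VL_LEN_MASK
#define PR_SME_VL_LEN_MASK 0xffff
#endif

/* Extract the vector length in bytes from a successful prctl() result;
 * the upper bits carry flags rather than length. */
static int sme_vl_bytes(int prctl_ret)
{
	return prctl_ret & PR_SME_VL_LEN_MASK;
}

/* Current SME vector length in bytes, or a negative errno. */
static int sme_get_vl(void)
{
	int ret = prctl(PR_SME_GET_VL, 0UL, 0UL, 0UL, 0UL);

	return ret < 0 ? -errno : sme_vl_bytes(ret);
}

/* Request a vector length; returns the length now in effect in bytes,
 * or a negative errno. */
static int sme_set_vl(unsigned long vl_bytes)
{
	int ret = prctl(PR_SME_SET_VL, vl_bytes & PR_SME_VL_LEN_MASK,
			0UL, 0UL, 0UL);

	return ret < 0 ? -errno : sme_vl_bytes(ret);
}
```

Callers should compare the returned length with the requested one rather than assume the request was honoured exactly.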

    Signed-off-by: Mark Brown <broonie@kernel.org>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Link: https://lore.kernel.org/r/20220419112247.711548-12-broonie@kernel.org
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

Signed-off-by: Mark Salter <msalter@redhat.com>
2023-01-28 11:34:52 -05:00
Chris von Recklinghausen 71958349db mm: refactor vm_area_struct::anon_vma_name usage code
Bugzilla: https://bugzilla.redhat.com/2120352

commit 5c26f6ac9416b63d093e29c30e79b3297e425472
Author: Suren Baghdasaryan <surenb@google.com>
Date:   Fri Mar 4 20:28:51 2022 -0800

    mm: refactor vm_area_struct::anon_vma_name usage code

    Avoid mixing strings and their anon_vma_name referenced pointers by
    using struct anon_vma_name whenever possible.  This simplifies the code
    and allows easier sharing of anon_vma_name structures when they
    represent the same name.

    [surenb@google.com: fix comment]

    Link: https://lkml.kernel.org/r/20220223153613.835563-1-surenb@google.com
    Link: https://lkml.kernel.org/r/20220224231834.1481408-1-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Suggested-by: Matthew Wilcox <willy@infradead.org>
    Suggested-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Colin Cross <ccross@google.com>
    Cc: Sumit Semwal <sumit.semwal@linaro.org>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: Alexey Gladkov <legion@kernel.org>
    Cc: Sasha Levin <sashal@kernel.org>
    Cc: Chris Hyser <chris.hyser@oracle.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Peter Collingbourne <pcc@google.com>
    Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Cyrill Gorcunov <gorcunov@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:46 -04:00
Chris von Recklinghausen 70649ff1fb mm: add a field to store names for private anonymous memory
Bugzilla: https://bugzilla.redhat.com/2120352

commit 9a10064f5625d5572c3626c1516e0bebc6c9fe9b
Author: Colin Cross <ccross@google.com>
Date:   Fri Jan 14 14:05:59 2022 -0800

    mm: add a field to store names for private anonymous memory

    In many userspace applications, and especially in VM based applications
    like Android uses heavily, there are multiple different allocators in
    use.  At a minimum there is libc malloc and the stack, and in many cases
    there are libc malloc, the stack, direct syscalls to mmap anonymous
    memory, and multiple VM heaps (one for small objects, one for big
    objects, etc.).  Each of these layers usually has its own tools to
    inspect its usage; malloc by compiling a debug version, the VM through
    heap inspection tools, and for direct syscalls there is usually no way
    to track them.

    On Android we heavily use a set of tools that use an extended version of
    the logic covered in Documentation/vm/pagemap.txt to walk all pages
    mapped in userspace and slice their usage by process, shared (COW) vs.
    unique mappings, backing, etc.  This can account for real physical
    memory usage even in cases like fork without exec (which Android uses
    heavily to share as many private COW pages as possible between
    processes), Kernel SamePage Merging, and clean zero pages.  It produces
    a measurement of the pages that only exist in that process (USS, for
    unique), and a measurement of the physical memory usage of that process
    with the cost of shared pages being evenly split between processes that
    share them (PSS).

    If all anonymous memory is indistinguishable then figuring out the real
    physical memory usage (PSS) of each heap requires either a pagemap
    walking tool that can understand the heap debugging of every layer, or
    for every layer's heap debugging tools to implement the pagemap walking
    logic, in which case it is hard to get a consistent view of memory
    across the whole system.

    Tracking the information in userspace leads to all sorts of problems.
    It either needs to be stored inside the process, which means every
    process has to have an API to export its current heap information upon
    request, or it has to be stored externally in a filesystem that somebody
    needs to clean up on crashes.  It needs to be readable while the process
    is still running, so it has to have some sort of synchronization with
    every layer of userspace.  Efficiently tracking the ranges requires
    reimplementing something like the kernel vma trees, and linking to it
    from every layer of userspace.  It requires more memory, more syscalls,
    more runtime cost, and more complexity to separately track regions that
    the kernel is already tracking.

    This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
    userspace-provided name for anonymous vmas.  The names of named
    anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
    [anon:<name>].

    Userspace can set the name for a region of memory by calling

       prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)

    Setting the name to NULL clears it.  The name length limit is 80 bytes
    including NUL-terminator and is checked to contain only printable ascii
    characters (including space), except '[',']','\','$' and '`'.
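
A userspace sketch of the call above with a pre-check that mirrors the stated rules. The helper names and ANON_NAME_MAX are hypothetical; the fallback constants are the values I believe linux/prctl.h uses. The kernel performs its own validation, so the pre-check only gives an earlier, local error.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>

/* Fallbacks for older headers (assumed values, see the note above). */
#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME 0
#endif

#define ANON_NAME_MAX 80	/* bytes, including the NUL terminator */

/* Returns 1 if the name satisfies the documented rules, 0 otherwise. */
static int anon_name_is_valid(const char *name)
{
	const char *p;

	if (name == NULL)
		return 1;	/* NULL clears the name */
	if (strlen(name) >= ANON_NAME_MAX)
		return 0;
	for (p = name; *p; p++) {
		unsigned char c = (unsigned char)*p;

		if (c < 0x20 || c > 0x7e)	/* printable ASCII only */
			return 0;
		if (strchr("[]\\$`", c))	/* explicitly excluded characters */
			return 0;
	}
	return 1;
}

/* Name (or, with name == NULL, un-name) an anonymous region.
 * Returns 0 on success or a negative errno. */
static int name_anon_region(void *start, size_t len, const char *name)
{
	if (!anon_name_is_valid(name))
		return -EINVAL;
	if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, (unsigned long)start,
		  (unsigned long)len, (unsigned long)name) == -1)
		return -errno;
	return 0;
}
```

An allocator could call name_anon_region() right after mmap() so that its arenas show up as [anon:<name>] in /proc/pid/maps.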

    Ascii strings are being used to have descriptive identifiers for vmas,
    which can be understood by the users reading /proc/pid/maps or
    /proc/pid/smaps.  Names can be standardized for a given system and they
    can include some variable parts such as the name of the allocator or a
    library, tid of the thread using it, etc.

    The name is stored in a pointer in the shared union in vm_area_struct
    that points to a null terminated string.  Anonymous vmas that have the
    same name (equivalent strings) and are otherwise mergeable will be merged.
    The name pointers are not shared between vmas even if they contain the
    same name.  The name pointer is stored in a union with fields that are
    only used on file-backed mappings, so it does not increase memory usage.

    CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
    feature.  It keeps the feature disabled by default to prevent any
    additional memory overhead and to avoid confusing procfs parsers on
    systems which are not ready to support named anonymous vmas.

    The patch is based on the original patch developed by Colin Cross, more
    specifically on its latest version [1] posted upstream by Sumit Semwal.
    It used a userspace pointer to store vma names.  In that design, name
    pointers could be shared between vmas.  However during the last
    upstreaming attempt, Kees Cook raised concerns [2] about this approach
    and suggested to copy the name into kernel memory space, perform
    validity checks [3] and store as a string referenced from
    vm_area_struct.

    One big concern is about fork() performance which would need to strdup
    anonymous vma names.  Dave Hansen suggested experimenting with
    worst-case scenario of forking a process with 64k vmas having longest
    possible names [4].  I ran this experiment on an ARM64 Android device
    and recorded a worst-case regression of almost 40% when forking such a
    process.

    This regression is addressed in the followup patch which replaces the
    pointer to a name with a refcounted structure that allows sharing the
    name pointer between vmas of the same name.  Instead of duplicating the
    string during fork() or when splitting a vma it increments the refcount.

    [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
    [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
    [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
    [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/

    Changes for prctl(2) manual page (in the options section):

    PR_SET_VMA
            Sets an attribute specified in arg2 for virtual memory areas
            starting from the address specified in arg3 and spanning the
            size specified in arg4. arg5 specifies the value of the attribute
            to be set. Note that assigning an attribute to a virtual memory
            area might prevent it from being merged with adjacent virtual
            memory areas due to the difference in that attribute's value.

            Currently, arg2 must be one of:

            PR_SET_VMA_ANON_NAME
                    Set a name for anonymous virtual memory areas. arg5 should
                    be a pointer to a null-terminated string containing the
                    name. The name length including null byte cannot exceed
                    80 bytes. If arg5 is NULL, the name of the appropriate
                    anonymous virtual memory areas will be reset. The name
                    can contain only printable ASCII characters (including
                    space), except '[', ']', '\', '$' and '`'.

                    This feature is available only if the kernel is built with
                    the CONFIG_ANON_VMA_NAME option enabled.

    [surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
      Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
    [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
     added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
     work here was done by Colin Cross, therefore, with his permission, keeping
     him as the author]

    Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
    Signed-off-by: Colin Cross <ccross@google.com>
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Cyrill Gorcunov <gorcunov@openvz.org>
    Cc: Dave Hansen <dave.hansen@intel.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Jan Glauber <jan.glauber@gmail.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: John Stultz <john.stultz@linaro.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rob Landley <rob@landley.net>
    Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
    Cc: Shaohua Li <shli@fusionio.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2022-10-12 07:27:38 -04:00
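The naming rules quoted in the PR_SET_VMA_ANON_NAME manual page text above can be checked in userspace before calling prctl(2). The following is a minimal sketch, not the kernel's implementation: the helper names anon_name_is_valid() and name_anon_region() are hypothetical, and the fallback constant values are assumed to match the upstream uapi header for builds against older headers. On a kernel without CONFIG_ANON_VMA_NAME the prctl() call itself is expected to fail.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>

/* Fallbacks for older headers (assumed to match the upstream uapi values). */
#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME 0
#endif

#define ANON_NAME_MAX 80 /* maximum length, including the null byte */

/* Hypothetical helper: apply the documented rules to a candidate name. */
static bool anon_name_is_valid(const char *name)
{
    size_t len = 0;

    /* Reject names that do not fit in 80 bytes including the terminator. */
    while (name[len] != '\0') {
        if (++len >= ANON_NAME_MAX)
            return false;
    }
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)name[i];

        if (c < 0x20 || c > 0x7e)   /* printable ASCII only, space allowed */
            return false;
        if (strchr("[]\\$`", c))    /* the five excluded characters */
            return false;
    }
    return true;
}

/* Hypothetical wrapper: name (or, with name == NULL, reset) a region. */
static int name_anon_region(void *addr, size_t len, const char *name)
{
    if (name != NULL && !anon_name_is_valid(name)) {
        errno = EINVAL;
        return -1;
    }
    return prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                 (unsigned long)addr, (unsigned long)len,
                 (unsigned long)name);
}
```

A caller would typically mmap() an anonymous region, pass its address and length to name_anon_region(), and then look for the name in /proc/PID/maps.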
Phil Auld 0f4a42c9d5 prlimit: do not grab the tasklist_lock
Bugzilla: https://bugzilla.redhat.com/2078906

commit 18c91bb2d87268d23868bf13508f5bc9cf04e89a
Author: Barret Rhoden <brho@google.com>
Date:   Thu Jan 6 12:20:41 2022 -0500

    prlimit: do not grab the tasklist_lock

    Unnecessarily grabbing the tasklist_lock can be a scalability bottleneck
    for workloads that also must grab the tasklist_lock for waiting,
    killing, and cloning.

    The tasklist_lock was grabbed to protect tsk->sighand from disappearing
    (becoming NULL).  tsk->signal was already protected by holding a
    reference to tsk.

    update_rlimit_cpu() assumed tsk->sighand != NULL.  With this commit, it
    attempts to lock_task_sighand().  However, this means that
    update_rlimit_cpu() can fail.  This only happens when a task is exiting.
    Note that during exec, sighand may *change*, but it will not be NULL.

    Prior to this commit, do_prlimit() ensured that update_rlimit_cpu()
    would not fail by read locking the tasklist_lock and checking tsk->sighand
    != NULL.

    If update_rlimit_cpu() fails, there may be other tasks that are not
    exiting that share tsk->signal.  However, the group_leader is the last
    task to be released, so if we cannot update_rlimit_cpu(group_leader),
    then the entire process is exiting.

    The only other caller of update_rlimit_cpu() is
    selinux_bprm_committing_creds().  It has tsk == current, so
    update_rlimit_cpu() cannot fail (current->sighand cannot disappear
    until current exits).

    This change resulted in a 14% speedup on a microbenchmark where parents
    kill and wait on their children, and children getpriority, setpriority,
    and getrlimit.

    Signed-off-by: Barret Rhoden <brho@google.com>
    Link: https://lkml.kernel.org/r/20220106172041.522167-4-brho@google.com
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-06-01 13:54:12 -04:00
Phil Auld 4b277ff048 prlimit: make do_prlimit() static
Bugzilla: https://bugzilla.redhat.com/2078906

commit c57bef0287dd5deeabaea5727950559fb9037cd9
Author: Barret Rhoden <brho@google.com>
Date:   Thu Jan 6 12:20:40 2022 -0500

    prlimit: make do_prlimit() static

    There are no other callers in the kernel.

    Fixed up a comment format and whitespace issue when moving do_prlimit()
    higher in sys.c.

    Signed-off-by: Barret Rhoden <brho@google.com>
    Link: https://lkml.kernel.org/r/20220106172041.522167-3-brho@google.com
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

Signed-off-by: Phil Auld <pauld@redhat.com>
2022-06-01 13:54:12 -04:00
Alexey Gladkov 7aadcd187c ucounts: Move RLIMIT_NPROC handling after set_user
Bugzilla: https://bugzilla.redhat.com/2061724
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit c923a8e7edb010da67424077cbf1a6f1396ebd2e
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Mon Feb 14 09:40:25 2022 -0600

    ucounts: Move RLIMIT_NPROC handling after set_user

    During set*id(), which cred->ucounts to charge the current process
    to is not known until after set_cred_ucounts.  So move the
    RLIMIT_NPROC checking into a new helper flag_nproc_exceeded and call
    flag_nproc_exceeded after set_cred_ucounts.

    This is very much an arbitrary subset of the places where we currently
    change the RLIMIT_NPROC accounting, designed to preserve the existing
    logic.

    Fixing the existing logic will be the subject of another series of
    changes.

    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20220216155832.680775-4-ebiederm@xmission.com
    Fixes: 21d1c5e386 ("Reimplement RLIMIT_NPROC on top of ucounts")
    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Signed-off-by: Alexey Gladkov <agladkov@redhat.com>
2022-03-08 13:25:22 +01:00
Rafael Aquini 8a0e26470c kernel/fork: factor out replacing the current MM exe_file
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2023396

This patch is a backport of the following upstream commit:
commit 35d7bdc86031a2c1ae05ac27dfa93b2acdcbaecc
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Apr 23 10:20:25 2021 +0200

    kernel/fork: factor out replacing the current MM exe_file

    Let's factor the main logic out into replace_mm_exe_file(), such that
    all mm->exe_file logic is contained in kernel/fork.c.

    While at it, perform some simple cleanups that are possible now that
    we're simplifying the individual functions.

    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Acked-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: David Hildenbrand <david@redhat.com>

Signed-off-by: Rafael Aquini <aquini@redhat.com>
2021-11-29 11:40:30 -05:00
Linus Torvalds c54b245d01 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace rlimit handling update from Eric Biederman:
 "This is the work mainly by Alexey Gladkov to limit rlimits to the
  rlimits of the user that created a user namespace, and to allow users
  to have stricter limits on the resources created within a user
  namespace."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  cred: add missing return error code when set_cred_ucounts() failed
  ucounts: Silence warning in dec_rlimit_ucounts
  ucounts: Set ucount_max to the largest positive value the type can hold
  kselftests: Add test to check for rlimit changes in different user namespaces
  Reimplement RLIMIT_MEMLOCK on top of ucounts
  Reimplement RLIMIT_SIGPENDING on top of ucounts
  Reimplement RLIMIT_MSGQUEUE on top of ucounts
  Reimplement RLIMIT_NPROC on top of ucounts
  Use atomic_t for ucounts reference counting
  Add a reference to ucounts for each cred
  Increase size of ucounts to atomic_long_t
2021-06-28 20:39:26 -07:00
Chris Hyser 7ac592aa35 sched: prctl() core-scheduling interface
This patch provides support for setting and copying core scheduling
'task cookies' between threads (PID), processes (TGID), and process
groups (PGID).

The value of core scheduling isn't that tasks don't share a core,
'nosmt' can do that. The value lies in exploiting all the sharing
opportunities that exist to recover possible lost performance and that
requires a degree of flexibility in the API.

From a security perspective (and there are others), the thread,
process and process group distinction is an existent hierarchal
categorization of tasks that reflects many of the security concerns
about 'data sharing'. For example, protecting against cache-snooping
by a thread that can just read the memory directly isn't all that
useful.

With this in mind, subcommands to CREATE/SHARE (TO/FROM) provide a
mechanism to create and share cookies. CREATE/SHARE_TO specify a
target pid with enum pidtype used to specify the scope of the targeted
tasks. For example, PIDTYPE_TGID will share the cookie with the
process and all of its threads as typically desired in a security
scenario.

API:

  prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tgtpid, pidtype, &cookie)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, tgtpid, pidtype, NULL)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, tgtpid, pidtype, NULL)
  prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, srcpid, pidtype, NULL)

where 'tgtpid/srcpid == 0' implies the current process and pidtype is
kernel enum pid_type {PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID, ...}.

For return values, EINVAL, ENOMEM are what they say. ESRCH means the
tgtpid/srcpid was not found. EPERM indicates lack of PTRACE permission
access to tgtpid/srcpid. ENODEV indicates your machine lacks SMT.

[peterz: complete rewrite]
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Don Hiatt <dhiatt@digitalocean.com>
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210422123309.039845339@infradead.org
2021-05-12 11:43:31 +02:00
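The API listed in the core-scheduling commit above can be exercised from userspace roughly as follows. This is a sketch under stated assumptions: the fallback constant values are assumed to match the upstream uapi header, the SCOPE_* enum merely mirrors the first three values of the kernel's enum pid_type, and core_cookie_create_and_get() is a hypothetical helper name. On machines without SMT or kernels without core scheduling the call is expected to fail with one of the errors the commit message lists.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older headers (assumed to match the upstream uapi values). */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE            62
#define PR_SCHED_CORE_GET        0
#define PR_SCHED_CORE_CREATE     1
#define PR_SCHED_CORE_SHARE_TO   2
#define PR_SCHED_CORE_SHARE_FROM 3
#endif

/* Mirrors PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_PGID from the commit text. */
enum { SCOPE_THREAD = 0, SCOPE_THREAD_GROUP = 1, SCOPE_PROCESS_GROUP = 2 };

/*
 * Hypothetical helper: create a fresh cookie for the calling thread
 * (pid 0 means "current") and read it back.
 * Returns 0 on success or a negative errno value.
 */
static int core_cookie_create_and_get(unsigned long long *cookie)
{
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0UL,
              (unsigned long)SCOPE_THREAD, 0UL) != 0)
        return -errno;

    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0UL,
              (unsigned long)SCOPE_THREAD, (unsigned long)cookie) != 0)
        return -errno;

    return 0;
}
```

Passing SCOPE_THREAD_GROUP to the CREATE call instead would cover the process and all of its threads, which is the case the commit message describes as typical for a security scenario.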
Xiaofeng Cao 5afe69c2cc kernel/sys.c: fix typo
change 'infite'     to 'infinite'
change 'concurent'  to 'concurrent'
change 'memvers'    to 'members'
change 'decendants' to 'descendants'
change 'argumets'   to 'arguments'

Link: https://lkml.kernel.org/r/20210316112904.10661-1-cxfcosmos@gmail.com
Signed-off-by: Xiaofeng Cao <caoxiaofeng@yulong.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 00:26:34 -07:00
Alexey Gladkov 21d1c5e386 Reimplement RLIMIT_NPROC on top of ucounts
The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

To illustrate the impact of rlimits, let's say there is a program that
does not fork. Some service-A wants to run this program as user X in
multiple containers. Since the program never forks, the service wants to
set RLIMIT_NPROC=1.

service-A
 \- program (uid=1000, container1, rlimit_nproc=1)
 \- program (uid=1000, container2, rlimit_nproc=1)

The service-A sets RLIMIT_NPROC=1 and runs the program in container1.
When the service-A tries to run a program with RLIMIT_NPROC=1 in
container2 it fails since user X already has one running process.

We cannot use existing inc_ucounts / dec_ucounts because they do not
allow us to exceed the maximum for the counter. Some rlimits can be
overlimited by root or if the user has the appropriate capability.

Changelog

v11:
* Change inc_rlimit_ucounts() which now returns top value of ucounts.
* Drop inc_rlimit_ucounts_and_test() because the return code of
  inc_rlimit_ucounts() can be checked.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/c5286a8aa16d2d698c222f7532f3d735c82bc6bc.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-04-30 14:14:01 -05:00
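The accounting described in the RLIMIT_NPROC commit above can be modelled in userspace as a chain of per-namespace counters. This is a simplified sketch, not the kernel code: struct ns_counter and inc_rlimit_model() are hypothetical names, plain longs stand in for the kernel's atomic counters, and the return convention (the leaf's new count, or LONG_MAX if any level is over its maximum) is an assumption based on the v11 changelog note. The charge is always applied, so a privileged caller can decide to keep it even when a limit is exceeded.

```c
#include <limits.h>
#include <stddef.h>

/* One counter per user namespace level; parent is NULL at the root. */
struct ns_counter {
    struct ns_counter *parent;
    long count;   /* current charge at this level */
    long max;     /* limit that applies at this level */
};

/*
 * Charge v at the leaf and at every ancestor. Unlike a strict counter,
 * the charge is never refused here; the caller inspects the result.
 */
static long inc_rlimit_model(struct ns_counter *leaf, long v)
{
    long ret = 0;

    for (struct ns_counter *it = leaf; it != NULL; it = it->parent) {
        it->count += v;
        if (it->count < 0 || it->count > it->max)
            ret = LONG_MAX;          /* some level is over its limit */
        else if (it == leaf)
            ret = it->count;         /* report the leaf's new value */
    }
    return ret;
}
```

With one counter per container under a shared root, two containers that each set a maximum of 1 can both start one process, which is the service-A scenario from the commit message.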
Alexey Gladkov 905ae01c4a Add a reference to ucounts for each cred
For RLIMIT_NPROC and some other rlimits the user_struct that holds the
global limit is kept alive for the lifetime of a process by keeping it
in struct cred. Adding a pointer to ucounts in the struct cred will
make it possible to track RLIMIT_NPROC not only per user in the system,
but also per user in the user_namespace.

Updating ucounts may require memory allocation which may fail. So, we
cannot change cred.ucounts in the commit_creds() because this function
cannot fail and it should always return 0. For this reason, we modify
cred.ucounts before calling the commit_creds().

Changelog

v6:
* Fix null-ptr-deref in is_ucounts_overlimit() detected by trinity. This
  error was caused by the fact that cred_alloc_blank() left the ucounts
  pointer empty.

Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/b37aaef28d8b9b0d757e07ba6dd27281bbe39259.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-04-30 14:14:00 -05:00
Peter Collingbourne 201698626f arm64: Introduce prctl(PR_PAC_{SET,GET}_ENABLED_KEYS)
This change introduces a prctl that allows the user program to control
which PAC keys are enabled in a particular task. The main reason
why this is useful is to enable a userspace ABI that uses PAC to
sign and authenticate function pointers and other pointers exposed
outside of the function, while still allowing binaries conforming
to the ABI to interoperate with legacy binaries that do not sign or
authenticate pointers.

The idea is that a dynamic loader or early startup code would issue
this prctl very early after establishing that a process may load legacy
binaries, but before executing any PAC instructions.

This change adds a small amount of overhead to kernel entry and exit
due to additional required instruction sequences.

On a DragonBoard 845c (Cortex-A75) with the powersave governor, the
overhead of similar instruction sequences was measured as 4.9ns when
simulating the common case where IA is left enabled, or 43.7ns when
simulating the uncommon case where IA is disabled. These numbers can
be seen as the worst case scenario, since in more realistic scenarios
a better performing governor would be used and a newer chip would be
used that would support PAC unlike Cortex-A75 and would be expected
to be faster than Cortex-A75.

On an Apple M1 under a hypervisor, the overhead of the entry/exit
instruction sequences introduced by this patch was measured as 0.3ns
in the case where IA is left enabled, and 33.0ns in the case where
IA is disabled.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://linux-review.googlesource.com/id/Ibc41a5e6a76b275efbaa126b31119dc197b927a5
Link: https://lore.kernel.org/r/d6609065f8f40397a4124654eb68c9f490b4d477.1616123271.git.pcc@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-04-13 17:31:44 +01:00
Alexey Dobriyan c995f12ad8 prctl: fix PR_SET_MM_AUXV kernel stack leak
Doing a

	prctl(PR_SET_MM, PR_SET_MM_AUXV, addr, 1);

will copy 1 byte from userspace to (quite big) on-stack array
and then stash everything to mm->saved_auxv.
AT_NULL terminator will be inserted at the very end.

/proc/*/auxv handler will find that AT_NULL terminator
and copy original stack contents to userspace.

This devious scheme requires CAP_SYS_RESOURCE.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-03-14 14:33:27 -07:00
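The leak described in the PR_SET_MM_AUXV commit above comes from a short copy into an uninitialized on-stack array that is then saved in full. The commit message does not show the fix, so the following userspace model only illustrates one way to close such a hole, by zero-initializing the temporary buffer. All names are hypothetical, and AUXV_WORDS is an arbitrary small size chosen for the example.

```c
#include <stddef.h>
#include <string.h>

#define AUXV_WORDS 8 /* hypothetical size for this model */

/*
 * Model of "copy len bytes from the caller, then stash the whole array".
 * Because tmp starts zeroed, bytes beyond len never carry stale stack
 * contents into the saved copy.
 */
static int save_auxv_model(unsigned long saved[AUXV_WORDS],
                           const void *user, size_t len)
{
    unsigned long tmp[AUXV_WORDS] = {0};

    if (len > sizeof(tmp))
        return -1;

    memcpy(tmp, user, len);

    /* Terminator pair at the very end, as the commit message describes. */
    tmp[AUXV_WORDS - 2] = 0;
    tmp[AUXV_WORDS - 1] = 0;

    memcpy(saved, tmp, sizeof(tmp));
    return 0;
}
```

Without the initializer on tmp, a one-byte copy would leave the remaining bytes holding whatever was previously on the stack, and a later reader of the saved array would see them.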
Linus Torvalds 6fbd6cf85a Kbuild updates for v5.12
Merge tag 'kbuild-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - Fix false-positive build warnings for ARCH=ia64 builds

 - Optimize dictionary size for module compression with xz

 - Check the compiler and linker versions in Kconfig

 - Fix misuse of extra-y

 - Support DWARF v5 debug info

 - Clamp SUBLEVEL to 255 because stable releases 4.4.x and 4.9.x
   exceeded the limit

 - Add generic syscall{tbl,hdr}.sh for cleanups across arches

 - Minor cleanups of genksyms

 - Minor cleanups of Kconfig

* tag 'kbuild-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (38 commits)
  initramfs: Remove redundant dependency of RD_ZSTD on BLK_DEV_INITRD
  kbuild: remove deprecated 'always' and 'hostprogs-y/m'
  kbuild: parse C= and M= before changing the working directory
  kbuild: reuse this-makefile to define abs_srctree
  kconfig: unify rule of config, menuconfig, nconfig, gconfig, xconfig
  kconfig: omit --oldaskconfig option for 'make config'
  kconfig: fix 'invalid option' for help option
  kconfig: remove dead code in conf_askvalue()
  kconfig: clean up nested if-conditionals in check_conf()
  kconfig: Remove duplicate call to sym_get_string_value()
  Makefile: Remove # characters from compiler string
  Makefile: reuse CC_VERSION_TEXT
  kbuild: check the minimum linker version in Kconfig
  kbuild: remove ld-version macro
  scripts: add generic syscallhdr.sh
  scripts: add generic syscalltbl.sh
  arch: syscalls: remove $(srctree)/ prefix from syscall tables
  arch: syscalls: add missing FORCE and fix 'targets' to make if_changed work
  gen_compile_commands: prune some directories
  kbuild: simplify access to the kernel's version
  ...
2021-02-25 10:17:31 -08:00
Linus Torvalds 7d6beb71da idmapped-mounts-v5.12
Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull idmapped mounts from Christian Brauner:
 "This introduces idmapped mounts which has been in the making for some
  time. Simply put, different mounts can expose the same file or
  directory with different ownership. This initial implementation comes
  with ports for fat, ext4 and with Christoph's port for xfs with more
  filesystems being actively worked on by independent people and
  maintainers.

  Idmapping mounts handle a wide range of long standing use-cases. Here
  are just a few:

   - Idmapped mounts make it possible to easily share files between
     multiple users or multiple machines especially in complex
     scenarios. For example, idmapped mounts will be used in the
     implementation of portable home directories in
     systemd-homed.service(8) where they allow users to move their home
     directory to an external storage device and use it on multiple
     computers where they are assigned different uids and gids. This
     effectively makes it possible to assign random uids and gids at
     login time.

   - It is possible to share files from the host with unprivileged
     containers without having to change ownership permanently through
     chown(2).

   - It is possible to idmap a container's rootfs without having to
     mangle every file. For example, Chromebooks use it to share the
     user's Download folder with their unprivileged containers in their
     Linux subsystem.

   - It is possible to share files between containers with
     non-overlapping idmappings.

   - Filesystems that lack a proper concept of ownership such as fat can
     use idmapped mounts to implement discretionary access (DAC)
     permission checking.

   - They allow users to efficiently change ownership on a per-mount
     basis without having to (recursively) chown(2) all files. In
     contrast to chown(2), changing ownership of large sets of files is
     instantaneous with idmapped mounts. This is especially useful when
     ownership of a whole root filesystem of a virtual machine or
     container is changed. With idmapped mounts a single
     mount_setattr syscall will be sufficient to change the ownership of
     all files.

   - Idmapped mounts always take the current ownership into account as
     idmappings specify what a given uid or gid is supposed to be mapped
     to. This contrasts with the chown(2) syscall which cannot by itself
     take the current ownership of the files it changes into account. It
     simply changes the ownership to the specified uid and gid. This is
     especially problematic when recursively chown(2)ing a large set of
     files which is common with the aforementioned portable home
     directory and container and vm scenario.

   - Idmapped mounts allow changing ownership locally, restricting it
     to specific mounts, and temporarily as the ownership changes only
     apply as long as the mount exists.

  Several userspace projects have either already put up patches and
  pull-requests for this feature or will do so should you decide to pull
  this:

   - systemd: In a wide variety of scenarios but especially right away
     in their implementation of portable home directories.

         https://systemd.io/HOME_DIRECTORY/

   - container runtimes: containerd, runC, LXD: To share data between
     host and unprivileged containers, unprivileged and privileged
     containers, etc. The pull request for idmapped mounts support in
     containerd, the default Kubernetes runtime, has been up for quite
     a while now: https://github.com/containerd/containerd/pull/4734

   - The virtio-fs developers and several users have expressed interest
     in using this feature with virtual machines once virtio-fs is
     ported.

   - ChromeOS: Sharing host-directories with unprivileged containers.

  I've tightly synced with all those projects and all of those listed
  here have also expressed their need/desire for this feature on the
  mailing list. For more info on how people use this there's a bunch of
  talks about this too. Here's just two recent ones:

      https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
      https://fosdem.org/2021/schedule/event/containers_idmap/

  This comes with an extensive xfstests suite covering both ext4 and
  xfs:

      https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

  It covers truncation, creation, opening, xattrs, vfscaps, setid
  execution, setgid inheritance and more both with idmapped and
  non-idmapped mounts. It already helped to discover an unrelated xfs
  setgid inheritance bug which has since been fixed in mainline. It will
  be sent for inclusion with the xfstests project should you decide to
  merge this.

  In order to support per-mount idmappings vfsmounts are marked with
  user namespaces. The idmapping of the user namespace will be used to
  map the ids of vfs objects when they are accessed through that mount.
  By default all vfsmounts are marked with the initial user namespace.
  The initial user namespace is used to indicate that a mount is not
  idmapped. All operations behave as before and this is verified in the
  testsuite.

  Based on prior discussions we want to attach the whole user namespace
  and not just a dedicated idmapping struct. This allows us to reuse all
  the helpers that already exist for dealing with idmappings instead of
  introducing a whole new range of helpers. In addition, if we decide in
  the future that we are confident enough to enable unprivileged users
  to setup idmapped mounts the permission checking can take into account
  whether the caller is privileged in the user namespace the mount is
  currently marked with.

  The user namespace the mount will be marked with can be specified by
  passing a file descriptor referring to the user namespace as an
  argument to the new mount_setattr() syscall together with the new
  MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
  of extensibility.

  The following conditions must be met in order to create an idmapped
  mount:

   - The caller must currently have the CAP_SYS_ADMIN capability in the
     user namespace the underlying filesystem has been mounted in.

   - The underlying filesystem must support idmapped mounts.

   - The mount must not already be idmapped. This also implies that the
     idmapping of a mount cannot be altered once it has been idmapped.

   - The mount must be a detached/anonymous mount, i.e. it must have
     been created by calling open_tree() with the OPEN_TREE_CLONE flag
     and it must not already have been visible in the filesystem.

  The last two points guarantee easier semantics for userspace and the
  kernel and make the implementation significantly simpler.

  By default vfsmounts are marked with the initial user namespace and no
  behavioral or performance changes are observed.

  The manpage with a detailed description can be found here:

      1d7b902e28

  In order to support idmapped mounts, filesystems need to be changed
  and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
  patches to convert individual filesystems are not very large or
  complicated overall as can be seen from the included fat, ext4, and
  xfs ports. Patches for other filesystems are actively worked on and
  will be sent out separately. The xfstestsuite can be used to verify
  that a port has been done correctly.

  The mount_setattr() syscall is motivated independent of the idmapped
  mounts patches and it's been around since July 2019. One of the most
  valuable features of the new mount api is the ability to perform
  mounts based on file descriptors only.

  Together with the lookup restrictions available in the openat2()
  RESOLVE_* flag namespace which we added in v5.6 this is the first time
  we are close to hardened and race-free (e.g. symlinks) mounting and
  path resolution.

  While userspace has started porting to the new mount api to mount
  proper filesystems and create new bind-mounts it is currently not
  possible to change mount options of an already existing bind mount in
  the new mount api since the mount_setattr() syscall is missing.

  With the addition of the mount_setattr() syscall we remove this last
  restriction and userspace can now fully port to the new mount api,
  covering every use-case the old mount api could. We also add the
  crucial ability to recursively change mount options for a whole mount
  tree, both removing and adding mount options at the same time. This
  syscall has been requested multiple times by various people and
  projects.

  There is a simple tool available at

      https://github.com/brauner/mount-idmapped

  that allows creating idmapped mounts so people can play with this
  patch series. I'll add support for the regular mount binary should you
  decide to pull this in the following weeks.

  Here's an example of a simple idmapped mount of another user's home
  directory:

	u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

	u1001@f2-vm:/$ ls -al /home/ubuntu/
	total 28
	drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
	drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
	-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
	-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
	-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
	-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
	-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ ls -al /mnt/
	total 28
	drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
	drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
	-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
	-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
	-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
	-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
	-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ touch /mnt/my-file

	u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

	u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

	u1001@f2-vm:/$ ls -al /mnt/my-file
	-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

	u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
	-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

	u1001@f2-vm:/$ getfacl /mnt/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: mnt/my-file
	# owner: u1001
	# group: u1001
	user::rw-
	user:u1001:rwx
	group::rw-
	mask::rwx
	other::r--

	u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: home/ubuntu/my-file
	# owner: ubuntu
	# group: ubuntu
	user::rw-
	user:ubuntu:rwx
	group::rw-
	mask::rwx
	other::r--"

* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
  xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
  xfs: support idmapped mounts
  ext4: support idmapped mounts
  fat: handle idmapped mounts
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add mount_setattr()
  fs: add attr_flags_to_mnt_flags helper
  fs: split out functions to hold writers
  namespace: only take read lock in do_reconfigure_mnt()
  mount: make {lock,unlock}_mount_hash() static
  namespace: take lock_mount_hash() directly when changing flags
  nfs: do not export idmapped mounts
  overlayfs: do not mount on top of idmapped mounts
  ecryptfs: do not mount on top of idmapped mounts
  ima: handle idmapped mounts
  apparmor: handle idmapped mounts
  fs: make helpers idmap mount aware
  exec: handle idmapped mounts
  would_dump: handle idmapped mounts
  ...
2021-02-23 13:39:45 -08:00
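The effect of an idmapping such as the "both:1000:1001:1" argument in the example session above can be pictured as a lookup over (from, to, count) extents. The following toy model is only an illustration of that idea and makes several assumptions: struct id_extent and map_id() are hypothetical names, the direction of the mapping follows the listing in the example (on-disk id 1000 appears as 1001 through the mount), and ids outside every extent are reported as the conventional overflow id 65534. The kernel's real helpers operate on user namespaces and are not shown in the source.

```c
#include <stddef.h>
#include <stdint.h>

#define OVERFLOW_ID 65534u /* conventional "nobody" id, assumed here */

/* One contiguous range: ids [from, from + count) map to [to, to + count). */
struct id_extent {
    uint32_t from;
    uint32_t to;
    uint32_t count;
};

/* Translate id through the first extent that contains it. */
static uint32_t map_id(const struct id_extent *ext, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++) {
        if (id >= ext[i].from && id - ext[i].from < ext[i].count)
            return ext[i].to + (id - ext[i].from);
    }
    return OVERFLOW_ID; /* no mapping for this id */
}
```

With the single extent {1000, 1001, 1}, a file owned by uid 1000 is presented as owned by uid 1001, matching the ubuntu and u1001 columns in the two directory listings.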
Sasha Levin 88a686728b kbuild: simplify access to the kernel's version
Instead of storing the version in a single integer and having various
kernel (and userspace) code know how it's constructed, export individual
(major, patchlevel, sublevel) components and simplify kernel code that
uses it.

This should also make it easier on userspace.

Signed-off-by: Sasha Levin <sashal@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2021-02-16 12:01:45 +09:00
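The single-integer encoding that the commit above moves away from packs major, patchlevel and sublevel into one value with eight bits for each of the two lower fields. The sketch below shows why separate components are easier to work with and why the sublevel has to be clamped to 255, as noted in the Kbuild pull request for v5.12. The struct and function names are hypothetical; the commit message does not name the macros it exports.

```c
/* Hypothetical container for the three exported components. */
struct kver {
    unsigned int major;
    unsigned int patchlevel;
    unsigned int sublevel;
};

/*
 * Pack into the traditional (major << 16) + (patchlevel << 8) + sublevel
 * form. The sublevel is clamped so it cannot spill into the patchlevel.
 */
static unsigned int kver_pack(struct kver v)
{
    unsigned int sub = v.sublevel > 255 ? 255 : v.sublevel;

    return (v.major << 16) + (v.patchlevel << 8) + sub;
}

/* Recover the components; a clamped sublevel comes back as 255. */
static struct kver kver_unpack(unsigned int code)
{
    struct kver v = {
        .major = code >> 16,
        .patchlevel = (code >> 8) & 0xff,
        .sublevel = code & 0xff,
    };

    return v;
}
```

Code that receives the three components directly can compare them field by field and never needs to know the packing scheme.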
Viresh Kumar be65de6b03 fs: Remove dcookies support
The dcookies stuff was only used by the kernel's old oprofile code. Now
that oprofile's support is removed from the kernel, there is no need for
dcookies as well. Remove it.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Robert Richter <rric@kernel.org>
Acked-by: William Cohen <wcohen@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2021-01-29 10:06:46 +05:30
Christian Brauner 02f92b3868 fs: add file and path permissions helpers
Add two simple helpers to check permissions on a file and path
respectively and convert over some callers. It simplifies quite a few
codepaths and also reduces the churn in later patches quite a bit.
Christoph also correctly points out that this makes codepaths (e.g.
ioctls) way easier to follow that would otherwise have to do more
complex argument passing than necessary.

Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Gabriel Krisman Bertazi 1446e1df9e kernel: Implement selective syscall userspace redirection
Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS.  This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition.  This is particularly important for Windows
games running over Wine.

The proposed interface looks like this:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

The range [<offset>,<offset>+<length>) is a part of the process memory
map that is allowed to by-pass the redirection code and dispatch
syscalls directly, such that in fast paths a process doesn't need to
disable the trap, nor does the kernel have to check the selector.  This is
essential to return from SIGSYS to a blocked area without triggering
another SIGSYS from rt_sigreturn.

selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF to enable or disable the
redirection without calling the kernel.
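
The decision described above can be modelled in a few lines of userspace C. This is only a sketch of the rules as stated in this message, not the kernel implementation: the dispatch_cfg struct, the syscall_is_redirected() helper and the numeric selector values are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Selector states; the numeric values here are assumptions of this sketch. */
enum { DISPATCH_OFF = 0, DISPATCH_ON = 1 };

/* Hypothetical per-thread configuration, mirroring the prctl() arguments. */
struct dispatch_cfg {
	uintptr_t offset;               /* start of the always-allowed range */
	uintptr_t length;               /* size of the always-allowed range */
	const volatile char *selector;  /* optional key switch, may be NULL */
};

/* Would a syscall issued at instruction pointer 'ip' be turned into SIGSYS? */
static bool syscall_is_redirected(const struct dispatch_cfg *cfg, uintptr_t ip)
{
	/* Inside [offset, offset+length): dispatch directly, selector not read.
	 * Unsigned wrap-around makes ip < offset fail this test as well. */
	if (ip - cfg->offset < cfg->length)
		return false;

	/* Outside the range, the selector can switch redirection off. */
	if (cfg->selector && *cfg->selector == DISPATCH_OFF)
		return false;

	return true;
}
```

Because the selector is a plain byte in user memory, flipping it costs a single store rather than a system call, which is the point of the design.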

The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.

Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support.  I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism.  And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.

For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant.  The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access.  In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
2020-12-02 15:07:56 +01:00
Rasmus Villemoes 986b9eacb2 kernel/sys.c: fix prototype of prctl_get_tid_address()
tid_addr is not a "pointer to (pointer to int in userspace)"; it is in
fact a "pointer to (pointer to int in userspace) in userspace".  So
sparse rightfully complains about passing a kernel pointer to
put_user().

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-25 11:44:16 -07:00
Linus Torvalds 81ecf91eab SafeSetID changes for v5.10

Merge tag 'safesetid-5.10' of git://github.com/micah-morton/linux

Pull SafeSetID updates from Micah Morton:
 "The changes are mostly contained to within the SafeSetID LSM, with the
  exception of a few 1-line changes to change some ns_capable() calls to
  ns_capable_setid() -- causing a flag (CAP_OPT_INSETID) to be set that
  is examined by SafeSetID code and nothing else in the kernel.

  The changes to SafeSetID internally allow for setting up GID
  transition security policies, as already existed for UIDs"

* tag 'safesetid-5.10' of git://github.com/micah-morton/linux:
  LSM: SafeSetID: Fix warnings reported by test bot
  LSM: SafeSetID: Add GID security policy handling
  LSM: Signal to SafeSetID when setting group IDs
2020-10-25 10:45:26 -07:00
Liao Pingfang 15ec0fcff6 kernel/sys.c: replace do_brk with do_brk_flags in comment of prctl_set_mm_map()
Replace do_brk with do_brk_flags in comment of prctl_set_mm_map(), since
do_brk was removed in following commit.

Fixes: bb177a732c ("mm: do not bug_on on incorrect length in __mm_populate()")
Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/1600650751-43127-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:19 -07:00
Thomas Cedeno 111767c1d8 LSM: Signal to SafeSetID when setting group IDs
For SafeSetID to properly gate set*gid() calls, it needs to know whether
ns_capable() is being called from within a sys_set*gid() function or is
being called from elsewhere in the kernel. This allows SafeSetID to deny
CAP_SETGID to restricted groups when they are attempting to use the
capability for code paths other than updating GIDs (e.g. setting up
userns GID mappings). This is the identical approach to what is
currently done for CAP_SETUID.

NOTE: We also add signaling to SafeSetID from the setgroups() syscall,
as we have future plans to restrict a process' ability to set
supplementary groups in addition to what is added in this series for
restricting setting of the primary group.

Signed-off-by: Thomas Cedeno <thomascedeno@google.com>
Signed-off-by: Micah Morton <mortonm@chromium.org>
2020-10-13 09:17:34 -07:00
Gustavo A. R. Silva df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
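
For readers outside the kernel tree, a minimal userspace sketch of the pseudo-keyword follows. The macro mirrors the attribute-based definition with a no-op fallback; the steps_from_level() function is a hypothetical example and does not come from the patch.

```c
/* Userspace stand-in for the pseudo-keyword, with a no-op fallback for
 * compilers that lack the attribute. */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0) /* fallthrough */
#endif

/* Hypothetical example: each level also performs the steps of the levels
 * below it, so the cases deliberately fall through. */
static int steps_from_level(int level)
{
	int steps = 0;

	switch (level) {
	case 2:
		steps++;
		fallthrough;
	case 1:
		steps++;
		fallthrough;
	case 0:
		steps++;
		break;
	default:
		return -1;
	}
	return steps;
}
```

Unlike a comment, the statement form is visible to the compiler, so an accidental fall-through can still be diagnosed while the intentional ones are marked.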

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Nicolas Viennot 227175b2c9 prctl: exe link permission error changed from -EINVAL to -EPERM
This brings consistency with the rest of the prctl() syscall where
-EPERM is returned when failing a capability check.

Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Link: https://lore.kernel.org/r/20200719100418.2112740-7-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-19 20:14:42 +02:00
Nicolas Viennot ebd6de6812 prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
Originally, only a local CAP_SYS_ADMIN could change the exe link,
making it difficult for doing checkpoint/restore without CAP_SYS_ADMIN.
This commit adds CAP_CHECKPOINT_RESTORE in addition to CAP_SYS_ADMIN
for permitting changing the exe link.
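
The resulting rule can be sketched as a check on a capability bitmask. This is a userspace model only: may_change_exe_link() is a hypothetical name, the real check is made against the caller's user namespace, and the -EPERM return value reflects the follow-up change in the same series rather than this commit.

```c
#include <errno.h>
#include <stdint.h>

/* Capability numbers as defined in the Linux UAPI <linux/capability.h>. */
#define CAP_SYS_ADMIN          21
#define CAP_CHECKPOINT_RESTORE 40

/* Model: either capability is sufficient to change the exe link.
 * Returns 0 if allowed, -EPERM otherwise. */
static int may_change_exe_link(uint64_t effective_caps)
{
	uint64_t allowed = (UINT64_C(1) << CAP_SYS_ADMIN) |
			   (UINT64_C(1) << CAP_CHECKPOINT_RESTORE);

	return (effective_caps & allowed) ? 0 : -EPERM;
}
```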

The following describes the history of the /proc/self/exe permission
checks as it may be difficult to understand what decisions lead to this
point.

* [1] May 2012: This commit introduces the ability of changing
  /proc/self/exe if the user is CAP_SYS_RESOURCE capable.
  In the related discussion [2], no clear threat model is presented for
  what could happen if the /proc/self/exe changes multiple times, or why
  would the admin be at the mercy of userspace.

* [3] Oct 2014: This commit introduces a new API to change
  /proc/self/exe. The permission no longer checks for CAP_SYS_RESOURCE,
  but instead checks if the current user is root (uid=0) in its local
  namespace. In the related discussion [4] it is said that "Controlling
  exe_fd without privileges may turn out to be dangerous. At least
  things like tomoyo examine it for making policy decisions (see
  tomoyo_manager())."

* [5] Dec 2016: This commit removes the restriction to change
  /proc/self/exe at most once. The related discussion [6] informs that
  the audit subsystem relies on the exe symlink, presumably
  audit_log_d_path_exe() in kernel/audit.c.

* [7] May 2017: This commit changed the check from uid==0 to local
  CAP_SYS_ADMIN. No discussion.

* [8] July 2020: A PoC to spoof any program's /proc/self/exe via ptrace
  is demonstrated

Overall, the concrete points that were made to retain capability checks
around changing the exe symlink are that tomoyo_manager() and
audit_log_d_path_exe() use the exe_file path.

Christian Brauner said that relying on /proc/<pid>/exe being immutable (or
guarded by caps) for the sake of security is a bit misleading. It can only
be used as a hint without any guarantees of what code is being executed
once execve() returns to userspace. Christian suggested that in the
future, we could call audit_log() or similar to inform the admin of all
exe link changes, instead of attempting to provide security guarantees
via permission checks. However, this proposed change requires the
understanding of the security implications in the tomoyo/audit subsystems.

[1] b32dfe3771 ("c/r: prctl: add ability to set new mm_struct::exe_file")
[2] https://lore.kernel.org/patchwork/patch/292515/
[3] f606b77f1a ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
[4] https://lore.kernel.org/patchwork/patch/479359/
[5] 3fb4afd9a5 ("prctl: remove one-shot limitation for changing exe link")
[6] https://lore.kernel.org/patchwork/patch/697304/
[7] 4d28df6152 ("prctl: Allow local CAP_SYS_ADMIN changing exe_file")
[8] https://github.com/nviennot/run_as_exe

Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Link: https://lore.kernel.org/r/20200719100418.2112740-6-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-19 20:14:42 +02:00
Linus Torvalds 4a87b197c1 Add additional LSM hooks for SafeSetID
SafeSetID is capable of making allow/deny decisions for set*uid calls
 on a system, and we want to add similar functionality for set*gid
 calls. The work to do that is not yet complete, so probably won't make
 it in for v5.8, but we are looking to get this simple patch in for
 v5.8 since we have it ready. We are planning on the rest of the work
 for extending the SafeSetID LSM being merged during the v5.9 merge
 window.
 
 This patch was sent to the security mailing list and there were no objections.

Merge tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux

Pull SafeSetID update from Micah Morton:
 "Add additional LSM hooks for SafeSetID

  SafeSetID is capable of making allow/deny decisions for set*uid calls
  on a system, and we want to add similar functionality for set*gid
  calls.

  The work to do that is not yet complete, so probably won't make it in
  for v5.8, but we are looking to get this simple patch in for v5.8
  since we have it ready.

  We are planning on the rest of the work for extending the SafeSetID
  LSM being merged during the v5.9 merge window"

* tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux:
  security: Add LSM hooks to set*gid syscalls
2020-06-14 11:39:31 -07:00
Thomas Cedeno 39030e1351 security: Add LSM hooks to set*gid syscalls
The SafeSetID LSM uses the security_task_fix_setuid hook to filter
set*uid() syscalls according to its configured security policy. In
preparation for adding analogous support in the LSM for set*gid()
syscalls, we add the requisite hook here. Tested by putting print
statements in the security_task_fix_setgid hook and seeing them get hit
during kernel boot.

Signed-off-by: Thomas Cedeno <thomascedeno@google.com>
Signed-off-by: Micah Morton <mortonm@chromium.org>
2020-06-14 10:52:02 -07:00
Michel Lespinasse c1e8d7c6a7 mmap locking API: convert mmap_sem comments
Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Michel Lespinasse d8ed45c5dc mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)
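
To show the effect of the rule on a call site, here is a small userspace sketch. The rw_semaphore and mm_struct definitions are toy stand-ins (simple counters, not real locks), used only to show that callers pass the mm itself instead of reaching into mm->mmap_sem.

```c
/* Toy stand-ins for the kernel types; these counters are NOT real locks. */
struct rw_semaphore { int readers; int writer; };
struct mm_struct { struct rw_semaphore mmap_sem; };

static void down_read(struct rw_semaphore *s)  { s->readers++; }
static void up_read(struct rw_semaphore *s)    { s->readers--; }
static void down_write(struct rw_semaphore *s) { s->writer = 1; }
static void up_write(struct rw_semaphore *s)   { s->writer = 0; }

/* Wrappers in the style of the new API: the lock field is an
 * implementation detail hidden behind the mm pointer. */
static inline void mmap_read_lock(struct mm_struct *mm)    { down_read(&mm->mmap_sem); }
static inline void mmap_read_unlock(struct mm_struct *mm)  { up_read(&mm->mmap_sem); }
static inline void mmap_write_lock(struct mm_struct *mm)   { down_write(&mm->mmap_sem); }
static inline void mmap_write_unlock(struct mm_struct *mm) { up_write(&mm->mmap_sem); }
```

A call site that used to read down_read(&mm->mmap_sem) becomes mmap_read_lock(mm), which is exactly the textual substitution the rule performs.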

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Linus Torvalds 94709049fb Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "A few little subsystems and a start of a lot of MM patches.

  Subsystems affected by this patch series: squashfs, ocfs2, parisc,
  vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
  swap, memcg, pagemap, memory-failure, vmalloc, kasan"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits)
  kasan: move kasan_report() into report.c
  mm/mm_init.c: report kasan-tag information stored in page->flags
  ubsan: entirely disable alignment checks under UBSAN_TRAP
  kasan: fix clang compilation warning due to stack protector
  x86/mm: remove vmalloc faulting
  mm: remove vmalloc_sync_(un)mappings()
  x86/mm/32: implement arch_sync_kernel_mappings()
  x86/mm/64: implement arch_sync_kernel_mappings()
  mm/ioremap: track which page-table levels were modified
  mm/vmalloc: track which page-table levels were modified
  mm: add functions to track page directory modifications
  s390: use __vmalloc_node in stack_alloc
  powerpc: use __vmalloc_node in alloc_vm_stack
  arm64: use __vmalloc_node in arch_alloc_vmap_stack
  mm: remove vmalloc_user_node_flags
  mm: switch the test_vmalloc module to use __vmalloc_node
  mm: remove __vmalloc_node_flags_caller
  mm: remove both instances of __vmalloc_node_flags
  mm: remove the prot argument to __vmalloc_node
  mm: remove the pgprot argument to __vmalloc
  ...
2020-06-02 12:21:36 -07:00
NeilBrown a37b0715dd mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE
PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
daemon needs to write to one bdi (the final bdi) in order to free up
writes queued to another bdi (the client bdi).

The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processes have been
throttled.  The purpose of this is to avoid the deadlock that happens when
the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
but it is being throttled and cannot write.

This approach was designed when all threads were blocked equally,
independently on which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially with
different threads getting different allowances based on non-trivial
heuristics.  This means the simple "add 25%" heuristic is no longer
reliable.

The important issue is not that the daemon needs a *larger* dirty page
allowance, but that it needs a *private* dirty page allowance, so that
dirty pages for the "client" bdi that it is helping to clear (the bdi
for an NFS filesystem or loop block device etc) do not affect the
throttling of the daemon writing to the "final" bdi.

This patch changes the heuristic so that the task is not throttled when
the bdi it is writing to has a dirty page count below (or equal
to) the free-run threshold for that bdi.  This ensures it will always be
able to have some pages in flight, and so will not deadlock.

In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
still be throttled by global threshold, but that is acceptable as it is
only the deadlock state that is interesting for this flag.

This approach of "only throttle when target bdi is busy" is consistent
with the other use of PF_LESS_THROTTLE in current_may_throttle(), where
it causes attention to be focussed only on the target bdi.

So this patch
 - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
 - removes the 25% bonus that that flag gives, and
 - If PF_LOCAL_THROTTLE is set, don't delay at all unless the
   global and the local free-run thresholds are exceeded.
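
The third point can be sketched as a small predicate. This is a simplified model of the rule as worded above, not the writeback code itself: the dirty_state struct and should_throttle() are hypothetical names, and the real throttling logic takes more inputs than this.

```c
#include <stdbool.h>

/* Hypothetical snapshot of dirty page counts and free-run thresholds. */
struct dirty_state {
	unsigned long global_dirty, global_freerun;
	unsigned long bdi_dirty, bdi_freerun;
};

/* Model: nobody is delayed below the global free-run threshold, and a
 * PF_LOCAL_THROTTLE task is also not delayed while its own bdi is at or
 * below that bdi's free-run threshold. */
static bool should_throttle(const struct dirty_state *s, bool local_throttle)
{
	if (s->global_dirty <= s->global_freerun)
		return false;
	if (local_throttle && s->bdi_dirty <= s->bdi_freerun)
		return false;
	return true;
}
```

In this model the flagged task is delayed only when both thresholds are exceeded, so it can always keep some pages in flight on a quiet target bdi.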

Note that previously realtime threads were treated the same as
PF_LESS_THROTTLE threads.  This patch does *not* change the behaviour
for real-time threads, so it is now different from the behaviour of nfsd
and loop tasks.  I don't know what is wanted for realtime.

[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Chuck Lever <chuck.lever@oracle.com>	[nfsd]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:08 -07:00
Al Viro ce5155c4f8 compat sysinfo(2): don't bother with field-by-field copyout
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-25 18:06:05 -04:00
Cyril Hrubis ecc421e05b sys/sysinfo: Respect boottime inside time namespace
The sysinfo() syscall includes uptime in seconds but has no correction for
time namespaces which makes it inconsistent with the /proc/uptime inside of
a time namespace.

Add the missing time namespace adjustment call.
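
The adjustment amounts to adding the namespace's boottime offset before the value is reduced to seconds. The sketch below models that arithmetic in userspace; add_boottime_offset() and uptime_seconds() are hypothetical helpers, and rounding a partial second up is an assumption of the sketch.

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Add a namespace offset (tv_sec may be negative, tv_nsec is assumed to be
 * in [0, NSEC_PER_SEC)) to a boottime reading and normalise the result. */
static struct timespec add_boottime_offset(struct timespec t, struct timespec off)
{
	t.tv_sec += off.tv_sec;
	t.tv_nsec += off.tv_nsec;
	if (t.tv_nsec >= NSEC_PER_SEC) {
		t.tv_sec++;
		t.tv_nsec -= NSEC_PER_SEC;
	}
	return t;
}

/* Uptime in whole seconds as seen inside the namespace. */
static long uptime_seconds(struct timespec boottime, struct timespec ns_offset)
{
	struct timespec t = add_boottime_offset(boottime, ns_offset);

	return (long)t.tv_sec + (t.tv_nsec ? 1 : 0);
}
```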

Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dmitry Safonov <dima@arista.com>
Link: https://lkml.kernel.org/r/20200303150638.7329-1-chrubis@suse.cz
2020-03-03 19:34:32 +01:00
Mike Christie 8d19f1c8e1 prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
There are several storage drivers like dm-multipath, iscsi, tcmu-runner,
and nbd that have userspace components that can run in the IO path. For
example, iscsi and nbd's userspace daemons may need to recreate a socket
and/or send IO on it, and dm-multipath's daemon multipathd may need to
send SG IO or read/write IO to figure out the state of paths and re-set
them up.

In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
memalloc_*_save/restore functions to control the allocation behavior,
but for userspace we would end up hitting an allocation that ended up
writing data back to the same device we are trying to allocate for.
The device is then in a state of deadlock, because to execute IO the
device needs to allocate memory, but to allocate memory the memory
layers want to execute IO to the device.

Here is an example with nbd using a local userspace daemon that performs
network IO to a remote server. We are using XFS on top of the nbd device,
but it can happen with any FS or other modules layered on top of the nbd
device that can write out data to free memory.  Here a nbd daemon helper
thread, msgr-worker-1, is performing a write/sendmsg on a socket to execute
a request. This kicks off a reclaim operation which results in a WRITE to
the nbd device and the nbd thread calling back into the mm layer.

[ 1626.609191] msgr-worker-1   D    0  1026      1 0x00004000
[ 1626.609193] Call Trace:
[ 1626.609195]  ? __schedule+0x29b/0x630
[ 1626.609197]  ? wait_for_completion+0xe0/0x170
[ 1626.609198]  schedule+0x30/0xb0
[ 1626.609200]  schedule_timeout+0x1f6/0x2f0
[ 1626.609202]  ? blk_finish_plug+0x21/0x2e
[ 1626.609204]  ? _xfs_buf_ioapply+0x2e6/0x410
[ 1626.609206]  ? wait_for_completion+0xe0/0x170
[ 1626.609208]  wait_for_completion+0x108/0x170
[ 1626.609210]  ? wake_up_q+0x70/0x70
[ 1626.609212]  ? __xfs_buf_submit+0x12e/0x250
[ 1626.609214]  ? xfs_bwrite+0x25/0x60
[ 1626.609215]  xfs_buf_iowait+0x22/0xf0
[ 1626.609218]  __xfs_buf_submit+0x12e/0x250
[ 1626.609220]  xfs_bwrite+0x25/0x60
[ 1626.609222]  xfs_reclaim_inode+0x2e8/0x310
[ 1626.609224]  xfs_reclaim_inodes_ag+0x1b6/0x300
[ 1626.609227]  xfs_reclaim_inodes_nr+0x31/0x40
[ 1626.609228]  super_cache_scan+0x152/0x1a0
[ 1626.609231]  do_shrink_slab+0x12c/0x2d0
[ 1626.609233]  shrink_slab+0x9c/0x2a0
[ 1626.609235]  shrink_node+0xd7/0x470
[ 1626.609237]  do_try_to_free_pages+0xbf/0x380
[ 1626.609240]  try_to_free_pages+0xd9/0x1f0
[ 1626.609245]  __alloc_pages_slowpath+0x3a4/0xd30
[ 1626.609251]  ? ___slab_alloc+0x238/0x560
[ 1626.609254]  __alloc_pages_nodemask+0x30c/0x350
[ 1626.609259]  skb_page_frag_refill+0x97/0xd0
[ 1626.609274]  sk_page_frag_refill+0x1d/0x80
[ 1626.609279]  tcp_sendmsg_locked+0x2bb/0xdd0
[ 1626.609304]  tcp_sendmsg+0x27/0x40
[ 1626.609307]  sock_sendmsg+0x54/0x60
[ 1626.609308]  ___sys_sendmsg+0x29f/0x320
[ 1626.609313]  ? sock_poll+0x66/0xb0
[ 1626.609318]  ? ep_item_poll.isra.15+0x40/0xc0
[ 1626.609320]  ? ep_send_events_proc+0xe6/0x230
[ 1626.609322]  ? hrtimer_try_to_cancel+0x54/0xf0
[ 1626.609324]  ? ep_read_events_proc+0xc0/0xc0
[ 1626.609326]  ? _raw_write_unlock_irq+0xa/0x20
[ 1626.609327]  ? ep_scan_ready_list.constprop.19+0x218/0x230
[ 1626.609329]  ? __hrtimer_init+0xb0/0xb0
[ 1626.609331]  ? _raw_spin_unlock_irq+0xa/0x20
[ 1626.609334]  ? ep_poll+0x26c/0x4a0
[ 1626.609337]  ? tcp_tsq_write.part.54+0xa0/0xa0
[ 1626.609339]  ? release_sock+0x43/0x90
[ 1626.609341]  ? _raw_spin_unlock_bh+0xa/0x20
[ 1626.609342]  __sys_sendmsg+0x47/0x80
[ 1626.609347]  do_syscall_64+0x5f/0x1c0
[ 1626.609349]  ? prepare_exit_to_usermode+0x75/0xa0
[ 1626.609351]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

This patch adds a new prctl command that daemons can use after they have
done their initial setup, and before they start to do allocations that
are in the IO path. It sets the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE
flags so both userspace block and FS threads can use it to avoid the
allocation recursion and try to avoid being throttled while
writing out data to free up memory.
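
A daemon might use the new commands roughly as follows. This is a minimal sketch: set_io_flusher() and get_io_flusher() are hypothetical wrapper names, the fallback constants are taken from the Linux UAPI header for systems whose libc headers predate the feature, and the call is expected to fail with a negative errno value when the caller lacks CAP_SYS_RESOURCE or the kernel does not support it.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Values from the Linux UAPI <linux/prctl.h>, for older libc headers. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/* Mark (or unmark) the calling thread as an IO flusher.
 * Returns 0 on success or a negative errno value. */
static int set_io_flusher(int enable)
{
	if (prctl(PR_SET_IO_FLUSHER, enable ? 1UL : 0UL, 0UL, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}

/* Returns 1 or 0 for the current state, or a negative errno value. */
static int get_io_flusher(void)
{
	int ret = prctl(PR_GET_IO_FLUSHER, 0UL, 0UL, 0UL, 0UL);

	return ret == -1 ? -errno : ret;
}
```

The intended call order is the one described above: finish the initial setup first, then call set_io_flusher(1) before entering the IO path.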

Signed-off-by: Mike Christie <mchristi@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Masato Suzuki <masato.suzuki@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20191112001900.9206-1-mchristi@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-01-28 10:09:51 +01:00
Joe Perches 5e1aada08c kernel/sys.c: avoid copying possible padding bytes in copy_to_user
Initialization is not guaranteed to zero padding bytes so use an
explicit memset instead to avoid leaking any kernel content in any
possible padding bytes.
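
The pattern can be illustrated with a small userspace example. The report struct and report_fill() are hypothetical and unrelated to the structure touched by this patch; the point is only that memset() clears every byte of the object, padding included, before the fields are assigned.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical struct: most ABIs insert padding after 'tag' so that
 * 'value' is suitably aligned. */
struct report {
	char tag;
	long value;
};

/* Zero the whole object first, then set the members one by one, so that
 * no stale bytes remain in the padding when the struct is copied out. */
static void report_fill(struct report *r, char tag, long value)
{
	memset(r, 0, sizeof(*r));
	r->tag = tag;
	r->value = value;
}
```

An initializer such as struct report r = { .tag = 'x' } zeroes the unnamed members, but the language does not promise anything about the padding bytes, which is the gap the explicit memset() closes.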

Link: http://lkml.kernel.org/r/dfa331c00881d61c8ee51577a082d8bebd61805c.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Dan Carpenter <error27@gmail.com>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:12 -08:00
Arnd Bergmann bdd565f817 y2038: rusage: use __kernel_old_timeval
There are two 'struct timeval' fields in 'struct rusage'.

Unfortunately the definition of timeval is now ambiguous when used in
user space with a libc that has a 64-bit time_t, and this also changes
the 'rusage' definition in user space in a way that is incompatible with
the system call interface.

While there is no good solution to avoid all ambiguity here, change
the definition in the kernel headers to be compatible with the kernel
ABI, using __kernel_old_timeval as an unambiguous base type.

In previous discussions, there was also a plan to add a replacement
for rusage based on 64-bit timestamps and nanosecond resolution,
i.e. 'struct __kernel_timespec'. I have patches for that as well,
if anyone thinks we should do that.

Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:29 +01:00
Linus Torvalds 7f2444d38f Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core timer updates from Thomas Gleixner:
 "Timers and timekeeping updates:

   - A large overhaul of the posix CPU timer code which is a preparation
     for moving the CPU timer expiry out into task work so it can be
     properly accounted on the task/process.

     An update to the bogus permission checks will come later during the
     merge window as feedback was not complete before heading off for
     travel.

   - Switch the timerqueue code to use cached rbtrees and get rid of the
     homebrewn caching of the leftmost node.

   - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
     single function

   - Implement the separation of hrtimers to be forced to expire in hard
     interrupt context even when PREEMPT_RT is enabled and mark the
     affected timers accordingly.

   - Implement a mechanism for hrtimers and the timer wheel to protect
     RT against priority inversion and live lock issues when a (hr)timer
     which should be canceled is currently executing the callback.
     Instead of infinitely spinning, the task which tries to cancel the
     timer blocks on a per cpu base expiry lock which is held and
     released by the (hr)timer expiry code.

   - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
     resulting in faster access to timekeeping functions.

   - Updates to various clocksource/clockevent drivers and their device
     tree bindings.

   - The usual small improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
  posix-cpu-timers: Fix permission check regression
  posix-cpu-timers: Always clear head pointer on dequeue
  hrtimer: Add a missing bracket and hide `migration_base' on !SMP
  posix-cpu-timers: Make expiry_active check actually work correctly
  posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
  tick: Mark sched_timer to expire in hard interrupt context
  hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
  x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
  posix-cpu-timers: Utilize timerqueue for storage
  posix-cpu-timers: Move state tracking to struct posix_cputimers
  posix-cpu-timers: Deduplicate rlimit handling
  posix-cpu-timers: Remove pointless comparisons
  posix-cpu-timers: Get rid of 64bit divisions
  posix-cpu-timers: Consolidate timer expiry further
  posix-cpu-timers: Get rid of zero checks
  rlimit: Rewrite non-sensical RLIMIT_CPU comment
  posix-cpu-timers: Respect INFINITY for hard RTTIME limit
  posix-cpu-timers: Switch thread group sampling to array
  posix-cpu-timers: Restructure expiry array
  posix-cpu-timers: Remove cputime_expires
  ...
2019-09-17 12:35:15 -07:00
Linus Torvalds 22331f8952 Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu-feature updates from Ingo Molnar:

 - Rework the Intel model names symbols/macros, which were decades of
   ad-hoc extensions and added random noise. It's now a coherent, easy
   to follow nomenclature.

 - Add new Intel CPU model IDs:
    - "Tiger Lake" desktop and mobile models
    - "Elkhart Lake" model ID
    - and the "Lightning Mountain" variant of Airmont, plus support code

 - Add the new AVX512_VP2INTERSECT instruction to cpufeatures

 - Remove Intel MPX user-visible APIs and the self-tests, because the
   toolchain (gcc) is not supporting it going forward. This is the
   first, lowest-risk phase of MPX removal.

 - Remove X86_FEATURE_MFENCE_RDTSC

 - Various smaller cleanups and fixes

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  x86/cpu: Update init data for new Airmont CPU model
  x86/cpu: Add new Airmont variant to Intel family
  x86/cpu: Add Elkhart Lake to Intel family
  x86/cpu: Add Tiger Lake to Intel family
  x86: Correct misc typos
  x86/intel: Add common OPTDIFFs
  x86/intel: Aggregate microserver naming
  x86/intel: Aggregate big core graphics naming
  x86/intel: Aggregate big core mobile naming
  x86/intel: Aggregate big core client naming
  x86/cpufeature: Explain the macro duplication
  x86/ftrace: Remove mcount() declaration
  x86/PCI: Remove superfluous returns from void functions
  x86/msr-index: Move AMD MSRs where they belong
  x86/cpu: Use constant definitions for CPU models
  lib: Remove redundant ftrace flag removal
  x86/crash: Remove unnecessary comparison
  x86/bitops: Use __builtin_constant_p() directly instead of IS_IMMEDIATE()
  x86: Remove X86_FEATURE_MFENCE_RDTSC
  x86/mpx: Remove MPX APIs
  ...
2019-09-16 18:47:53 -07:00