Commit Graph

12 Commits

Author SHA1 Message Date
Donald Dutile 573fa8ea71 module/decompress: use kvmalloc() consistently
JIRA: https://issues.redhat.com/browse/RHEL-28063

commit 17fc8084aa8f9d5235f252fc3978db657dd77e92
Author: Andrea Righi <andrea.righi@canonical.com>
Date:   Thu Nov 2 09:19:14 2023 +0100

    module/decompress: use kvmalloc() consistently

    We consistently switched from kmalloc() to vmalloc() in module
    decompression to prevent potential memory allocation failures with large
    modules; however, vmalloc() is not as memory-efficient or as fast as
    kmalloc().

    Since we don't know in general the size of the workspace required by the
    decompression algorithm, it is more reasonable to use kvmalloc()
    consistently, also considering that we don't have special memory
    requirements here.

    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Tested-by: Andrea Righi <andrea.righi@canonical.com>
    Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2024-06-17 14:17:29 -04:00
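
As a rough illustration of the kvmalloc() commit above (573fa8ea71), the sketch below models the allocation policy in userspace Python. It is not kernel code: `kvmalloc_model` and the `contiguous_ok` callback are hypothetical names, and the 4 KiB page size is an assumption. The model tries the physically contiguous allocator first and falls back to vmalloc() only for requests larger than one page.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def kvmalloc_model(size, contiguous_ok):
    """Return the allocator a kvmalloc()-style policy would end up using.

    contiguous_ok is a hypothetical callback standing in for "a physically
    contiguous allocation of this size succeeds right now".
    """
    if size <= 0:
        raise ValueError("size must be positive")
    # First choice: the fast, physically contiguous allocator.
    if contiguous_ok(size):
        return "kmalloc"
    # Sub-page requests are not retried through vmalloc(); they simply fail.
    if size <= PAGE_SIZE:
        return None
    # Larger requests fall back to virtually contiguous memory.
    return "vmalloc"


# Hypothetical fragmented system: only requests up to two pages succeed.
fragmented = lambda size: size <= 2 * PAGE_SIZE
print(kvmalloc_model(512, fragmented))    # kmalloc
print(kvmalloc_model(42312, fragmented))  # vmalloc
```

Small workspaces keep the cheaper kmalloc() path, while large ones still succeed when contiguous memory is scarce.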
Donald Dutile 64d79afe53 module/decompress: use vmalloc() for gzip decompression workspace
JIRA: https://issues.redhat.com/browse/RHEL-28063

commit 3737df782c740b944912ed93420c57344b1cf864
Author: Andrea Righi <andrea.righi@canonical.com>
Date:   Wed Aug 30 17:58:20 2023 +0200

    module/decompress: use vmalloc() for gzip decompression workspace

    Use a similar approach as commit a419beac4a07 ("module/decompress: use
    vmalloc() for zstd decompression workspace") and replace kmalloc() with
    vmalloc() also for the gzip module decompression workspace.

    In this case the workspace is represented by struct inflate_workspace
    that can be fairly large for kmalloc() and it can potentially lead to
    allocation errors on certain systems:

    $ pahole inflate_workspace
    struct inflate_workspace {
            struct inflate_state       inflate_state;        /*     0  9544 */
            /* --- cacheline 149 boundary (9536 bytes) was 8 bytes ago --- */
            unsigned char              working_window[32768]; /*  9544 32768 */

            /* size: 42312, cachelines: 662, members: 2 */
            /* last cacheline: 8 bytes */
    };

    Considering that there is no need to use physically contiguous memory,
    simply switch to vmalloc() to provide a more reliable in-kernel module
    decompression.

    Fixes: b1ae6dc41eaa ("module: add in-kernel support for decompressing")
    Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2024-06-17 14:17:29 -04:00
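
To make the size concern in the gzip commit above (64d79afe53) concrete, the sketch below works out how large a physically contiguous block a 42312-byte request needs. It assumes 4 KiB pages and that a large kmalloc() request is served as a power-of-two block of pages; `pages_needed` and `allocation_order` are illustrative helper names, not kernel functions.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86_64


def pages_needed(size):
    """Number of whole pages required to hold size bytes."""
    return -(-size // PAGE_SIZE)


def allocation_order(size):
    """Smallest order such that (PAGE_SIZE << order) >= size."""
    order = 0
    while (PAGE_SIZE << order) < size:
        order += 1
    return order


# Member sizes taken from the pahole output in the commit message.
workspace = 9544 + 32768
order = allocation_order(workspace)
print(workspace)            # 42312
print(pages_needed(workspace))  # 11
print(order)                # 4
print(PAGE_SIZE << order)   # 65536
```

Under these assumptions the workspace needs 16 physically contiguous pages, which is harder to find on a fragmented system than 11 scattered pages mapped by vmalloc().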
Donald Dutile ca8a2f786d module/decompress: use vmalloc() for zstd decompression workspace
JIRA: https://issues.redhat.com/browse/RHEL-28063

commit a419beac4a070aff63c520f36ebf7cb8a76a8ae5
Author: Andrea Righi <andrea.righi@canonical.com>
Date:   Tue Aug 29 14:05:08 2023 +0200

    module/decompress: use vmalloc() for zstd decompression workspace

    Using kmalloc() to allocate the decompression workspace for zstd may
    trigger the following warning when large modules are loaded (e.g., xfs):

    [    2.961884] WARNING: CPU: 1 PID: 254 at mm/page_alloc.c:4453 __alloc_pages+0x2c3/0x350
    ...
    [    2.989033] Call Trace:
    [    2.989841]  <TASK>
    [    2.990614]  ? show_regs+0x6d/0x80
    [    2.991573]  ? __warn+0x89/0x160
    [    2.992485]  ? __alloc_pages+0x2c3/0x350
    [    2.993520]  ? report_bug+0x17e/0x1b0
    [    2.994506]  ? handle_bug+0x51/0xa0
    [    2.995474]  ? exc_invalid_op+0x18/0x80
    [    2.996469]  ? asm_exc_invalid_op+0x1b/0x20
    [    2.997530]  ? module_zstd_decompress+0xdc/0x2a0
    [    2.998665]  ? __alloc_pages+0x2c3/0x350
    [    2.999695]  ? module_zstd_decompress+0xdc/0x2a0
    [    3.000821]  __kmalloc_large_node+0x7a/0x150
    [    3.001920]  __kmalloc+0xdb/0x170
    [    3.002824]  module_zstd_decompress+0xdc/0x2a0
    [    3.003857]  module_decompress+0x37/0xc0
    [    3.004688]  init_module_from_file+0xd0/0x100
    [    3.005668]  idempotent_init_module+0x11c/0x2b0
    [    3.006632]  __x64_sys_finit_module+0x64/0xd0
    [    3.007568]  do_syscall_64+0x59/0x90
    [    3.008373]  ? ksys_read+0x73/0x100
    [    3.009395]  ? exit_to_user_mode_prepare+0x30/0xb0
    [    3.010531]  ? syscall_exit_to_user_mode+0x37/0x60
    [    3.011662]  ? do_syscall_64+0x68/0x90
    [    3.012511]  ? do_syscall_64+0x68/0x90
    [    3.013364]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8

    However, physically contiguous memory does not seem to be required in
    module_zstd_decompress(), so use vmalloc() instead, to prevent the
    warning and avoid potential failures at loading compressed modules.

    Fixes: 169a58ad824d ("module/decompress: Support zstd in-kernel decompression")
    Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2024-06-17 14:17:29 -04:00
Donald Dutile a5ba435c43 module/decompress: Fix error checking on zstd decompression
JIRA: https://issues.redhat.com/browse/RHEL-28063

commit fadb74f9f2f609238070c7ca1b04933dc9400e4a
Author: Lucas De Marchi <lucas.demarchi@intel.com>
Date:   Thu Jun 1 14:23:31 2023 -0700

    module/decompress: Fix error checking on zstd decompression

    While implementing support for in-kernel decompression in kmod,
    finit_module() was returning a very suspicious value:

            finit_module(3, "", MODULE_INIT_COMPRESSED_FILE) = 18446744072717407296

    It turns out the check for module_get_next_page() failing is wrong,
    and hence the decompression was not really taking place. Invert
    the condition to fix it.

    Fixes: 169a58ad824d ("module/decompress: Support zstd in-kernel decompression")
    Cc: stable@kernel.org
    Cc: Luis Chamberlain <mcgrof@kernel.org>
    Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
    Cc: Stephen Boyd <swboyd@chromium.org>
    Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2024-06-17 14:17:27 -04:00
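
One way to see why the finit_module() return value in the commit above (a5ba435c43) is suspicious is to reinterpret it as a signed 64-bit number and compare it with the range the kernel reserves for error codes, [-4095, -1]. The helper names below are illustrative only.

```python
MAX_ERRNO = 4095  # the kernel reserves [-4095, -1] for error codes


def to_signed64(value):
    """Reinterpret an unsigned 64-bit value as two's-complement signed."""
    if not 0 <= value < 1 << 64:
        raise ValueError("not a 64-bit unsigned value")
    return value - (1 << 64) if value >= 1 << 63 else value


def looks_like_errno(value):
    """True when the value decodes to a valid negative error code."""
    return -MAX_ERRNO <= to_signed64(value) <= -1


ret = 18446744072717407296    # value reported in the commit message
print(to_signed64(ret))       # -992144320
print(hex(ret))               # 0xffffffffc4dd1440
print(looks_like_errno(ret))  # False
```

The value is negative but far outside the errno range, which is consistent with the inverted check returning something derived from a valid pointer rather than a real error code.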
Donald Dutile e41f7154cf module: add debug stats to help identify memory pressure
JIRA: https://issues.redhat.com/browse/RHEL-28063

Conflicts:
   Add a RHEL-only MODULE_STATS setting of n for now. It may be enabled in
   future -debug kernels; to be determined.
   The rest of the commit is added to enable clean backports of further commits.

commit df3e764d8e5cd416efee29e0de3c93917dff5d33
Author: Luis Chamberlain <mcgrof@kernel.org>
Date:   Tue Mar 28 20:03:19 2023 -0700

    module: add debug stats to help identify memory pressure

    Loading modules with finit_module() can end up using vmalloc(), vmap()
    and vmalloc() again, for a total of up to 3 separate allocations in the
    worst case for a single module. We always kernel_read*() the module,
    that's a vmalloc(). Then vmap() is used for the module decompression,
    and if so the last read buffer is freed as we use the now decompressed
    module buffer to stuff data into our copy module. The last allocation is
    specific to each architecture, but it is generally a series of vmalloc()
    calls or a variation of vmalloc() to handle ELF sections with special
    permissions.

    Evaluation with new stress-ng module support [1] with just 100 ops
    is proving that you can end up using GiBs of data easily even with all
    the care we have in the kernel and userspace today in trying to not load
    modules which are already loaded. 100 ops seems to resemble the sort of
    pressure a system with about 400 CPUs can create on module loading.
    Although issues relating to duplicate module requests due to each CPU
    incurring a new module request are silly and some of these are being
    fixed, we currently lack proper tooling to help diagnose easily what
    happened, when it happened and who is likely to blame -- userspace or
    kernel module autoloading.

    Provide an initial set of stats which use debugfs to let us easily scrape
    post-boot information about failed loads. This sort of information can
    be used on production workloads to try to optimize *avoiding* redundant
    memory pressure using finit_module().

    There's a few examples that can be provided:

    A 255 vCPU system without the next patch in this series applied:

    Startup finished in 19.143s (kernel) + 7.078s (userspace) = 26.221s
    graphical.target reached after 6.988s in userspace

    And 13.58 GiB of virtual memory space lost due to failed module loading:

    root@big ~ # cat /sys/kernel/debug/modules/stats
             Mods ever loaded       67
         Mods failed on kread       0
    Mods failed on decompress       0
      Mods failed on becoming       0
          Mods failed on load       1411
            Total module size       11464704
          Total mod text size       4194304
           Failed kread bytes       0
      Failed decompress bytes       0
        Failed becoming bytes       0
            Failed kmod bytes       14588526272
     Virtual mem wasted bytes       14588526272
             Average mod size       171115
        Average mod text size       62602
      Average fail load bytes       10339140
    Duplicate failed modules:
                  module-name        How-many-times                    Reason
                    kvm_intel                   249                      Load
                          kvm                   249                      Load
                    irqbypass                     8                      Load
             crct10dif_pclmul                   128                      Load
          ghash_clmulni_intel                    27                      Load
                 sha512_ssse3                    50                      Load
               sha512_generic                   200                      Load
                  aesni_intel                   249                      Load
                  crypto_simd                    41                      Load
                       cryptd                   131                      Load
                        evdev                     2                      Load
                    serio_raw                     1                      Load
                   virtio_pci                     3                      Load
                         nvme                     3                      Load
                    nvme_core                     3                      Load
        virtio_pci_legacy_dev                     3                      Load
        virtio_pci_modern_dev                     3                      Load
                       t10_pi                     3                      Load
                       virtio                     3                      Load
                 crc32_pclmul                     6                      Load
               crc64_rocksoft                     3                      Load
                 crc32c_intel                    40                      Load
                  virtio_ring                     3                      Load
                        crc64                     3                      Load

    The following screenshot, of a simple 8 vCPU, 8 GiB KVM guest with the
    next patch in this series applied, shows that 226.53 MiB are wasted in
    virtual memory allocations due to duplicate module requests during boot.
    It also shows an average module memory size of 167.10 KiB and an average
    module .text + .init.text size of 61.13 KiB. The end shows all modules
    which were detected as duplicate requests and whether they failed early,
    after just the first kernel_read*() call, or late, after we've already
    allocated the private space for the module in layout_and_allocate(). A
    system with module decompression would reveal more wasted virtual memory
    space.

    We should put effort now into identifying the source of these duplicate
    module requests and trimming these down as much as possible. Larger
    systems will obviously show much more wasted virtual memory allocations.

    root@kmod ~ # cat /sys/kernel/debug/modules/stats
             Mods ever loaded       67
         Mods failed on kread       0
    Mods failed on decompress       0
      Mods failed on becoming       83
          Mods failed on load       16
            Total module size       11464704
          Total mod text size       4194304
           Failed kread bytes       0
      Failed decompress bytes       0
        Failed becoming bytes       228959096
            Failed kmod bytes       8578080
     Virtual mem wasted bytes       237537176
             Average mod size       171115
        Average mod text size       62602
      Avg fail becoming bytes       2758544
      Average fail load bytes       536130
    Duplicate failed modules:
                  module-name        How-many-times                    Reason
                    kvm_intel                     7                  Becoming
                          kvm                     7                  Becoming
                    irqbypass                     6           Becoming & Load
             crct10dif_pclmul                     7           Becoming & Load
          ghash_clmulni_intel                     7           Becoming & Load
                 sha512_ssse3                     6           Becoming & Load
               sha512_generic                     7           Becoming & Load
                  aesni_intel                     7                  Becoming
                  crypto_simd                     7           Becoming & Load
                       cryptd                     3           Becoming & Load
                        evdev                     1                  Becoming
                    serio_raw                     1                  Becoming
                         nvme                     3                  Becoming
                    nvme_core                     3                  Becoming
                       t10_pi                     3                  Becoming
                   virtio_pci                     3                  Becoming
                 crc32_pclmul                     6           Becoming & Load
               crc64_rocksoft                     3                  Becoming
                 crc32c_intel                     3                  Becoming
        virtio_pci_modern_dev                     2                  Becoming
        virtio_pci_legacy_dev                     1                  Becoming
                        crc64                     2                  Becoming
                       virtio                     2                  Becoming
                  virtio_ring                     2                  Becoming

    [0] https://github.com/ColinIanKing/stress-ng.git
    [1] echo 0 > /proc/sys/vm/oom_dump_tasks
        ./stress-ng --module 100 --module-name xfs

    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2024-06-17 14:17:25 -04:00
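
The figures quoted in the stats commit above (e41f7154cf) can be cross-checked with a little arithmetic. The sketch below uses only numbers from the two stats dumps. The printed averages turn out to be consistent with integer division that rounds up, but that is an inference from the numbers, not something the commit message states. `div_round_up`, `to_mib` and `to_kib` are illustrative helper names.

```python
def div_round_up(total, count):
    """Integer division that rounds up."""
    return -(-total // count)


def to_mib(nbytes):
    return nbytes / 2**20


def to_kib(nbytes):
    return nbytes / 2**10


# Values from the second (8 vCPU guest) stats dump.
failed_becoming = 228959096
failed_load = 8578080
wasted = failed_becoming + failed_load

print(wasted)                             # 237537176
print(f"{to_mib(wasted):.2f} MiB")        # 226.53 MiB
print(div_round_up(failed_becoming, 83))  # 2758544
print(div_round_up(failed_load, 16))      # 536130
print(div_round_up(11464704, 67))         # 171115
print(f"{to_kib(171115):.2f} KiB")        # 167.10 KiB
print(f"{to_kib(62602):.2f} KiB")         # 61.13 KiB
```

The "Virtual mem wasted bytes" line is the sum of the per-stage failure byte counts, so on this guest nearly all of the waste comes from loads that failed at the "becoming" stage.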
Donald Dutile bee0352179 module/decompress: Never use kunmap() for local un-mappings
JIRA: https://issues.redhat.com/browse/RHEL-28063

commit 3c17655ab13704582fe25e8ea3200a9b2f8bf20a
Author: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Date:   Wed Mar 15 13:52:56 2023 +0100

    module/decompress: Never use kunmap() for local un-mappings

    Use kunmap_local() to unmap pages locally mapped with kmap_local_page().

    kunmap_local() must be called on the kernel virtual address returned by
    kmap_local_page(), unlike kunmap(), which instead expects the mapped
    page as its argument.

    In module_zstd_decompress() we currently map with kmap_local_page() and
    unmap with kunmap(). This breaks the code and so it should be fixed.

    Cc: Piotr Gorski <piotrgorski@cachyos.org>
    Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
    Cc: Luis Chamberlain <mcgrof@kernel.org>
    Cc: Stephen Boyd <swboyd@chromium.org>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Fixes: 169a58ad824d ("module/decompress: Support zstd in-kernel decompression")
    Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
    Reviewed-by: Stephen Boyd <swboyd@chromium.org>
    Reviewed-by: Ira Weiny <ira.weiny@intel.com>
    Reviewed-by: Piotr Gorski <piotrgorski@cachyos.org>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2024-06-17 14:17:23 -04:00
Donald Dutile c966289643 module/decompress: Support zstd in-kernel decompression
JIRA: https://issues.redhat.com/browse/RHEL-28063

commit 169a58ad824d896b9e291a27193342616e651b82
Author: Stephen Boyd <swboyd@chromium.org>
Date:   Tue Dec 6 13:53:18 2022 -0800

    module/decompress: Support zstd in-kernel decompression

    Add support for zstd compressed modules to the in-kernel decompression
    code. This allows zstd compressed modules to be decompressed by the
    kernel, similar to the existing support for gzip and xz compressed
    modules.

    Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
    Cc: Piotr Gorski <lucjan.lucjanov@gmail.com>
    Cc: Nick Terrell <terrelln@fb.com>
    Signed-off-by: Stephen Boyd <swboyd@chromium.org>
    Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
    Reviewed-by: Piotr Gorski <lucjan.lucjanov@gmail.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2024-06-17 14:17:22 -04:00
Donald Dutile a38de3863c module: Fix NULL vs IS_ERR checking for module_get_next_page
JIRA: https://issues.redhat.com/browse/RHEL-28063

commit 45af1d7aae7d5520d2858f8517a1342646f015db
Author: Miaoqian Lin <linmq006@gmail.com>
Date:   Thu Nov 10 06:58:34 2022 +0400

    module: Fix NULL vs IS_ERR checking for module_get_next_page

    The module_get_next_page() function returns error pointers on error
    instead of NULL.
    Use IS_ERR() to check the return value to fix this.

    Fixes: b1ae6dc41eaa ("module: add in-kernel support for decompressing")
    Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
    Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2024-06-17 14:17:21 -04:00
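
The NULL vs IS_ERR distinction in the commit above (a38de3863c) can be illustrated with a small userspace model of the kernel's error-pointer convention, where a negative errno is stored in the top 4095 values of the address space. This is a sketch, not the kernel implementation: it assumes 64-bit pointers, represents pointers as plain integers, and the lower-case function names are stand-ins for ERR_PTR(), IS_ERR() and PTR_ERR().

```python
MAX_ERRNO = 4095
ENOMEM = 12
WORD = 1 << 64  # assumption: 64-bit pointers


def err_ptr(errno):
    """Encode a negative errno as a pointer-sized unsigned value."""
    return errno % WORD


def is_err(ptr):
    """True when ptr falls in the top MAX_ERRNO values of the address space."""
    return ptr >= WORD - MAX_ERRNO


def ptr_err(ptr):
    """Recover the negative errno from an error pointer."""
    return ptr - WORD


page = err_ptr(-ENOMEM)
print(page == 0)      # False: a NULL check misses the failure
print(is_err(page))   # True
print(ptr_err(page))  # -12
```

Because an error pointer is never zero, a `!page` test treats the failure as success, which is the bug the commit fixes.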
Donald Dutile 95330b81e8 module/decompress: generate sysfs string at compile time
JIRA: https://issues.redhat.com/browse/RHEL-28063

commit 77d6354bd422c8a451ef7d2235322dbf33e7427b
Author: David Disseldorp <ddiss@suse.de>
Date:   Tue Sep 6 10:03:18 2022 +0200

    module/decompress: generate sysfs string at compile time

    compression_show() before (with noinline):
       0xffffffff810b5ff0 <+0>:     mov    %rdx,%rdi
       0xffffffff810b5ff3 <+3>:     mov    $0xffffffff81b55629,%rsi
       0xffffffff810b5ffa <+10>:    mov    $0xffffffff81b0cde2,%rdx
       0xffffffff810b6001 <+17>:    call   0xffffffff811b8fd0 <sysfs_emit>
       0xffffffff810b6006 <+22>:    cltq
       0xffffffff810b6008 <+24>:    ret

    After:
       0xffffffff810b5ff0 <+0>:     mov    $0xffffffff81b0cde2,%rsi
       0xffffffff810b5ff7 <+7>:     mov    %rdx,%rdi
       0xffffffff810b5ffa <+10>:    call   0xffffffff811b8fd0 <sysfs_emit>
       0xffffffff810b5fff <+15>:    cltq
       0xffffffff810b6001 <+17>:    ret

    Signed-off-by: David Disseldorp <ddiss@suse.de>
    Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
    Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2024-06-17 14:17:20 -04:00
Donald Dutile 249edb55d1 module: Replace kmap() with kmap_local_page()
JIRA: https://issues.redhat.com/browse/RHEL-28063

commit 554694ba120b87e39cf732ed632e6a0c52fafb7c
Author: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Date:   Wed Jul 20 18:19:32 2022 +0200

    module: Replace kmap() with kmap_local_page()

    kmap() is being deprecated in favor of kmap_local_page().

    Two main problems with kmap(): (1) It comes with an overhead as mapping
    space is restricted and protected by a global lock for synchronization and
    (2) it also requires global TLB invalidation when the kmap’s pool wraps
    and it might block when the mapping space is fully utilized until a slot
    becomes available.

    With kmap_local_page() the mappings are per thread, CPU local, can take
    page faults, and can be called from any context (including interrupts).
    Tasks can be preempted and, when scheduled to run again, the kernel
    virtual addresses are restored and still valid.

    kmap_local_page() is faster than kmap() in kernels with HIGHMEM enabled.

    Since the use of kmap_local_page() in module_gzip_decompress() and in
    module_xz_decompress() is safe (i.e., it does not break the strict rules
    of use), it should be preferred over kmap().

    Therefore, replace kmap() with kmap_local_page().

    Tested on a QEMU/KVM x86_32 VM with 4GB RAM, booting kernels with
    HIGHMEM64GB enabled. Modules compressed with XZ or GZIP decompress
    properly.

    Cc: Matthew Wilcox <willy@infradead.com>
    Suggested-by: Ira Weiny <ira.weiny@intel.com>
    Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2024-06-17 14:17:18 -04:00
Donald Dutile 4469abdac2 module: Make internal.h and decompress.c more compliant
JIRA: https://issues.redhat.com/browse/RHEL-28063

commit 5aff4dfdb4ae2741cfff759d917f597f2c7f70aa
Author: Aaron Tomlin <atomlin@redhat.com>
Date:   Tue Mar 22 14:03:33 2022 +0000

    module: Make internal.h and decompress.c more compliant

    This patch will address the following warning and style violations
    generated by ./scripts/checkpatch.pl in strict mode:

      WARNING: Use #include <linux/module.h> instead of <asm/module.h>
      #10: FILE: kernel/module/internal.h:10:
      +#include <asm/module.h>

      CHECK: spaces preferred around that '-' (ctx:VxV)
      #18: FILE: kernel/module/internal.h:18:
      +#define INIT_OFFSET_MASK (1UL << (BITS_PER_LONG-1))

      CHECK: Please use a blank line after function/struct/union/enum declarations
      #69: FILE: kernel/module/internal.h:69:
      +}
      +static inline void module_decompress_cleanup(struct load_info *info)
                                                       ^
      CHECK: extern prototypes should be avoided in .h files
      #84: FILE: kernel/module/internal.h:84:
      +extern int mod_verify_sig(const void *mod, struct load_info *info);

      WARNING: Missing a blank line after declarations
      #116: FILE: kernel/module/decompress.c:116:
      +               struct page *page = module_get_next_page(info);
      +               if (!page) {

      WARNING: Missing a blank line after declarations
      #174: FILE: kernel/module/decompress.c:174:
      +               struct page *page = module_get_next_page(info);
      +               if (!page) {

      CHECK: Please use a blank line after function/struct/union/enum declarations
      #258: FILE: kernel/module/decompress.c:258:
      +}
      +static struct kobj_attribute module_compression_attr = __ATTR_RO(compression);

    Note: Fortunately, the multiple-include optimisation found in
    include/linux/module.h will prevent the header from being included
    more than once.

    Fixes: f314dfea16 ("modsign: log module name in the event of an error")
    Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
    Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2024-06-17 14:17:12 -04:00
Donald Dutile 9aee9de53d module: Move all into module/
JIRA: https://issues.redhat.com/browse/RHEL-28063

Conflicts:
  RHEL9 had already applied 12 patches to kernel/modules, several of which
  originated after this upstream commit and were modified to use the
  existing kernel/module*.c files. As a result, the source removals differ
  because of those commits, and the additions/moves to kernel/module/*.c
  mirror those removals, preserving the same functionality.
  Because of this out-of-order patch application, the RHEL9-applied patches
  show up in further backport conflicts, which are listed on a
  per-commit-backport basis.

commit cfc1d277891eb499b3b5354df33b30f598683e90
Author: Aaron Tomlin <atomlin@redhat.com>
Date:   Tue Mar 22 14:03:31 2022 +0000

    module: Move all into module/

    No functional changes.

    This patch moves all module related code into a separate directory,
    modifies each file name and creates a new Makefile. Note: this effort
    is in preparation to refactor core module code.

    Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
    Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
    Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Signed-off-by: Donald Dutile <ddutile@redhat.com>
2024-06-17 14:17:12 -04:00