Commit Graph

1108 Commits

Paolo Bonzini 2e46f803bb KVM: Allow page-sized MMU caches to be initialized with custom 64-bit values
JIRA: https://issues.redhat.com/browse/RHEL-16745

Add support to MMU caches for initializing a page with a custom 64-bit
value, e.g. to pre-fill an entire page table with non-zero PTE values.
The functionality will be used by x86 to support Intel's TDX, which needs
to set bit 63 in all non-present PTEs in order to prevent !PRESENT page
faults from getting reflected into the guest (Intel's EPT Violation #VE
architecture made the less than brilliant decision of having the per-PTE
behavior be opt-out instead of opt-in).
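A minimal userspace sketch of what "initialize a page with a custom 64-bit value" means in practice: a page-sized table of 512 8-byte entries is pre-filled with a non-zero value, e.g. bit 63 set in every non-present PTE as described above. The names and helper here are illustrative, not the kernel's MMU-cache API.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: models a page-sized cache object (512 x 8-byte
 * entries on x86-64) pre-filled with a custom init value instead of
 * zero, e.g. bit 63 set so !PRESENT faults are not reflected as #VE. */
#define PTES_PER_PAGE   512
#define SUPPRESS_VE_BIT (1ULL << 63)

void prefill_pte_page(uint64_t *page, uint64_t init_value)
{
    for (size_t i = 0; i < PTES_PER_PAGE; i++)
        page[i] = init_value;
}
```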

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <5919f685f109a1b0ebc6bd8fc4536ee94bcc172d.1705965635.git.isaku.yamahata@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit c23e2b7103090b05e4d567d8976f99926ea855e9)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-01 09:02:49 +02:00
Paolo Bonzini 646a408a35 KVM: delete .change_pte MMU notifier callback
JIRA: https://issues.redhat.com/browse/RHEL-16745

The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().

Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.

In the case of .change_pte(), commit 6bdb913f0a ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.

This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Anup Patel <anup@brainfault.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Message-ID: <20240405115815.3226315-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit f3b65bbaed7c43d10989380d4b95e2a3e9fe5a6b)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Conflicts: no loongarch or RISC-V
2024-07-01 09:02:49 +02:00
Maxim Levitsky bb49195a43 KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start()
JIRA: https://issues.redhat.com/browse/RHEL-32430

commit eefb85b3f0310c2f4149c50cb9b13094ed1dde25
Author: Sean Christopherson <seanjc@google.com>
Date:   Mon Mar 4 16:37:42 2024 -0800

    KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start()

    Remove gfn_to_pfn_cache_invalidate_start()'s unused @may_block parameter,
    which was leftover from KVM's abandoned (for now) attempt to support guest
    usage of gfn_to_pfn caches.

    Fixes: a4bff3df5147 ("KVM: pfncache: remove KVM_GUEST_USES_PFN usage")
    Reported-by: Like Xu <like.xu.linux@gmail.com>
    Cc: Paul Durrant <paul@xen.org>
    Cc: David Woodhouse <dwmw2@infradead.org>
    Reviewed-by: Paul Durrant <paul@xen.org>
    Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
    Link: https://lore.kernel.org/r/20240305003742.245767-1-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2024-05-13 23:58:51 -04:00
Maxim Levitsky 5f19e5d0d7 KVM: Get rid of return value from kvm_arch_create_vm_debugfs()
JIRA: https://issues.redhat.com/browse/RHEL-32430

commit 284851ee5caef1b42b513752bf1642ce4570bdc1
Author: Oliver Upton <oliver.upton@linux.dev>
Date:   Fri Feb 16 15:59:41 2024 +0000

    KVM: Get rid of return value from kvm_arch_create_vm_debugfs()

    The general expectation with debugfs is that any initialization failure
    is nonfatal. Nevertheless, kvm_arch_create_vm_debugfs() allows
    implementations to return an error and kvm_create_vm_debugfs() allows
    that to fail VM creation.

    Change to a void return to discourage architectures from making debugfs
    failures fatal for the VM. Seems like everyone already had the right
    idea, as all implementations already return 0 unconditionally.
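A userspace analog of the signature change described above (hypothetical names): once the debugfs-style init returns void, a failure inside it can only be logged and can never abort VM creation.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative sketch, not the kernel code: debugfs init failures are
 * nonfatal, so the hook returns void and creation always proceeds. */
static bool debugfs_available = false;

void arch_create_vm_debugfs(void)
{
    if (!debugfs_available)
        fprintf(stderr, "debugfs init skipped (nonfatal)\n");
    /* no error is propagated to the caller */
}

int create_vm(void)
{
    arch_create_vm_debugfs(); /* cannot fail VM creation */
    return 0;
}
```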

    Acked-by: Marc Zyngier <maz@kernel.org>
    Acked-by: Paolo Bonzini <pbonzini@redhat.com>
    Link: https://lore.kernel.org/r/20240216155941.2029458-1-oliver.upton@linux.dev
    Signed-off-by: Oliver Upton <oliver.upton@linux.dev>


Conflicts:
   - missing commit faf01aef0570757bfbf1d655e984742c1dd38068
      KVM: PPC: Merge powerpc's debugfs entry content into generic entry

   - out of order backport of 77bcd9e6231a5297ef417a7d7f734d61c2bcceb6
     KVM: Add dedicated arch hook for querying if vCPU was preempted in-kernel

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2024-05-13 18:38:36 -04:00
Maxim Levitsky 69fb7506ee KVM: fix kvm_mmu_memory_cache allocation warning
JIRA: https://issues.redhat.com/browse/RHEL-32430

commit ea3689d9df50c283cb5d647a74aa45e2cc3f8064
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Mon Feb 12 12:24:10 2024 +0100

    KVM: fix kvm_mmu_memory_cache allocation warning

    gcc-14 notices that the arguments to kvmalloc_array() are mixed up:

    arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function '__kvm_mmu_topup_memory_cache':
    arch/x86/kvm/../../../virt/kvm/kvm_main.c:424:53: error: 'kvmalloc_array' sizes specified with 'sizeof' in the earlier argument and not in the later argument [-Werror=calloc-transposed-args]
      424 |                 mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
          |                                                     ^~~~
    arch/x86/kvm/../../../virt/kvm/kvm_main.c:424:53: note: earlier argument should specify number of elements, later size of each element

    The code still works correctly, but the incorrect order prevents the compiler
    from properly tracking the object sizes.
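A userspace sketch of the argument-order fix, using `calloc()` as the analog of `kvmalloc_array()`: both take the element count first, then the element size. The commit's bug passed `sizeof(void *)` as the count; the product is the same, but the compiler can no longer track the object size.

```c
#include <stdlib.h>

/* Illustrative analog of the fixed call: count of elements first,
 * then size of each element, exactly like calloc(nmemb, size).
 * Passing them transposed still allocates the right number of bytes
 * but defeats compiler object-size tracking (gcc-14's
 * -Wcalloc-transposed-args warning quoted above). */
void **alloc_object_array(size_t capacity)
{
    return calloc(capacity, sizeof(void *));
}
```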

    Fixes: 837f66c71207 ("KVM: Allow for different capacities in kvm_mmu_memory_cache structs")
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Reviewed-by: Marc Zyngier <maz@kernel.org>
    Link: https://lore.kernel.org/r/20240212112419.1186065-1-arnd@kernel.org
    Signed-off-by: Sean Christopherson <seanjc@google.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2024-05-13 18:38:35 -04:00
Maxim Levitsky 38e17133d5 KVM: Add a comment explaining the directed yield pending interrupt logic
JIRA: https://issues.redhat.com/browse/RHEL-32430

commit dafc17dd529a6194e199b837916062090562ff80
Author: Sean Christopherson <seanjc@google.com>
Date:   Tue Jan 9 16:39:38 2024 -0800

    KVM: Add a comment explaining the directed yield pending interrupt logic

    Add a comment to explain why KVM treats vCPUs with pending interrupts as
    in-kernel when a vCPU wants to yield to a vCPU that was preempted while
    running in kernel mode.

    Link: https://lore.kernel.org/r/20240110003938.490206-5-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2024-05-13 18:38:35 -04:00
Maxim Levitsky 6822e9fe71 KVM: Add dedicated arch hook for querying if vCPU was preempted in-kernel
JIRA: https://issues.redhat.com/browse/RHEL-32430

commit 77bcd9e6231a5297ef417a7d7f734d61c2bcceb6
Author: Sean Christopherson <seanjc@google.com>
Date:   Tue Jan 9 16:39:35 2024 -0800

    KVM: Add dedicated arch hook for querying if vCPU was preempted in-kernel

    Plumb in a dedicated hook for querying whether or not a vCPU was preempted
    in-kernel.  Unlike literally every other architecture, x86's VMX can check
    if a vCPU is in kernel context if and only if the vCPU is loaded on the
    current pCPU.

    x86's kvm_arch_vcpu_in_kernel() works around the limitation by querying
    kvm_get_running_vcpu() and redirecting to vcpu->arch.preempted_in_kernel
    as needed.  But that's unnecessary, confusing, and fragile, e.g. x86 has
    had at least one bug where KVM incorrectly used a stale
    preempted_in_kernel.

    No functional change intended.

    Reviewed-by: Yuan Yao <yuan.yao@intel.com>
    Link: https://lore.kernel.org/r/20240110003938.490206-2-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2024-05-13 18:38:35 -04:00
Maxim Levitsky a8beb29bcc kvm: replace __KVM_HAVE_READONLY_MEM with Kconfig symbol
JIRA: https://issues.redhat.com/browse/RHEL-32430

commit 8886640dade4ae2595fcdce511c8bcc716aa47d3
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Thu Jan 11 03:00:34 2024 -0500

    kvm: replace __KVM_HAVE_READONLY_MEM with Kconfig symbol

    KVM uses __KVM_HAVE_* symbols in the architecture-dependent uapi/asm/kvm.h to mask
    unused definitions in include/uapi/linux/kvm.h.  __KVM_HAVE_READONLY_MEM however
    was nothing but a misguided attempt to define KVM_CAP_READONLY_MEM only on
    architectures where KVM_CHECK_EXTENSION(KVM_CAP_READONLY_MEM) could possibly
    return nonzero.  This however does not make sense, and it prevented userspace
    from supporting this architecture-independent feature without recompilation.

    Therefore, these days __KVM_HAVE_READONLY_MEM does not mask anything and
    is only used in virt/kvm/kvm_main.c.  Userspace does not need to test it
    and there should be no need for it to exist.  Remove it and replace it
    with a Kconfig symbol within Linux source code.

    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>


Conflicts:
   - no support for RISC-V
   - no support for loongarch
   - out of order backport of e563592224e02f87048edee3ce3f0da16cceee88
     KVM: Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2024-05-13 18:38:35 -04:00
Maxim Levitsky fdd92172c0 KVM: Harden against unpaired kvm_mmu_notifier_invalidate_range_end() calls
JIRA: https://issues.redhat.com/browse/RHEL-32430

commit d489ec95658392a000dd26fba511eec1900245b0
Author: Sean Christopherson <seanjc@google.com>
Date:   Tue Jan 9 16:42:39 2024 -0800

    KVM: Harden against unpaired kvm_mmu_notifier_invalidate_range_end() calls

    When handling the end of an mmu_notifier invalidation, WARN if
    mn_active_invalidate_count is already 0 and do not decrement it further, i.e.
    avoid causing mn_active_invalidate_count to underflow/wrap.  In the worst
    case scenario, effectively corrupting mn_active_invalidate_count could
    cause kvm_swap_active_memslots() to hang indefinitely.

    end() calls are *supposed* to be paired with start(), i.e. underflow can
    only happen if there is a bug elsewhere in the kernel, but due to lack of
    lockdep assertions in the mmu_notifier helpers, it's all too easy for a
    bug to go unnoticed for some time, e.g. see the recently introduced
    PAGEMAP_SCAN ioctl().

    Ideally, mmu_notifiers would incorporate lockdep assertions, but users of
    mmu_notifiers aren't required to hold any one specific lock, i.e. adding
    the necessary annotations to make lockdep aware of all locks that are
    mutually exclusive with mm_take_all_locks() isn't trivial.
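A simplified userspace model of the hardening described above (illustrative names): an unpaired end() call trips a warning and leaves the counter at zero instead of wrapping to a huge unsigned value.

```c
#include <stdio.h>

/* Toy model of the invalidation counter; not the kernel code. */
static unsigned long mn_active_invalidate_count;

void invalidate_range_start(void)
{
    mn_active_invalidate_count++;
}

void invalidate_range_end(void)
{
    if (mn_active_invalidate_count == 0) {
        /* stand-in for WARN_ON(): report the bug, don't underflow */
        fprintf(stderr, "WARN: unpaired invalidate_range_end()\n");
        return;
    }
    mn_active_invalidate_count--;
}

unsigned long active_invalidations(void)
{
    return mn_active_invalidate_count;
}
```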

    Link: https://lore.kernel.org/all/000000000000f6d051060c6785bc@google.com
    Link: https://lore.kernel.org/r/20240110004239.491290-1-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2024-05-13 18:38:35 -04:00
Maxim Levitsky 7bee92cc98 KVM: Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY
JIRA: https://issues.redhat.com/browse/RHEL-32430

commit e563592224e02f87048edee3ce3f0da16cceee88
Author: Sean Christopherson <seanjc@google.com>
Date:   Thu Feb 22 11:06:08 2024 -0800

    KVM: Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY

    Disallow creating read-only memslots that support GUEST_MEMFD, as
    GUEST_MEMFD is fundamentally incompatible with KVM's semantics for
    read-only memslots.  Read-only memslots allow the userspace VMM to emulate
    option ROMs by filling the backing memory with readable, executable code
    and data, while triggering emulated MMIO on writes.  GUEST_MEMFD doesn't
    currently support writes from userspace and KVM doesn't support emulated
    MMIO on private accesses, i.e. the guest can only ever read zeros, and
    writes will always be treated as errors.
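A sketch of the flag check this commit introduces. The flag bit positions below mirror the uapi constants (KVM_MEM_READONLY is bit 1, KVM_MEM_GUEST_MEMFD bit 2), but they are redefined locally here; verify against include/uapi/linux/kvm.h before relying on them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the uapi memslot flags (assumed values). */
#define MEM_LOG_DIRTY_PAGES (1u << 0)
#define MEM_READONLY        (1u << 1)
#define MEM_GUEST_MEMFD     (1u << 2)

bool memslot_flags_valid(uint32_t flags)
{
    uint32_t valid = MEM_LOG_DIRTY_PAGES | MEM_READONLY | MEM_GUEST_MEMFD;

    /* GUEST_MEMFD slots cannot be read-only: no userspace writes and
     * no emulated MMIO on private accesses, as explained above. */
    if (flags & MEM_GUEST_MEMFD)
        valid &= ~MEM_READONLY;

    return (flags & ~valid) == 0;
}
```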

    Cc: Fuad Tabba <tabba@google.com>
    Cc: Michael Roth <michael.roth@amd.com>
    Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
    Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
    Cc: Chao Peng <chao.p.peng@linux.intel.com>
    Fixes: a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
    Link: https://lore.kernel.org/r/20240222190612.2942589-2-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2024-05-13 18:38:35 -04:00
Maxim Levitsky ac5ad86c20 KVM: remove deprecated UAPIs
JIRA: https://issues.redhat.com/browse/RHEL-32430

commit a5d3df8ae13fada772fbce952e9ee7b3433dba16
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Wed Nov 8 10:34:03 2023 +0100

    KVM: remove deprecated UAPIs

    The deprecated interfaces were removed 15 years ago.  KVM's
    device assignment was deprecated in 4.2 and removed 6.5 years
    ago; the only interest might be in compiling ancient versions
    of QEMU, but QEMU has been using its own imported copy of the
    kernel headers since June 2011.  So again we go into archaeology
    territory; just remove the cruft.

    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2024-05-13 18:38:35 -04:00
Maxim Levitsky fedcd1aa15 KVM: remove CONFIG_HAVE_KVM_IRQFD
JIRA: https://issues.redhat.com/browse/RHEL-32430

commit c5b31cc2371728ddefe9baf1d036aeb630a25d96
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Wed Oct 18 12:07:32 2023 -0400

    KVM: remove CONFIG_HAVE_KVM_IRQFD

    All platforms with a kernel irqchip have support for irqfd.  Unify the
    two configuration items so that userspace can expect to use irqfd to
    inject interrupts into the irqchip.

    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Conflicts:
    - no support for RISC-V

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2024-05-13 18:38:35 -04:00
Maxim Levitsky 563463dfc6 KVM: Harden copying of userspace-array against overflow
JIRA: https://issues.redhat.com/browse/RHEL-32430

commit 1f829359c8c37f77a340575957686ca8c4bca317
Author: Philipp Stanner <pstanner@redhat.com>
Date:   Thu Nov 2 19:15:26 2023 +0100

    KVM: Harden copying of userspace-array against overflow

    kvm_main.c utilizes vmemdup_user() and array_size() to copy a userspace
    array. Currently, this does not check for an overflow.

    Use the new wrapper vmemdup_array_user() to copy the array more safely.

    Note, KVM explicitly checks the number of entries before duplicating the
    array, i.e. adding the overflow check should be a glorified nop.
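A userspace sketch of what vmemdup_array_user() adds over a plain multiply-and-copy (hypothetical helper name): the nmemb * size product is checked for overflow before allocating, so a huge element count fails cleanly instead of allocating a too-small buffer.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative overflow-checked array duplication; not the kernel
 * helper itself. */
void *dup_array_checked(const void *src, size_t nmemb, size_t size)
{
    /* reject counts whose byte total would wrap around SIZE_MAX */
    if (size != 0 && nmemb > SIZE_MAX / size)
        return NULL;

    void *dst = malloc(nmemb * size);
    if (dst)
        memcpy(dst, src, nmemb * size);
    return dst;
}
```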

    Suggested-by: Dave Airlie <airlied@redhat.com>
    Signed-off-by: Philipp Stanner <pstanner@redhat.com>
    Link: https://lore.kernel.org/r/20231102181526.43279-4-pstanner@redhat.com
    [sean: call out that KVM pre-checks the number of entries]
    Signed-off-by: Sean Christopherson <seanjc@google.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2024-05-13 18:38:35 -04:00
Maxim Levitsky 67e57ed362 KVM: move KVM_CAP_DEVICE_CTRL to the generic check
JIRA: https://issues.redhat.com/browse/RHEL-32430

commit 63912245c19d3a4179da44beefd017eb9270f207
Author: Wei Wang <wei.w.wang@intel.com>
Date:   Wed Mar 15 18:16:06 2023 +0800

    KVM: move KVM_CAP_DEVICE_CTRL to the generic check

    KVM_CAP_DEVICE_CTRL allows userspace to check if the kvm_device
    framework (e.g. KVM_CREATE_DEVICE) is supported by KVM. Move
    KVM_CAP_DEVICE_CTRL to the generic check for the two reasons:
    1) it already supports arch agnostic usages (i.e. KVM_DEV_TYPE_VFIO).
    For example, a userspace VFIO implementation may need to create
    KVM_DEV_TYPE_VFIO on x86, riscv, arm, etc. It is simpler to have it
    checked in the generic code than in each arch's code.
    2) KVM_CREATE_DEVICE has been added to the generic code.

    Link: https://lore.kernel.org/all/20221215115207.14784-1-wei.w.wang@intel.com
    Signed-off-by: Wei Wang <wei.w.wang@intel.com>
    Reviewed-by: Sean Christopherson <seanjc@google.com>
    Acked-by: Anup Patel <anup@brainfault.org> (riscv)
    Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
    Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
    Link: https://lore.kernel.org/r/20230315101606.10636-1-wei.w.wang@intel.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>


Conflicts:
    no support for RISC-V

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2024-05-13 18:36:42 -04:00
Maxim Levitsky 4c5041073d KVM: Convert comment into an assertion in kvm_io_bus_register_dev()
JIRA: https://issues.redhat.com/browse/RHEL-32430

commit b1a39a718db44ecb18c2a99a11e15f6eedc14c53
Author: Marc Zyngier <maz@kernel.org>
Date:   Thu Dec 7 15:12:01 2023 +0000

    KVM: Convert comment into an assertion in kvm_io_bus_register_dev()

    Instead of having a comment indicating the need to hold slots_lock
    when calling kvm_io_bus_register_dev(), make it explicit with
    a lockdep assertion.
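A userspace analog of replacing a "caller must hold slots_lock" comment with a lockdep assertion (illustrative names, plain assert() standing in for lockdep_assert_held()): the locking requirement is now checked at runtime instead of merely documented.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a flag tracks whether the lock is held. */
static bool slots_lock_held;

void lock_slots(void)   { slots_lock_held = true; }
void unlock_slots(void) { slots_lock_held = false; }

int io_bus_register_dev(void)
{
    /* stand-in for lockdep_assert_held(&kvm->slots_lock) */
    assert(slots_lock_held);
    return 0;
}
```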

    Signed-off-by: Marc Zyngier <maz@kernel.org>
    Link: https://lore.kernel.org/r/20231207151201.3028710-6-maz@kernel.org
    Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2024-05-13 18:36:24 -04:00
Maxim Levitsky a8799262fe Revert "KVM: Prevent module exit until all VMs are freed"
JIRA: https://issues.redhat.com/browse/RHEL-32430

commit ea61294befd361ab8260c65d53987b400e5599a7
Author: Sean Christopherson <seanjc@google.com>
Date:   Wed Oct 18 13:46:24 2023 -0700

    Revert "KVM: Prevent module exit until all VMs are freed"

    Revert KVM's misguided attempt to "fix" a use-after-module-unload bug that
    was actually due to failure to flush a workqueue, not a lack of module
    refcounting.  Pinning the KVM module until kvm_vm_destroy() doesn't
    prevent use-after-free due to the module being unloaded, as userspace can
    invoke delete_module() the instant the last reference to KVM is put, i.e.
    can cause all KVM code to be unmapped while KVM is actively executing said
    code.

    Generally speaking, the many instances of module_put(THIS_MODULE)
    notwithstanding, outside of a few special paths, a module can never safely
    put the last reference to itself without creating deadlock, i.e. something
    external to the module *must* put the last reference.  In other words,
    having VMs grab a reference to the KVM module is futile, pointless, and as
    evidenced by the now-reverted commit 70375c2d8fa3 ("Revert "KVM: set owner
    of cpu and vm file operations""), actively dangerous.

    This reverts commit 405294f29faee5de8c10cb9d4a90e229c2835279 and commit
    5f6de5cbebee925a612856fce6f9182bb3eee0db.

    Fixes: 405294f29fae ("KVM: Unconditionally get a ref to /dev/kvm module when creating a VM")
    Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed")
    Link: https://lore.kernel.org/r/20231018204624.1905300-4-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2024-05-13 18:36:21 -04:00
Maxim Levitsky a55fac8f9d KVM: Set file_operations.owner appropriately for all such structures
JIRA: https://issues.redhat.com/browse/RHEL-32430

commit 087e15206d6ac0d46734e2b0ab34370c0fdca481
Author: Sean Christopherson <seanjc@google.com>
Date:   Wed Oct 18 13:46:22 2023 -0700

    KVM: Set file_operations.owner appropriately for all such structures

    Set .owner for all KVM-owned file types so that the KVM module is pinned
    until any files with callbacks back into KVM are completely freed.  Using
    "struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive
    while there are active VMs, doesn't provide full protection.

    Userspace can invoke delete_module() the instant the last reference to KVM
    is put.  If KVM itself puts the last reference, e.g. via kvm_destroy_vm(),
    then it's possible for KVM to be preempted and deleted/unloaded before KVM
    fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back
    in, it will jump to a code page that is no longer mapped.

    Note, file types that can call into sub-module code, e.g. kvm-intel.ko or
    kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not
    THIS_MODULE (which points at kvm.ko).  KVM assumes that if /dev/kvm is
    reachable, e.g. VMs are active, then the vendor module is loaded.

    To reduce the probability of forgetting to set .owner entirely, use
    THIS_MODULE for stats files where KVM does not call back into vendor code.

    This reverts commit 70375c2d8fa3fb9b0b59207a9c5df1e2e1205c10, and fixes
    several other file types that have been buggy since their introduction.

    Fixes: 70375c2d8fa3 ("Revert "KVM: set owner of cpu and vm file operations"")
    Fixes: 3bcd0662d66f ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file")
    Reported-by: Al Viro <viro@zeniv.linux.org.uk>
    Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV
    Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2024-05-13 18:36:21 -04:00
Paolo Bonzini 1f6ace35d4 KVM: Allow arch code to track number of memslot address spaces per VM
JIRA: https://issues.redhat.com/browse/RHEL-14702

Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git

Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs.  Confidential VMs are
fundamentally incompatible with emulating SMM, which as the name suggests
requires being able to read and write guest memory and register state.

Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-23-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit eed52e434bc33603ddb0af62b6c4ef818948489d)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:46 +01:00
Paolo Bonzini f5219db0c0 KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
JIRA: https://issues.redhat.com/browse/RHEL-14702

Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git

Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.

A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem.  With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings.   E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allowed guest protections.  Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.

Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping.  Decoupling the mapping sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.

Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.

A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).

More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd.  While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption.  And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must prevent host
userspace from accessing guest memory irrespective of hardware behavior.

Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.

Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping.  And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.

Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory.  That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to its demise.

Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem.  I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.

Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay.  Paraphrasing heavily, Christian suggested KVM
stop being lazy and create a proper API.

Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit a7800aa80ea4d5356b8474c2302812e9d4926fa6)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:46 +01:00
Paolo Bonzini a1e93a70a6 KVM: Introduce per-page memory attributes
JIRA: https://issues.redhat.com/browse/RHEL-14702

Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git

In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.

Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.

Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.

Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.

Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.

To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation.  For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.

It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.

Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 5a475554db1e476a14216e742ea2bdb77362d5d5)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:45 +01:00
Paolo Bonzini a05c523304 KVM: Drop .on_unlock() mmu_notifier hook
JIRA: https://issues.redhat.com/browse/RHEL-14702

Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git

Drop the .on_unlock() mmu_notifier hook now that it's no longer used for
notifying arch code that memory has been reclaimed.  Adding .on_unlock()
and invoking it *after* dropping mmu_lock was a terrible idea, as doing so
resulted in .on_lock() and .on_unlock() having divergent and asymmetric
behavior, and set future developers up for failure, i.e. all but asked for
bugs where KVM relied on using .on_unlock() to try to run a callback while
holding mmu_lock.

Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to
guard against future bugs of this nature.

Reported-by: Isaku Yamahata <isaku.yamahata@intel.com>
Link: https://lore.kernel.org/all/20230802203119.GB2021422@ls.amr.corp.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 193bbfaacc84f9ee9c281ec0a8dd2ec8e4821e57)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:45 +01:00
Paolo Bonzini 9bdf2ec011 KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory
JIRA: https://issues.redhat.com/browse/RHEL-14702

Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git

Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
__kvm_handle_hva_range() return whether or not an overlapping memslot
was found, i.e. mmu_lock was acquired.  Using the .on_unlock() hook
works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping
mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.

Use a small struct to return the tuple of the notifier-specific return,
plus whether or not overlap was found.  Because the iteration helpers are
__always_inlined, practically speaking, the struct will never actually be
returned from a function call (not to mention the size of the struct will
be two bytes in practice).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit cec29eef0a815386d520d61c2cbe16d537931639)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:45 +01:00
Paolo Bonzini ff5dd1c30e KVM: Introduce KVM_SET_USER_MEMORY_REGION2
JIRA: https://issues.redhat.com/browse/RHEL-14702

Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git

Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
information can be supplied without setting userspace up to fail.  The
padding in the new kvm_userspace_memory_region2 structure will be used to
pass a file descriptor in addition to the userspace_addr, i.e. allow
userspace to point at a file descriptor and map memory into a guest that
is NOT mapped into host userspace.

Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
without a new ioctl(), but as Paolo pointed out, adding a new ioctl()
makes detection of bad flags a bit more robust, e.g. if the new fd field
is guarded only by a flag and not a new ioctl(), then a userspace bug
(setting a "bad" flag) would generate out-of-bounds access instead of an
-EINVAL error.

Cc: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-9-seanjc@google.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit bb58b90b1a8f753b582055adaf448214a8e22c31)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

[RHEL: context differences due to KVM/ARM 6.6 not having been backported yet]
2023-12-01 14:51:45 +01:00
Paolo Bonzini e1cb1f3f0e KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER
JIRA: https://issues.redhat.com/browse/RHEL-14702

Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git

Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
appropriate to effectively maintain existing behavior.  Using a proper
Kconfig will simplify building more functionality on top of KVM's
mmu_notifier infrastructure.

Add a forward declaration of kvm_gfn_range to kvm_types.h so that
including arch/powerpc/include/asm/kvm_ppc.h's with CONFIG_KVM=n doesn't
generate warnings due to kvm_gfn_range being undeclared.  PPC defines
hooks for PR vs. HV without guarding them via #ifdeffery, e.g.

  bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);

Alternatively, PPC could forward declare kvm_gfn_range, but there's no
good reason not to define it in common KVM.

Acked-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit f128cf8cfbecccf95e891ae90d9c917df5117c7a)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

[RHEL: no loongarch/riscv, conflict differences for MIPS]
2023-12-01 14:51:45 +01:00
Paolo Bonzini ce59ee4fc7 KVM: WARN if there are dangling MMU invalidations at VM destruction
JIRA: https://issues.redhat.com/browse/RHEL-14702

Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git

Add an assertion that there are no in-progress MMU invalidations when a
VM is being destroyed, with the exception of the scenario where KVM
unregisters its MMU notifier between an .invalidate_range_start() call and
the corresponding .invalidate_range_end().

KVM can't detect unpaired calls from the mmu_notifier due to the above
exception waiver, but the assertion can detect KVM bugs, e.g. such as the
bug that *almost* escaped initial guest_memfd development.

Link: https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5389@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit d497a0fab8b8457214fcc9b1a39530920ea7e95e)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:44 +01:00
Paolo Bonzini 36c68ae77f KVM: Use gfn instead of hva for mmu_notifier_retry
JIRA: https://issues.redhat.com/browse/RHEL-14702

Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git

Currently in mmu_notifier invalidate path, hva range is recorded and then
checked against by mmu_invalidate_retry_hva() in the page fault handling
path. However, for the soon-to-be-introduced private memory, a page fault
may not have a hva associated, checking gfn(gpa) makes more sense.

For existing hva-based shared memory, gfn is expected to work as well.
The only downside is that when multiple gfns alias a single hva, the
current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected to be small.

Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
[sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
Message-Id: <20231027182217.3615211-4-seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 8569992d64b8f750e34b7858eac5d7daaf0f80fd)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:44 +01:00
Paolo Bonzini c4e213c9a2 KVM: Assert that mmu_invalidate_in_progress *never* goes negative
JIRA: https://issues.redhat.com/browse/RHEL-14702

Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git

Move the assertion on the in-progress invalidation count from the primary
MMU's notifier path to KVM's common notification path, i.e. assert that
the count doesn't go negative even when the invalidation is coming from
KVM itself.

Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only
the affected VM, not the entire kernel.  A corrupted count is fatal to the
VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry()
to block any and all attempts to install new mappings.  But it's far from
guaranteed that an end() without a start() is fatal or even problematic to
anything other than the target VM, e.g. the underlying bug could simply be
a duplicate call to end().  And it's much more likely that a missed
invalidation, i.e. a potential use-after-free, would manifest as no
notification whatsoever, not an end() without a start().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-3-seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit c0db19232c1ed6bd7fcb825c28b014c52732c19e)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:44 +01:00
Paolo Bonzini e7599a55b2 KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges
JIRA: https://issues.redhat.com/browse/RHEL-14702

Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git

Rework and rename "struct kvm_hva_range" into "kvm_mmu_notifier_range" so
that the structure can be used to handle notifications that operate on gfn
context, i.e. that aren't tied to a host virtual address.  Rename the
handler typedef too (arguably it should always have been gfn_handler_t).

Practically speaking, this is a nop for 64-bit kernels as the only
meaningful change is to store start+end as u64s instead of unsigned longs.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-2-seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit e97b39c5c4362dc1cbc37a563ddac313b96c84f3)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:44 +01:00
Paolo Bonzini 3b1a10cee0 KVM: Wrap kvm_{gfn,hva}_range.pte in a per-action union
JIRA: https://issues.redhat.com/browse/RHEL-14702

Wrap kvm_{gfn,hva}_range.pte in a union so that future notifier events can
pass event specific information up and down the stack without needing to
constantly expand and churn the APIs.  Lockless aging of SPTEs will pass
around a bitmap, and support for memory attributes will pass around the
new attributes for the range.

Add a "KVM_NO_ARG" placeholder to simplify handling events without an
argument (creating a dummy union variable is mildly annoying).

Opportunistically drop explicit zero-initialization of the "pte" field, as
omitting the field (now a union) has the same effect.

Cc: Yu Zhao <yuzhao@google.com>
Link: https://lore.kernel.org/all/CAOUHufagkd2Jk3_HrVoFFptRXM=hX2CV8f+M-dka-hJU4bP8kw@mail.gmail.com
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Yu Zhao <yuzhao@google.com>
Link: https://lore.kernel.org/r/20230729004144.1054885-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
(cherry picked from commit 3e1efe2b67d3d38116ec010968dbcd89d29e4561)

[RHEL-only: no RISC-V]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:15 +01:00
Paolo Bonzini a0b490964d KVM: Move kvm_arch_flush_remote_tlbs_memslot() to common code
JIRA: https://issues.redhat.com/browse/RHEL-14702

Move kvm_arch_flush_remote_tlbs_memslot() to common code and drop
"arch_" from the name. kvm_arch_flush_remote_tlbs_memslot() is just a
range-based TLB invalidation where the range is defined by the memslot.
Now that kvm_flush_remote_tlbs_range() can be called from common code we
can just use that and drop a bunch of duplicate code from the arch
directories.

Note this adds a lockdep assertion for slots_lock being held when
calling kvm_flush_remote_tlbs_memslot(), which was previously only
asserted on x86. MIPS has calls to kvm_flush_remote_tlbs_memslot(),
but they all hold the slots_lock, so the lockdep assertion continues to
hold true.

Also drop the CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT ifdef gating
kvm_flush_remote_tlbs_memslot(), since it is no longer necessary.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Acked-by: Anup Patel <anup@brainfault.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-7-rananta@google.com
(cherry picked from commit 619b5072443c05cf18c31b2c0320cdb42396d411)

[RHEL-only: no RISC-V]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:15 +01:00
Paolo Bonzini 29037d4964 KVM: Allow range-based TLB invalidation from common code
JIRA: https://issues.redhat.com/browse/RHEL-14702

Make kvm_flush_remote_tlbs_range() visible in common code and create a
default implementation that just invalidates the whole TLB.

This paves the way for several future features/cleanups:

 - Introduction of range-based TLBI on ARM.
 - Eliminating kvm_arch_flush_remote_tlbs_memslot()
 - Moving the KVM/x86 TDP MMU to common code.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-6-rananta@google.com
(cherry picked from commit d4788996051e3c07fadc6d9b214073fcf78810a8)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:15 +01:00
Paolo Bonzini 6e6233932e KVM: Remove CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL
JIRA: https://issues.redhat.com/browse/RHEL-14702

kvm_arch_flush_remote_tlbs() or CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL
are two mechanisms to solve the same problem, allowing
architecture-specific code to provide a non-IPI implementation of
remote TLB flushing.

Dropping CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL allows KVM to standardize
all architectures on kvm_arch_flush_remote_tlbs() instead of
maintaining two mechanisms.

Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-5-rananta@google.com
(cherry picked from commit eddd21481011008792f4e647a5244f6e15970abc)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:15 +01:00
Paolo Bonzini 9e3102d4a9 KVM: Rename kvm_arch_flush_remote_tlb() to kvm_arch_flush_remote_tlbs()
JIRA: https://issues.redhat.com/browse/RHEL-14702

Rename kvm_arch_flush_remote_tlb() and the associated macro
__KVM_HAVE_ARCH_FLUSH_REMOTE_TLB to kvm_arch_flush_remote_tlbs() and
__KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS respectively.

Making the name plural matches kvm_flush_remote_tlbs() and makes it more
clear that this function can affect more than one remote TLB.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-2-rananta@google.com
(cherry picked from commit a1342c8027288e345cc5fd16c6800f9d4eb788ed)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:14 +01:00
Paolo Bonzini 8a53b1f6ac KVM: destruct kvm_io_device while unregistering it from kvm_io_bus
JIRA: https://issues.redhat.com/browse/RHEL-14702

Current usage of kvm_io_device requires users to destruct it with an extra
call of kvm_iodevice_destructor after the device gets unregistered from
kvm_io_bus. This is not necessary and can cause errors if a user forgets
to make the extra call.

Simplify the usage by combining kvm_iodevice_destructor into
kvm_io_bus_unregister_dev. This reduces the line count a bit for users
and avoids leaks caused by forgetting to destruct the device explicitly.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230207123713.3905-2-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
(cherry picked from commit 5ea5ca3c2b4bf4090232e18cfc515dcb52f914a6)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:08 +01:00
Paolo Bonzini 41f13691f0 KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access page
JIRA: https://issues.redhat.com/browse/RHEL-14702

Now that KVM honors past and in-progress mmu_notifier invalidations when
reloading the APIC-access page, use KVM's "standard" invalidation hooks
to trigger a reload and delete the one-off usage of invalidate_range().

Aside from eliminating one-off code in KVM, dropping KVM's use of
invalidate_range() will allow common mmu_notifier to redefine the API to
be more strictly focused on invalidating secondary TLBs that share the
primary MMU's page tables.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230602011518.787006-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
(cherry picked from commit 0a8a5f2c8c266e9d94fb45f76a26cff135d0051c)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:51:07 +01:00
Paolo Bonzini 3e87dc3876 KVM: Standardize on "int" return types instead of "long" in kvm_main.c
JIRA: https://issues.redhat.com/browse/RHEL-14702

KVM functions use "long" return values for functions that are wired up
to "struct file_operations", but otherwise use "int" return values for
functions that can return 0/-errno in order to avoid unintentional
divergences between 32-bit and 64-bit kernels.
Some code still uses "long" in places where it is unnecessary, though,
which can cause a bit of confusion and superfluous size casts. Let's
change these spots to use "int" types, too.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20230208140105.655814-6-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit f15ba52bfabc3bc130053bd73d414d859162de91)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01 14:50:59 +01:00
Scott Weaver ee915e2ac8 Merge: KVM: aarch64: Rebase up to v6.5 (first round)
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3111

JIRA: https://issues.redhat.com/browse/RHEL-1760
Upstream Status: v6.5
Tested: Passed on virt-install, kvm/selftests, kvm-unit-tests, migration, pci-passthrough

This is the first round to rebase kvm-arm to upstream v6.5, which just pick up those independent commit. This merge request contains several parts:
1. Eager page splitting
2. Support writable cpuid reg from userspace
3. misc stuff

And the omitted things in the first round include:
1. FF-A proxy for pKVM
2. Support for Armv8.8 memcpy instructions in userspace
3. Permission Indirection Extension
4. Allow using VHE in the nVHE hypervisor
5. Some Rename Register per auto-gen tools
6. Fix setting SVE and SME traps in hVHE

Signed-off-by: Shaoqin Huang <shahuang@redhat.com>

Approved-by: Sebastian Ott <sebott@redhat.com>
Approved-by: Cornelia Huck <cohuck@redhat.com>
Approved-by: Gavin Shan <gshan@redhat.com>
Approved-by: Eric Auger <eric.auger@redhat.com>

Signed-off-by: Scott Weaver <scweaver@redhat.com>
2023-10-11 13:28:06 -04:00
Shaoqin Huang 951be3be18 KVM: arm64: Export kvm_are_all_memslots_empty()
JIRA: https://issues.redhat.com/browse/RHEL-1760

commit 26f457142d7ee2da20a5b701862230e4961423d9
Author: Ricardo Koller <ricarkol@google.com>
Date:   Wed Apr 26 17:23:22 2023 +0000

    KVM: arm64: Export kvm_are_all_memslots_empty()

    Export kvm_are_all_memslots_empty(). This will be used by a future
    commit when checking before setting a capability.

    Signed-off-by: Ricardo Koller <ricarkol@google.com>
    Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
    Reviewed-by: Gavin Shan <gshan@redhat.com>
    Link: https://lore.kernel.org/r/20230426172330.1439644-5-ricarkol@google.com
    Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

Signed-off-by: Shaoqin Huang <shahuang@redhat.com>
2023-09-25 03:21:51 -04:00
Jerome Marchand 9032c2fcb7 treewide: Trace IPIs sent via smp_send_reschedule()
Bugzilla: https://bugzilla.redhat.com/2192613

Conflicts:
Missing loongarch and context changes from missing commits
- 9b932aadfc47 ("riscv: kexec: Fixup crash_smp_send_stop without multi
cores")
- aabcaf6ae2a0 ("KVM: PPC: Book3S HV P9: Move host OS save/restore
functions to built-in")

commit 4c8c3c7f70a6779d30f5492acbc9978f4636fe7a
Author: Valentin Schneider <vschneid@redhat.com>
Date:   Tue Mar 7 14:35:56 2023 +0000

    treewide: Trace IPIs sent via smp_send_reschedule()

    To be able to trace invocations of smp_send_reschedule(), rename the
    arch-specific definitions of it to arch_smp_send_reschedule() and wrap it
    into an smp_send_reschedule() that contains a tracepoint.

    Changes to include the declaration of the tracepoint were driven by the
    following coccinelle script:

      @func_use@
      @@
      smp_send_reschedule(...);

      @include@
      @@
      #include <trace/events/ipi.h>

      @no_include depends on func_use && !include@
      @@
        #include <...>
      +
      + #include <trace/events/ipi.h>

    [csky bits]
    [riscv bits]
    Signed-off-by: Valentin Schneider <vschneid@redhat.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Guo Ren <guoren@kernel.org>
    Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
    Link: https://lore.kernel.org/r/20230307143558.294354-6-vschneid@redhat.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-09-14 15:36:30 +02:00
Maxim Levitsky 78354ab71a KVM: Grab a reference to KVM for VM and vCPU stats file descriptors
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2225079

commit eed3013faa401aae662398709410a59bb0646e32
Author: Sean Christopherson <seanjc@google.com>
Date:   Tue Jul 11 16:01:25 2023 -0700

    KVM: Grab a reference to KVM for VM and vCPU stats file descriptors

    Grab a reference to KVM prior to installing VM and vCPU stats file
    descriptors to ensure the underlying VM and vCPU objects are not freed
    until the last reference to any and all stats fds are dropped.

    Note, the stats paths manually invoke fd_install() and so don't need to
    grab a reference before creating the file.

    Fixes: ce55c04945 ("KVM: stats: Support binary stats retrieval for a VCPU")
    Fixes: fcfe1baedd ("KVM: stats: Support binary stats retrieval for a VM")
    Reported-by: Zheng Zhang <zheng.zhang@email.ucr.edu>
    Closes: https://lore.kernel.org/all/CAC_GQSr3xzZaeZt85k_RCBd5kfiOve8qXo7a81Cq53LuVQ5r=Q@mail.gmail.com
    Cc: stable@vger.kernel.org
    Cc: Kees Cook <keescook@chromium.org>
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Message-Id: <20230711230131.648752-2-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2023-08-07 17:38:06 +03:00
Maxim Levitsky ae98174d3d KVM: Clean up kvm_vm_ioctl_create_vcpu()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2225079

commit 5f643e460ab1298a32b7d0db104bfcab9d6165c0
Author: Michal Luczaj <mhal@rbox.co>
Date:   Mon Jun 5 13:44:19 2023 +0200

    KVM: Clean up kvm_vm_ioctl_create_vcpu()

    Since c9d601548603 ("KVM: allow KVM_BUG/KVM_BUG_ON to handle 64-bit cond")
    'cond' is internally converted to boolean, so caller's explicit conversion
    from void* is unnecessary.

    Remove the double bang.

    Signed-off-by: Michal Luczaj <mhal@rbox.co>
    Reviewed-by: Yuan Yao <yuan.yao@intel.com>
    base-commit: 76a17bf03a268bc342e08c05d8ddbe607d294eb4
    Link: https://lore.kernel.org/r/20230605114852.288964-1-mhal@rbox.co
    Signed-off-by: Sean Christopherson <seanjc@google.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2023-08-07 17:29:43 +03:00
Maxim Levitsky d86a00354c KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2225079

commit 52882b9c7a761b2b4e44717d6fbd1ed94c601b7f
Author: Alexey Kardashevskiy <aik@ozlabs.ru>
Date:   Wed May 4 17:48:07 2022 +1000

    KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent

    When introduced, IRQFD resampling worked on POWER8 with XICS. However
    KVM on POWER9 has never implemented it - the compatibility mode code
    ("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native
    XIVE mode does not handle INTx in KVM at all.

    This moves the capability advertising to platform code and stops
    advertising it on XIVE, i.e. POWER9 and later.

    Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
    Acked-by: Anup Patel <anup@brainfault.org>
    Acked-by: Nicholas Piggin <npiggin@gmail.com>
    Message-Id: <20220504074807.3616813-1-aik@ozlabs.ru>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2023-08-07 15:14:07 +03:00
Maxim Levitsky b0dcf528b2 KVM: Ensure lockdep knows about kvm->lock vs. vcpu->mutex ordering rule
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2225079

commit 42a90008f890afc41837dfeec1f0b1e7bcecf94a
Author: David Woodhouse <dwmw@amazon.co.uk>
Date:   Wed Jan 11 18:06:50 2023 +0000

    KVM: Ensure lockdep knows about kvm->lock vs. vcpu->mutex ordering rule

    Documentation/virt/kvm/locking.rst tells us that kvm->lock is taken outside
    vcpu->mutex. But that doesn't actually happen very often; it's only in
    some esoteric cases like migration with AMD SEV. This means that lockdep
    usually doesn't notice, and doesn't do its job of keeping us honest.

    Ensure that lockdep *always* knows about the ordering of these two locks,
    by briefly taking vcpu->mutex in kvm_vm_ioctl_create_vcpu() while kvm->lock
    is held.

    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
    Message-Id: <20230111180651.14394-3-dwmw2@infradead.org>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
2023-08-07 15:14:06 +03:00
Eric Auger 98a1192f5f KVM: Protect vcpu->pid dereference via debugfs with RCU
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2203922

Wrap the vcpu->pid dereference in the debugfs hook vcpu_get_pid() with
proper RCU read (un)lock.  Unlike the code in kvm_vcpu_ioctl(),
vcpu_get_pid() is not a simple access; the pid pointer is passed to
pid_nr() and fully dereferenced if the pointer is non-NULL.

Failure to acquire RCU could result in use-after-free of the old pid if
a different task invokes KVM_RUN and puts the last reference to the old
vcpu->pid between vcpu_get_pid() reading the pointer and dereferencing it
in pid_nr().

Fixes: e36de87d34a7 ("KVM: debugfs: expose pid of vcpu threads")
Link: https://lore.kernel.org/r/20230211010719.982919-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
(cherry picked from commit 76021e96d781e1fe8de02ebe52f3eb276716b6b0)
Signed-off-by: Eric Auger <eric.auger@redhat.com>
2023-07-05 09:58:14 -04:00
Eric Auger b231ee098f kvm: kvm_main: Remove unnecessary (void*) conversions
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2203922

Assignments involving void * pointers do not require explicit casts, so
drop the unnecessary (void *) conversions.

Signed-off-by: Li kunyu <kunyu@nfschina.com>
Link: https://lore.kernel.org/r/20221213080236.3969-1-kunyu@nfschina.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
(cherry picked from commit 14aa40a1d05ef27c42fdfacd5a81f5df6a49ee39)
Signed-off-by: Eric Auger <eric.auger@redhat.com>
2023-07-05 09:58:13 -04:00
Eric Auger f26bf20af6 KVM: Fix comments that refer to the non-existent install_new_memslots()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2203922

Fix stale comments that were left behind when install_new_memslots() was
replaced by kvm_swap_active_memslots() as part of the scalable memslots
rework.

Fixes: a54d806688fe ("KVM: Keep memslots in tree-based structures instead of array-based ones")
Signed-off-by: Jun Miao <jun.miao@intel.com>
Link: https://lore.kernel.org/r/20230223052851.1054799-1-jun.miao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
(cherry picked from commit b0d237087c674c43df76c1a0bc2737592f3038f4)
Signed-off-by: Eric Auger <eric.auger@redhat.com>
2023-07-05 09:58:13 -04:00
Eric Auger b540375061 KVM: Fix vcpu_array[0] races
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2203922

In kvm_vm_ioctl_create_vcpu(), add vcpu to vcpu_array iff it's safe to
access vcpu via kvm_get_vcpu() and kvm_for_each_vcpu(), i.e. when there's
no failure path requiring vcpu removal and destruction. Such order is
important because vcpu_array accessors may end up referencing vcpu at
vcpu_array[0] even before online_vcpus is set to 1.

When online_vcpus=0, any call to kvm_get_vcpu() goes through
array_index_nospec() and ends with an attempt to xa_load(vcpu_array, 0):

	int num_vcpus = atomic_read(&kvm->online_vcpus);
	i = array_index_nospec(i, num_vcpus);
	return xa_load(&kvm->vcpu_array, i);

Similarly, when online_vcpus=0, a kvm_for_each_vcpu() does not iterate over
an "empty" range, but actually [0, ULONG_MAX]:

	xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
			  (atomic_read(&kvm->online_vcpus) - 1))

In both cases, this online_vcpus=0 edge case, even though it leads to
unnecessary calls into the XArray API, should not be an issue; requests for
unpopulated indexes/ranges are handled by xa_load() and xa_for_each_range().

However, this means that when the first vCPU is created and inserted in
vcpu_array *and* before online_vcpus is incremented, code calling
kvm_get_vcpu()/kvm_for_each_vcpu() already has access to that first vCPU.

This would not pose a problem if a vcpu, once stored in vcpu_array,
remained there, but that's not the case:
kvm_vm_ioctl_create_vcpu() first inserts into vcpu_array, then requests a
file descriptor. If create_vcpu_fd() fails, the newly inserted vcpu is
removed from vcpu_array, then destroyed:

	vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
	r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
	kvm_get_kvm(kvm);
	r = create_vcpu_fd(vcpu);
	if (r < 0) {
		xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
		kvm_put_kvm_no_destroy(kvm);
		goto unlock_vcpu_destroy;
	}
	atomic_inc(&kvm->online_vcpus);

This results in a possible race condition when a reference to a vcpu is
acquired (via kvm_get_vcpu() or kvm_for_each_vcpu()) moments before said
vcpu is destroyed.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20230510140410.1093987-2-mhal@rbox.co>
Cc: stable@vger.kernel.org
Fixes: c5b077549136 ("KVM: Convert the kvm->vcpus array to a xarray", 2021-12-08)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit afb2acb2e3a32e4d56f7fbd819769b98ed1b7520)
Signed-off-by: Eric Auger <eric.auger@redhat.com>
2023-07-04 11:33:02 -04:00
Eric Auger 8595b993fb KVM: Don't enable hardware after a restart/shutdown is initiated
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2203922

Reject hardware enabling, i.e. VM creation, if a restart/shutdown has
been initiated to avoid re-enabling hardware between kvm_reboot() and
machine_{halt,power_off,restart}().  The restart case is especially
problematic (for x86) as enabling VMX (or clearing GIF in KVM_RUN on
SVM) blocks INIT, which results in the restart/reboot hanging as BIOS
is unable to wake and rendezvous with APs.

Note, this bug, and the original issue that motivated the addition of
kvm_reboot(), is effectively limited to a forced reboot, e.g. `reboot -f`.
In a "normal" reboot, userspace gracefully tears itself down before
triggering the kernel reboot (modulo bugs, errors, etc.), i.e. any process
that might do ioctl(KVM_CREATE_VM) is long gone.

Fixes: 8e1c18157d ("KVM: VMX: Disable VMX when system shutdown")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20230512233127.804012-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit e0ceec221f62deb5d6c32c0327030028d3db5f27)
Signed-off-by: Eric Auger <eric.auger@redhat.com>
2023-07-04 11:32:53 -04:00
Eric Auger 8d2ec08b0e KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2203922

Use syscore_ops.shutdown to disable hardware virtualization during a
reboot instead of using the dedicated reboot_notifier so that KVM disables
virtualization _after_ system_state has been updated.  This will allow
fixing a race in KVM's handling of a forced reboot where KVM can end up
enabling hardware virtualization between kernel_restart_prepare() and
machine_restart().

Rename KVM's hook to match the syscore op to avoid any possible confusion
from wiring up a "reboot" helper to a "shutdown" hook (neither "shutdown"
nor "reboot" is completely accurate, as the hook handles both).

Opportunistically rewrite kvm_shutdown()'s comment to make it less VMX
specific, and to explain why kvm_rebooting exists.

Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Zenghui Yu <yuzenghui@huawei.com>
Cc: kvmarm@lists.linux.dev
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>
Cc: Anup Patel <anup@brainfault.org>
Cc: Atish Patra <atishp@atishpatra.org>
Cc: kvm-riscv@lists.infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20230512233127.804012-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 6735150b69978a9f73e3d1bab719e81a5dfafa83)
Signed-off-by: Eric Auger <eric.auger@redhat.com>
2023-07-04 11:32:45 -04:00
Eric Auger 50146c9cab KVM: Avoid illegal stage2 mapping on invalid memory slot
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2203922
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217329

We run into a guest hang in the edk2 firmware when KSM is running on the
host. The edk2 firmware waits for status 0x80 from QEMU's pflash
device (TYPE_PFLASH_CFI01) during a sector-erase or buffered-write
operation. The status is returned by reading the memory region of
the pflash device, and the read request should have been forwarded to QEMU
and emulated by it. Unfortunately, the read request is covered by an
illegal stage2 mapping when the guest hang issue occurs. The read request
is completed with QEMU bypassed and a wrong status is fetched, and the edk2
firmware runs into an infinite loop on that wrong status.

The illegal stage2 mapping is populated due to same-page sharing by KSM
at (C) even though the associated memory slot has been marked invalid at (B)
when the memory slot is requested to be deleted. It's notable that the
active and inactive memory slots can't be swapped while we're in the middle
of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
is elevated, and kvm_swap_active_memslots() will busy-loop until it drops
back to zero. Besides, swapping from the active to the inactive memory
slots is also prevented by holding &kvm->srcu in __kvm_handle_hva_range(),
corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().

  CPU-A                    CPU-B
  -----                    -----
                           ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
                           kvm_vm_ioctl_set_memory_region
                           kvm_set_memory_region
                           __kvm_set_memory_region
                           kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
                             kvm_invalidate_memslot
                               kvm_copy_memslot
                               kvm_replace_memslot
                               kvm_swap_active_memslots        (A)
                               kvm_arch_flush_shadow_memslot   (B)
  same page sharing by KSM
  kvm_mmu_notifier_invalidate_range_start
        :
  kvm_mmu_notifier_change_pte
    kvm_handle_hva_range
    __kvm_handle_hva_range
    kvm_set_spte_gfn            (C)
        :
  kvm_mmu_notifier_invalidate_range_end

Fix the issue by skipping the invalid memory slot at (C) to avoid the
illegal stage2 mapping so that the read request for the pflash's status
is forwarded to QEMU and emulated by it. In this way, the correct pflash's
status can be returned from QEMU to break the infinite loop in the edk2
firmware.

We tried a git-bisect and the first problematic commit is cd4c718352
("KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With it,
clean_dcache_guest_page() is called after the memory slots are iterated
in kvm_mmu_notifier_change_pte(); before that commit, it was called before
the iteration over the memory slots. This change effectively enlarges the
race window between kvm_mmu_notifier_change_pte() and memory slot removal,
which is how we're able to reproduce the issue in a practical test case.
However, the issue has existed since commit d5d8184d35
("KVM: ARM: Memory virtualization setup").

Cc: stable@vger.kernel.org # v3.9+
Fixes: d5d8184d35 ("KVM: ARM: Memory virtualization setup")
Reported-by: Shuai Hu <hshuai@redhat.com>
Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Message-Id: <20230615054259.14911-1-gshan@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 2230f9e1171a2e9731422a14d1bbc313c0b719d1)
Signed-off-by: Eric Auger <eric.auger@redhat.com>
2023-07-04 09:14:30 -04:00