JIRA: https://issues.redhat.com/browse/RHEL-16745
Add support to MMU caches for initializing a page with a custom 64-bit
value, e.g. to pre-fill an entire page table with non-zero PTE values.
The functionality will be used by x86 to support Intel's TDX, which needs
to set bit 63 in all non-present PTEs in order to prevent !PRESENT page
faults from getting reflected into the guest (Intel's EPT Violation #VE
architecture made the less than brilliant decision of having the per-PTE
behavior be opt-out instead of opt-in).
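As a rough userspace illustration of the idea (not the kernel
implementation), pre-filling a page-table page with a non-zero 64-bit
value could look like the sketch below. The names sketch_fill_page(),
SKETCH_PAGE_SIZE and SKETCH_SUPPRESS_VE are hypothetical; only the use of
bit 63 comes from the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative values only; the real constants live in the KVM MMU code. */
#define SKETCH_PAGE_SIZE     4096u
#define SKETCH_PTES_PER_PAGE (SKETCH_PAGE_SIZE / sizeof(uint64_t))
#define SKETCH_SUPPRESS_VE   (UINT64_C(1) << 63) /* bit 63, per the text */

/* Pre-fill one page-table page so every entry starts as init_value. */
static void sketch_fill_page(uint64_t *page, uint64_t init_value)
{
	for (size_t i = 0; i < SKETCH_PTES_PER_PAGE; i++)
		page[i] = init_value;
}
```

A zero init_value degenerates to the usual zeroed page, so a cache could
keep a single allocation path and only pay for the fill loop when a custom
value is requested.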
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <5919f685f109a1b0ebc6bd8fc4536ee94bcc172d.1705965635.git.isaku.yamahata@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit c23e2b7103090b05e4d567d8976f99926ea855e9)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-16745
The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().
Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.
In the case of .change_pte(), commit 6bdb913f0a ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.
This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Anup Patel <anup@brainfault.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Message-ID: <20240405115815.3226315-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit f3b65bbaed7c43d10989380d4b95e2a3e9fe5a6b)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Conflicts: no loongarch or RISC-V
JIRA: https://issues.redhat.com/browse/RHEL-32430
commit eefb85b3f0310c2f4149c50cb9b13094ed1dde25
Author: Sean Christopherson <seanjc@google.com>
Date: Mon Mar 4 16:37:42 2024 -0800
KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start()
Remove gfn_to_pfn_cache_invalidate_start()'s unused @may_block parameter,
which was leftover from KVM's abandoned (for now) attempt to support guest
usage of gfn_to_pfn caches.
Fixes: a4bff3df5147 ("KVM: pfncache: remove KVM_GUEST_USES_PFN usage")
Reported-by: Like Xu <like.xu.linux@gmail.com>
Cc: Paul Durrant <paul@xen.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20240305003742.245767-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32430
commit 284851ee5caef1b42b513752bf1642ce4570bdc1
Author: Oliver Upton <oliver.upton@linux.dev>
Date: Fri Feb 16 15:59:41 2024 +0000
KVM: Get rid of return value from kvm_arch_create_vm_debugfs()
The general expectation with debugfs is that any initialization failure
is nonfatal. Nevertheless, kvm_arch_create_vm_debugfs() allows
implementations to return an error and kvm_create_vm_debugfs() allows
that to fail VM creation.
Change to a void return to discourage architectures from making debugfs
failures fatal for the VM. Seems like everyone already had the right
idea, as all implementations already return 0 unconditionally.
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20240216155941.2029458-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Conflicts:
- missing commit faf01aef0570757bfbf1d655e984742c1dd38068
KVM: PPC: Merge powerpc's debugfs entry content into generic entry
- out of order backport of 77bcd9e6231a5297ef417a7d7f734d61c2bcceb6
KVM: Add dedicated arch hook for querying if vCPU was preempted in-kernel
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32430
commit ea3689d9df50c283cb5d647a74aa45e2cc3f8064
Author: Arnd Bergmann <arnd@arndb.de>
Date: Mon Feb 12 12:24:10 2024 +0100
KVM: fix kvm_mmu_memory_cache allocation warning
gcc-14 notices that the arguments to kvmalloc_array() are mixed up:
arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function '__kvm_mmu_topup_memory_cache':
arch/x86/kvm/../../../virt/kvm/kvm_main.c:424:53: error: 'kvmalloc_array' sizes specified with 'sizeof' in the earlier argument and not in the later argument [-Werror=calloc-transposed-args]
424 | mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
| ^~~~
arch/x86/kvm/../../../virt/kvm/kvm_main.c:424:53: note: earlier argument should specify number of elements, later size of each element
The code still works correctly, but the incorrect order prevents the compiler
from properly tracking the object sizes.
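For illustration, the same convention applies to calloc() in userspace:
the element count comes first and the element size second. The sketch
below uses calloc() as a stand-in for kvmalloc_array(), and
sketch_alloc_objects() is a hypothetical helper, not KVM code.

```c
#include <stdlib.h>

/* Hypothetical stand-in for allocating the cache's array of objects. */
static void **sketch_alloc_objects(size_t capacity)
{
	/* Element count first, element size second: calloc(nmemb, size). */
	return calloc(capacity, sizeof(void *));
}
```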
Fixes: 837f66c71207 ("KVM: Allow for different capacities in kvm_mmu_memory_cache structs")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240212112419.1186065-1-arnd@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32430
commit dafc17dd529a6194e199b837916062090562ff80
Author: Sean Christopherson <seanjc@google.com>
Date: Tue Jan 9 16:39:38 2024 -0800
KVM: Add a comment explaining the directed yield pending interrupt logic
Add a comment to explain why KVM treats vCPUs with pending interrupts as
in-kernel when a vCPU wants to yield to a vCPU that was preempted while
running in kernel mode.
Link: https://lore.kernel.org/r/20240110003938.490206-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32430
commit 77bcd9e6231a5297ef417a7d7f734d61c2bcceb6
Author: Sean Christopherson <seanjc@google.com>
Date: Tue Jan 9 16:39:35 2024 -0800
KVM: Add dedicated arch hook for querying if vCPU was preempted in-kernel
Plumb in a dedicated hook for querying whether or not a vCPU was preempted
in-kernel. Unlike literally every other architecture, x86's VMX can check
if a vCPU is in kernel context if and only if the vCPU is loaded on the
current pCPU.
x86's kvm_arch_vcpu_in_kernel() works around the limitation by querying
kvm_get_running_vcpu() and redirecting to vcpu->arch.preempted_in_kernel
as needed. But that's unnecessary, confusing, and fragile, e.g. x86 has
had at least one bug where KVM incorrectly used a stale
preempted_in_kernel.
No functional change intended.
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20240110003938.490206-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32430
commit 8886640dade4ae2595fcdce511c8bcc716aa47d3
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Thu Jan 11 03:00:34 2024 -0500
kvm: replace __KVM_HAVE_READONLY_MEM with Kconfig symbol
KVM uses __KVM_HAVE_* symbols in the architecture-dependent uapi/asm/kvm.h to mask
unused definitions in include/uapi/linux/kvm.h. __KVM_HAVE_READONLY_MEM however
was nothing but a misguided attempt to define KVM_CAP_READONLY_MEM only on
architectures where KVM_CHECK_EXTENSION(KVM_CAP_READONLY_MEM) could possibly
return nonzero. This however does not make sense, and it prevented userspace
from supporting this architecture-independent feature without recompilation.
Therefore, these days __KVM_HAVE_READONLY_MEM does not mask anything and
is only used in virt/kvm/kvm_main.c. Userspace does not need to test it
and there should be no need for it to exist. Remove it and replace it
with a Kconfig symbol within Linux source code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Conflicts:
- no support for RISC-V
- no support for loongarch
- out of order backport e563592224e02f87048edee3ce3f0da16cceee88
KVM: Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY commit
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32430
commit d489ec95658392a000dd26fba511eec1900245b0
Author: Sean Christopherson <seanjc@google.com>
Date: Tue Jan 9 16:42:39 2024 -0800
KVM: Harden against unpaired kvm_mmu_notifier_invalidate_range_end() calls
When handling the end of an mmu_notifier invalidation, WARN if
mn_active_invalidate_count is already 0 and do not decrement it further, i.e.
avoid causing mn_active_invalidate_count to underflow/wrap. In the worst
case scenario, effectively corrupting mn_active_invalidate_count could
cause kvm_swap_active_memslots() to hang indefinitely.
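A minimal userspace sketch of the guarded decrement, assuming a plain
unsigned counter; struct sketch_vm and sketch_invalidate_end() are
hypothetical names, and fprintf() merely stands in for the kernel's WARN.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical counter standing in for mn_active_invalidate_count. */
struct sketch_vm {
	unsigned int active_invalidate_count;
};

/* Returns true if the count was decremented, false if end() was unpaired. */
static bool sketch_invalidate_end(struct sketch_vm *vm)
{
	if (vm->active_invalidate_count == 0) {
		/* Stand-in for WARN: complain, but never wrap the counter. */
		fprintf(stderr, "unpaired invalidate end\n");
		return false;
	}
	vm->active_invalidate_count--;
	return true;
}
```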
end() calls are *supposed* to be paired with start(), i.e. underflow can
only happen if there is a bug elsewhere in the kernel, but due to lack of
lockdep assertions in the mmu_notifier helpers, it's all too easy for a
bug to go unnoticed for some time, e.g. see the recently introduced
PAGEMAP_SCAN ioctl().
Ideally, mmu_notifiers would incorporate lockdep assertions, but users of
mmu_notifiers aren't required to hold any one specific lock, i.e. adding
the necessary annotations to make lockdep aware of all locks that are
mutually exclusive with mm_take_all_locks() isn't trivial.
Link: https://lore.kernel.org/all/000000000000f6d051060c6785bc@google.com
Link: https://lore.kernel.org/r/20240110004239.491290-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32430
commit e563592224e02f87048edee3ce3f0da16cceee88
Author: Sean Christopherson <seanjc@google.com>
Date: Thu Feb 22 11:06:08 2024 -0800
KVM: Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY
Disallow creating read-only memslots that support GUEST_MEMFD, as
GUEST_MEMFD is fundamentally incompatible with KVM's semantics for
read-only memslots. Read-only memslots allow the userspace VMM to emulate
option ROMs by filling the backing memory with readable, executable code
and data, while triggering emulated MMIO on writes. GUEST_MEMFD doesn't
currently support writes from userspace and KVM doesn't support emulated
MMIO on private accesses, i.e. the guest can only ever read zeros, and
writes will always be treated as errors.
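One way to express such mutual exclusion is to build the set of valid
flags from the flags already present, as in the sketch below. The flag
values and sketch_memslot_flags_valid() are assumptions made for this
example; consult the KVM uapi header for the real definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values are assumptions for this sketch, not copied from the uapi. */
#define SKETCH_MEM_READONLY    (UINT32_C(1) << 1)
#define SKETCH_MEM_GUEST_MEMFD (UINT32_C(1) << 2)

static bool sketch_memslot_flags_valid(uint32_t flags)
{
	uint32_t valid = SKETCH_MEM_GUEST_MEMFD;

	/* Read-only is only allowed when the slot is not guest_memfd-backed. */
	if (!(flags & SKETCH_MEM_GUEST_MEMFD))
		valid |= SKETCH_MEM_READONLY;

	/* Any bit outside the valid set is rejected. */
	return (flags & ~valid) == 0;
}
```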
Cc: Fuad Tabba <tabba@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: Chao Peng <chao.p.peng@linux.intel.com>
Fixes: a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Link: https://lore.kernel.org/r/20240222190612.2942589-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32430
commit a5d3df8ae13fada772fbce952e9ee7b3433dba16
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Wed Nov 8 10:34:03 2023 +0100
KVM: remove deprecated UAPIs
The deprecated interfaces were removed 15 years ago. KVM's
device assignment was deprecated in 4.2 and removed 6.5 years
ago; the only interest might be in compiling ancient versions
of QEMU, but QEMU has been using its own imported copy of the
kernel headers since June 2011. So again we go into archaeology
territory; just remove the cruft.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32430
commit c5b31cc2371728ddefe9baf1d036aeb630a25d96
Author: Paolo Bonzini <pbonzini@redhat.com>
Date: Wed Oct 18 12:07:32 2023 -0400
KVM: remove CONFIG_HAVE_KVM_IRQFD
All platforms with a kernel irqchip have support for irqfd. Unify the
two configuration items so that userspace can expect to use irqfd to
inject interrupts into the irqchip.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Conflicts:
- no support for RISC-V
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32430
commit 1f829359c8c37f77a340575957686ca8c4bca317
Author: Philipp Stanner <pstanner@redhat.com>
Date: Thu Nov 2 19:15:26 2023 +0100
KVM: Harden copying of userspace-array against overflow
kvm_main.c utilizes vmemdup_user() and array_size() to copy a userspace
array. Currently, this does not check for an overflow.
Use the new wrapper vmemdup_array_user() to copy the array more safely.
Note, KVM explicitly checks the number of entries before duplicating the
array, i.e. adding the overflow check should be a glorified nop.
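The kind of check such a wrapper adds can be sketched in userspace as
follows. sketch_dup_array() is a hypothetical helper that uses malloc()
and memcpy() in place of the kernel's user-copy machinery; it only
illustrates refusing a count * size product that would wrap.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicate n elements, refusing on size overflow. */
static void *sketch_dup_array(const void *src, size_t n, size_t elem_size)
{
	void *dst;

	if (elem_size != 0 && n > SIZE_MAX / elem_size)
		return NULL; /* n * elem_size would wrap around */

	dst = malloc(n * elem_size);
	if (dst)
		memcpy(dst, src, n * elem_size);
	return dst;
}
```

When the caller has already bounded n, as KVM does, the extra branch is
never taken, which is why the change is expected to be a nop in practice.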
Suggested-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Link: https://lore.kernel.org/r/20231102181526.43279-4-pstanner@redhat.com
[sean: call out that KVM pre-checks the number of entries]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32430
commit 63912245c19d3a4179da44beefd017eb9270f207
Author: Wei Wang <wei.w.wang@intel.com>
Date: Wed Mar 15 18:16:06 2023 +0800
KVM: move KVM_CAP_DEVICE_CTRL to the generic check
KVM_CAP_DEVICE_CTRL allows userspace to check if the kvm_device
framework (e.g. KVM_CREATE_DEVICE) is supported by KVM. Move
KVM_CAP_DEVICE_CTRL to the generic check for the two reasons:
1) it already supports arch agnostic usages (i.e. KVM_DEV_TYPE_VFIO).
For example, a userspace VFIO implementation may need to create
KVM_DEV_TYPE_VFIO on x86, riscv, or arm etc. It is simpler to have it
checked at the generic code than at each arch's code.
2) KVM_CREATE_DEVICE has been added to the generic code.
Link: https://lore.kernel.org/all/20221215115207.14784-1-wei.w.wang@intel.com
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: Anup Patel <anup@brainfault.org> (riscv)
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Link: https://lore.kernel.org/r/20230315101606.10636-1-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Conflicts:
no support for RISC-V
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32430
commit b1a39a718db44ecb18c2a99a11e15f6eedc14c53
Author: Marc Zyngier <maz@kernel.org>
Date: Thu Dec 7 15:12:01 2023 +0000
KVM: Convert comment into an assertion in kvm_io_bus_register_dev()
Instead of having a comment indicating the need to hold slots_lock
when calling kvm_io_bus_register_dev(), make it explicit with
a lockdep assertion.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231207151201.3028710-6-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32430
commit ea61294befd361ab8260c65d53987b400e5599a7
Author: Sean Christopherson <seanjc@google.com>
Date: Wed Oct 18 13:46:24 2023 -0700
Revert "KVM: Prevent module exit until all VMs are freed"
Revert KVM's misguided attempt to "fix" a use-after-module-unload bug that
was actually due to failure to flush a workqueue, not a lack of module
refcounting. Pinning the KVM module until kvm_vm_destroy() doesn't
prevent use-after-free due to the module being unloaded, as userspace can
invoke delete_module() the instant the last reference to KVM is put, i.e.
can cause all KVM code to be unmapped while KVM is actively executing said
code.
Generally speaking, the many instances of module_put(THIS_MODULE)
notwithstanding, outside of a few special paths, a module can never safely
put the last reference to itself without creating deadlock, i.e. something
external to the module *must* put the last reference. In other words,
having VMs grab a reference to the KVM module is futile, pointless, and as
evidenced by the now-reverted commit 70375c2d8fa3 ("Revert "KVM: set owner
of cpu and vm file operations""), actively dangerous.
This reverts commit 405294f29faee5de8c10cb9d4a90e229c2835279 and commit
5f6de5cbebee925a612856fce6f9182bb3eee0db.
Fixes: 405294f29fae ("KVM: Unconditionally get a ref to /dev/kvm module when creating a VM")
Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed")
Link: https://lore.kernel.org/r/20231018204624.1905300-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-32430
commit 087e15206d6ac0d46734e2b0ab34370c0fdca481
Author: Sean Christopherson <seanjc@google.com>
Date: Wed Oct 18 13:46:22 2023 -0700
KVM: Set file_operations.owner appropriately for all such structures
Set .owner for all KVM-owned file types so that the KVM module is pinned
until any files with callbacks back into KVM are completely freed. Using
"struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive
while there are active VMs, doesn't provide full protection.
Userspace can invoke delete_module() the instant the last reference to KVM
is put. If KVM itself puts the last reference, e.g. via kvm_destroy_vm(),
then it's possible for KVM to be preempted and deleted/unloaded before KVM
fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back
in, it will jump to a code page that is no longer mapped.
Note, file types that can call into sub-module code, e.g. kvm-intel.ko or
kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not
THIS_MODULE (which points at kvm.ko). KVM assumes that if /dev/kvm is
reachable, e.g. VMs are active, then the vendor module is loaded.
To reduce the probability of forgetting to set .owner entirely, use
THIS_MODULE for stats files where KVM does not call back into vendor code.
This reverts commit 70375c2d8fa3fb9b0b59207a9c5df1e2e1205c10, and fixes
several other file types that have been buggy since their introduction.
Fixes: 70375c2d8fa3 ("Revert "KVM: set owner of cpu and vm file operations"")
Fixes: 3bcd0662d66f ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV
Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git
Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs. Confidential VMs are
fundamentally incompatible with emulating SMM, which as the name suggests
requires being able to read and write guest memory and register state.
Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-23-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit eed52e434bc33603ddb0af62b6c4ef818948489d)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allowed guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mapping sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to its demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrasing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit a7800aa80ea4d5356b8474c2302812e9d4926fa6)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.
Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.
Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.
Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.
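The bit layout described above can be sketched as follows. The macro and
function names are hypothetical; only the choice of bit 3 for PRIVATE and
the reservation of bits 0-2 for possible RWX use come from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits 0-2 are only reserved for possible future RWX use; bit 3 is PRIVATE. */
#define SKETCH_ATTR_RWX_MASK UINT64_C(0x7)
#define SKETCH_ATTR_PRIVATE  (UINT64_C(1) << 3)

static bool sketch_is_private(uint64_t attrs)
{
	return (attrs & SKETCH_ATTR_PRIVATE) != 0;
}

/* In this sketch, PRIVATE is the only attribute currently accepted. */
static bool sketch_attrs_supported(uint64_t attrs)
{
	return (attrs & ~SKETCH_ATTR_PRIVATE) == 0;
}
```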
Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.
To simplify the implementation, wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation. For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.
It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 5a475554db1e476a14216e742ea2bdb77362d5d5)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git
Drop the .on_unlock() mmu_notifier hook now that it's no longer used for
notifying arch code that memory has been reclaimed. Adding .on_unlock()
and invoking it *after* dropping mmu_lock was a terrible idea, as doing so
resulted in .on_lock() and .on_unlock() having divergent and asymmetric
behavior, and set future developers up for failure, i.e. all but asked for
bugs where KVM relied on using .on_unlock() to try to run a callback while
holding mmu_lock.
Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to
guard against future bugs of this nature.
Reported-by: Isaku Yamahata <isaku.yamahata@intel.com>
Link: https://lore.kernel.org/all/20230802203119.GB2021422@ls.amr.corp.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 193bbfaacc84f9ee9c281ec0a8dd2ec8e4821e57)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git
Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
__kvm_handle_hva_range() return whether or not an overlapping memslot
was found, i.e. mmu_lock was acquired. Using the .on_unlock() hook
works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping
mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.
Use a small struct to return the tuple of the notifier-specific return,
plus whether or not overlap was found. Because the iteration helpers are
__always_inlined, practically speaking, the struct will never actually be
returned from a function call (not to mention the size of the struct will
be two bytes in practice).
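A minimal sketch of the "small struct as return tuple" pattern, assuming
half-open [start, end) ranges and a single slot for brevity. The struct
sketch_mn_ret, its field names and sketch_handle_range() are hypothetical
and only illustrate the shape of the return value.

```c
#include <stdbool.h>

/* Two flags returned together: the handler's result and overlap status. */
struct sketch_mn_ret {
	bool ret;
	bool found_memslot;
};

/*
 * Checks one slot [slot_start, slot_end) against [start, end) and reports
 * handler_result only if the ranges overlap.
 */
static inline struct sketch_mn_ret
sketch_handle_range(unsigned long start, unsigned long end,
		    unsigned long slot_start, unsigned long slot_end,
		    bool handler_result)
{
	struct sketch_mn_ret r = { .ret = false, .found_memslot = false };

	if (start < slot_end && slot_start < end) {
		r.found_memslot = true;
		r.ret = handler_result;
	}
	return r;
}
```

The caller can then run any "memory was reclaimed" follow-up only when
found_memslot is set, after the lock has been dropped.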
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit cec29eef0a815386d520d61c2cbe16d537931639)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git
Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
information can be supplied without setting userspace up to fail. The
padding in the new kvm_userspace_memory_region2 structure will be used to
pass a file descriptor in addition to the userspace_addr, i.e. allow
userspace to point at a file descriptor and map memory into a guest that
is NOT mapped into host userspace.
Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
without a new ioctl(), but as Paolo pointed out, adding a new ioctl()
makes detection of bad flags a bit more robust, e.g. if the new fd field
is guarded only by a flag and not a new ioctl(), then a userspace bug
(setting a "bad" flag) would generate out-of-bounds access instead of an
-EINVAL error.
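
The robustness argument can be illustrated with a small userspace model. The flag values, struct layouts, and function names below are assumptions made up for the sketch and do not reflect the real UAPI definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values, for illustration only. */
#define FLAG_LOG_DIRTY (1u << 0)
#define FLAG_READONLY  (1u << 1)
#define FLAG_HAS_FD    (1u << 2)	/* only meaningful for the v2 layout */

struct region_v1 { uint32_t slot, flags; uint64_t gpa, size, uaddr; };
struct region_v2 { uint32_t slot, flags; uint64_t gpa, size, uaddr;
		   uint64_t fd_offset; uint32_t fd; };

#define V1_VALID_FLAGS (FLAG_LOG_DIRTY | FLAG_READONLY)
#define V2_VALID_FLAGS (V1_VALID_FLAGS | FLAG_HAS_FD)

/* The v1 entry point never reads past the v1 struct; unknown flags fail. */
int set_region_v1(const struct region_v1 *r)
{
	assert(r != NULL);
	if (r->flags & ~V1_VALID_FLAGS)
		return -EINVAL;
	return 0;
}

/* The v2 entry point is the only one allowed to interpret the fd fields. */
int set_region_v2(const struct region_v2 *r)
{
	assert(r != NULL);
	if (r->flags & ~V2_VALID_FLAGS)
		return -EINVAL;
	return 0;
}
```

With separate entry points, a caller that sets the fd flag on the old structure gets a clean error, because the old handler has no reason to look at bytes beyond the smaller layout.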
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-9-seanjc@google.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit bb58b90b1a8f753b582055adaf448214a8e22c31)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[RHEL: context differences due to KVM/ARM 6.6 not having been backported yet]
JIRA: https://issues.redhat.com/browse/RHEL-14702
Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git
Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
appropriate to effectively maintain existing behavior. Using a proper
Kconfig will simplify building more functionality on top of KVM's
mmu_notifier infrastructure.
Add a forward declaration of kvm_gfn_range to kvm_types.h so that
including arch/powerpc/include/asm/kvm_ppc.h with CONFIG_KVM=n doesn't
generate warnings due to kvm_gfn_range being undeclared. PPC defines
hooks for PR vs. HV without guarding them via #ifdeffery, e.g.
	bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
	bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
	bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
	bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
Alternatively, PPC could forward declare kvm_gfn_range, but there's no
good reason not to define it in common KVM.
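
A minimal standalone sketch of why the forward declaration is enough: a table of function pointers only needs the struct tag to be visible, not its members. struct model_ops and model_unmap() are hypothetical names, and the member layout of kvm_gfn_range shown here is invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct kvm;		/* opaque in this sketch */
struct kvm_gfn_range;	/* forward declaration: members unknown at this point */

/* Hook table modelled on the PPC example; compiles with incomplete types. */
struct model_ops {
	bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
	bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
};

/* The full definition can live elsewhere (members invented for the sketch). */
struct kvm_gfn_range { unsigned long start, end; };

bool model_unmap(struct kvm *kvm, struct kvm_gfn_range *range)
{
	(void)kvm;
	return range->end > range->start;
}
```

Without the forward declaration, the first use of the tag inside a parameter list would declare a new type scoped to that prototype, which is what compilers warn about.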
Acked-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit f128cf8cfbecccf95e891ae90d9c917df5117c7a)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[RHEL: no loongarch/riscv, conflict differences for MIPS]
JIRA: https://issues.redhat.com/browse/RHEL-14702
Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git
Add an assertion that there are no in-progress MMU invalidations when a
VM is being destroyed, with the exception of the scenario where KVM
unregisters its MMU notifier between an .invalidate_range_start() call and
the corresponding .invalidate_range_end().
KVM can't detect unpaired calls from the mmu_notifier due to the above
exception waiver, but the assertion can detect KVM bugs, e.g. the bug
that *almost* escaped initial guest_memfd development.
Link: https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5389@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit d497a0fab8b8457214fcc9b1a39530920ea7e95e)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git
Currently, the mmu_notifier invalidate path records an hva range, which
mmu_invalidate_retry_hva() then checks in the page fault handling path.
However, for the soon-to-be-introduced private memory, a page fault may
not have an associated hva, so checking the gfn (gpa) makes more sense.
For existing hva-based shared memory, gfn is expected to work as well. The
only downside is that when multiple gfns alias a single hva, the current
algorithm of checking multiple ranges could result in a much larger range
being rejected. Such aliasing should be uncommon, so the impact is
expected to be small.
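
The retry check and the range-widening downside can be modelled in userspace roughly as follows. This is a simplified sketch with no locking or memory barriers; struct vm_model and the function names are assumptions for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

struct vm_model {
	unsigned long seq;		/* bumped when an invalidation completes */
	long in_progress;		/* invalidations currently in flight */
	gfn_t range_start, range_end;	/* [start, end) covering all of them */
};

void invalidate_begin(struct vm_model *vm, gfn_t start, gfn_t end)
{
	if (vm->in_progress++ == 0) {
		vm->range_start = start;
		vm->range_end = end;
	} else {
		/* One min/max pair has to cover every in-flight range. */
		if (start < vm->range_start)
			vm->range_start = start;
		if (end > vm->range_end)
			vm->range_end = end;
	}
}

void invalidate_end(struct vm_model *vm)
{
	assert(vm->in_progress > 0);
	vm->seq++;
	vm->in_progress--;
}

/* Should a fault on 'gfn' that snapshotted 'snap_seq' be retried? */
bool retry_gfn(const struct vm_model *vm, unsigned long snap_seq, gfn_t gfn)
{
	if (vm->in_progress && gfn >= vm->range_start && gfn < vm->range_end)
		return true;
	return vm->seq != snap_seq;
}
```

With two disjoint invalidations in flight, every gfn between them is also rejected, which is the "much larger range" cost mentioned above.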
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
[sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
Message-Id: <20231027182217.3615211-4-seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 8569992d64b8f750e34b7858eac5d7daaf0f80fd)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git
Move the assertion on the in-progress invalidation count from the primary
MMU's notifier path to KVM's common notification path, i.e. assert that
the count doesn't go negative even when the invalidation is coming from
KVM itself.
Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only
the affected VM, not the entire kernel. A corrupted count is fatal to the
VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry()
to block any and all attempts to install new mappings. But it's far from
guaranteed that an end() without a start() is fatal or even problematic to
anything other than the target VM, e.g. the underlying bug could simply be
a duplicate call to end(). And it's much more likely that a missed
invalidation, i.e. a potential use-after-free, would manifest as no
notification whatsoever, not an end() without a start().
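
A toy model of the "kill only the affected VM" behaviour, assuming a per-VM bugged flag in the spirit of KVM_BUG_ON(). The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vm_state {
	long in_progress;	/* in-flight invalidation count */
	bool vm_bugged;		/* set once this VM is considered dead */
};

void model_invalidate_end(struct vm_state *vm)
{
	assert(vm != NULL);
	vm->in_progress--;
	/* An end() without a start() marks only this VM as bugged. */
	if (vm->in_progress < 0)
		vm->vm_bugged = true;
}

/* New mappings are allowed only when nothing is in flight. */
bool can_install_mapping(const struct vm_state *vm)
{
	return !vm->vm_bugged && vm->in_progress == 0;
}
```

An unpaired end() on one VM leaves a second VM's state untouched, which is the argument for not bringing down the whole kernel.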
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-3-seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit c0db19232c1ed6bd7fcb825c28b014c52732c19e)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
Upstream-status: https://git.kernel.org/pub/scm/virt/kvm/kvm.git
Rework and rename "struct kvm_hva_range" into "kvm_mmu_notifier_range" so
that the structure can be used to handle notifications that operate on gfn
context, i.e. that aren't tied to a host virtual address. Rename the
handler typedef too (arguably it should always have been gfn_handler_t).
Practically speaking, this is a nop for 64-bit kernels as the only
meaningful change is to store start+end as u64s instead of unsigned longs.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-2-seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit e97b39c5c4362dc1cbc37a563ddac313b96c84f3)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
Wrap kvm_{gfn,hva}_range.pte in a union so that future notifier events can
pass event specific information up and down the stack without needing to
constantly expand and churn the APIs. Lockless aging of SPTEs will pass
around a bitmap, and support for memory attributes will pass around the
new attributes for the range.
Add a "KVM_NO_ARG" placeholder to simplify handling events without an
argument (creating a dummy union variable is mildly annoying).
Opportunistically drop explicit zero-initialization of the "pte" field, as
omitting the field (now a union) has the same effect.
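
The pattern can be sketched in plain C as below. The type and constant names (model_pte_t, union notifier_arg, MODEL_NO_ARG) are invented for the example; only the shape of the idea comes from the description above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t val; } model_pte_t;

union notifier_arg {
	model_pte_t pte;
	/* future events could add members here, e.g. attributes or a bitmap */
};

struct notifier_range {
	unsigned long start, end;
	union notifier_arg arg;
};

/* Placeholder for events without an argument; static storage is zeroed. */
static const union notifier_arg MODEL_NO_ARG;

struct notifier_range make_range(unsigned long start, unsigned long end,
				 union notifier_arg arg)
{
	struct notifier_range r = { .start = start, .end = end, .arg = arg };

	assert(start <= end);
	return r;
}
```

A designated initializer that leaves out .arg entirely yields the same zeroed first member, which is why the explicit zero-initialization could be dropped.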
Cc: Yu Zhao <yuzhao@google.com>
Link: https://lore.kernel.org/all/CAOUHufagkd2Jk3_HrVoFFptRXM=hX2CV8f+M-dka-hJU4bP8kw@mail.gmail.com
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Yu Zhao <yuzhao@google.com>
Link: https://lore.kernel.org/r/20230729004144.1054885-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
(cherry picked from commit 3e1efe2b67d3d38116ec010968dbcd89d29e4561)
[RHEL-only: no RISC-V]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
Move kvm_arch_flush_remote_tlbs_memslot() to common code and drop
"arch_" from the name. kvm_arch_flush_remote_tlbs_memslot() is just a
range-based TLB invalidation where the range is defined by the memslot.
Now that kvm_flush_remote_tlbs_range() can be called from common code we
can just use that and drop a bunch of duplicate code from the arch
directories.
Note this adds a lockdep assertion for slots_lock being held when
calling kvm_flush_remote_tlbs_memslot(), which was previously only
asserted on x86. MIPS has calls to kvm_flush_remote_tlbs_memslot(),
but they all hold the slots_lock, so the lockdep assertion continues to
hold true.
Also drop the CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT ifdef gating
kvm_flush_remote_tlbs_memslot(), since it is no longer necessary.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Acked-by: Anup Patel <anup@brainfault.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-7-rananta@google.com
(cherry picked from commit 619b5072443c05cf18c31b2c0320cdb42396d411)
[RHEL-only: no RISC-V]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
Make kvm_flush_remote_tlbs_range() visible in common code and create a
default implementation that just invalidates the whole TLB.
This paves the way for several future features/cleanups:
- Introduction of range-based TLBI on ARM.
- Eliminating kvm_arch_flush_remote_tlbs_memslot()
- Moving the KVM/x86 TDP MMU to common code.
No functional change intended.
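
A userspace sketch of the fallback shape, under the assumption that the arch hook reports "not supported" with a non-zero return. The struct, the counters, and the choice of EOPNOTSUPP are illustrative only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct vm_tlb_model {
	int has_range_flush;	/* does the "arch" implement ranged flushes? */
	int full_flushes;
	int range_flushes;
};

/* Stand-in for an arch hook: non-zero means "no ranged invalidation". */
int arch_flush_range(struct vm_tlb_model *vm, uint64_t gfn, uint64_t nr_pages)
{
	(void)gfn;
	(void)nr_pages;
	if (!vm->has_range_flush)
		return -EOPNOTSUPP;
	vm->range_flushes++;
	return 0;
}

void flush_all(struct vm_tlb_model *vm)
{
	vm->full_flushes++;
}

/* Common-code entry point: try the ranged flush, else flush everything. */
void flush_range(struct vm_tlb_model *vm, uint64_t gfn, uint64_t nr_pages)
{
	assert(vm != NULL);
	if (!arch_flush_range(vm, gfn, nr_pages))
		return;
	flush_all(vm);
}
```

Callers in common code can then always ask for a range, and architectures without ranged invalidation simply pay for a full flush.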
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-6-rananta@google.com
(cherry picked from commit d4788996051e3c07fadc6d9b214073fcf78810a8)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
kvm_arch_flush_remote_tlbs() and CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL
are two mechanisms that solve the same problem: allowing
architecture-specific code to provide a non-IPI implementation of
remote TLB flushing.
Dropping CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL allows KVM to standardize
all architectures on kvm_arch_flush_remote_tlbs() instead of
maintaining two mechanisms.
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-5-rananta@google.com
(cherry picked from commit eddd21481011008792f4e647a5244f6e15970abc)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
Rename kvm_arch_flush_remote_tlb() and the associated macro
__KVM_HAVE_ARCH_FLUSH_REMOTE_TLB to kvm_arch_flush_remote_tlbs() and
__KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS respectively.
Making the name plural matches kvm_flush_remote_tlbs() and makes it more
clear that this function can affect more than one remote TLB.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-2-rananta@google.com
(cherry picked from commit a1342c8027288e345cc5fd16c6800f9d4eb788ed)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
Current usage of kvm_io_device requires users to destruct it with an extra
call to kvm_iodevice_destructor() after the device is unregistered from
kvm_io_bus. This is unnecessary and can cause errors if a user forgets
to make the extra call.
Simplify the usage by folding kvm_iodevice_destructor() into
kvm_io_bus_unregister_dev(). This reduces LOC a bit for users and avoids
leaks caused by forgetting to destruct the device explicitly.
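
The resulting calling convention can be modelled like this. It is a simplified sketch with a fixed-size array and no locking; struct io_bus, bus_register(), and bus_unregister() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct io_dev;
struct io_dev_ops { void (*destructor)(struct io_dev *dev); };
struct io_dev { const struct io_dev_ops *ops; int destroyed; };

#define BUS_MAX 8
struct io_bus { struct io_dev *devs[BUS_MAX]; size_t count; };

int bus_register(struct io_bus *bus, struct io_dev *dev)
{
	if (bus->count == BUS_MAX)
		return -1;
	bus->devs[bus->count++] = dev;
	return 0;
}

/* Unregistering also destructs, so callers cannot forget the second step. */
int bus_unregister(struct io_bus *bus, struct io_dev *dev)
{
	assert(bus != NULL && dev != NULL);
	for (size_t i = 0; i < bus->count; i++) {
		if (bus->devs[i] != dev)
			continue;
		bus->devs[i] = bus->devs[--bus->count];
		if (dev->ops && dev->ops->destructor)
			dev->ops->destructor(dev);
		return 0;
	}
	return -1;	/* not on this bus: nothing to destruct */
}

void mark_destroyed(struct io_dev *dev) { dev->destroyed = 1; }
```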
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230207123713.3905-2-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
(cherry picked from commit 5ea5ca3c2b4bf4090232e18cfc515dcb52f914a6)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
Now that KVM honors past and in-progress mmu_notifier invalidations when
reloading the APIC-access page, use KVM's "standard" invalidation hooks
to trigger a reload and delete the one-off usage of invalidate_range().
Aside from eliminating one-off code in KVM, dropping KVM's use of
invalidate_range() will allow common mmu_notifier to redefine the API to
be more strictly focused on invalidating secondary TLBs that share the
primary MMU's page tables.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230602011518.787006-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
(cherry picked from commit 0a8a5f2c8c266e9d94fb45f76a26cff135d0051c)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
JIRA: https://issues.redhat.com/browse/RHEL-14702
KVM functions use "long" return values for functions that are wired up
to "struct file_operations", but otherwise use "int" return values for
functions that can return 0/-errno in order to avoid unintentional
divergences between 32-bit and 64-bit kernels.
Some code still uses "long" in unnecessary spots, though, which can
cause a little bit of confusion and unnecessary size casts. Let's
change these spots to use "int" types, too.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20230208140105.655814-6-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit f15ba52bfabc3bc130053bd73d414d859162de91)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/3111
JIRA: https://issues.redhat.com/browse/RHEL-1760
Upstream Status: v6.5
Tested: Passed on virt-install, kvm/selftests, kvm-unit-tests, migration, pci-passthrough
This is the first round of rebasing kvm-arm to upstream v6.5, which picks up only the independent commits. This merge request contains several parts:
1. Eager page splitting
2. Support writable cpuid reg from userspace
3. misc stuff
And the omitted things in the first round include:
1. FF-A proxy for pKVM
2. Support for Armv8.8 memcpy instructions in userspace
3. Permission Indirection Extension
4. Allow using VHE in the nVHE hypervisor
5. Some Rename Register per auto-gen tools
6. Fix setting SVE and SME traps in hVHE
Signed-off-by: Shaoqin Huang <shahuang@redhat.com>
Approved-by: Sebastian Ott <sebott@redhat.com>
Approved-by: Cornelia Huck <cohuck@redhat.com>
Approved-by: Gavin Shan <gshan@redhat.com>
Approved-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Scott Weaver <scweaver@redhat.com>
Bugzilla: https://bugzilla.redhat.com/2192613
Conflicts:
Missing loongarch and context changes from missing commits
- 9b932aadfc47 ("riscv: kexec: Fixup crash_smp_send_stop without multi
cores")
- aabcaf6ae2a0 ("KVM: PPC: Book3S HV P9: Move host OS save/restore
functions to built-in")
commit 4c8c3c7f70a6779d30f5492acbc9978f4636fe7a
Author: Valentin Schneider <vschneid@redhat.com>
Date: Tue Mar 7 14:35:56 2023 +0000
treewide: Trace IPIs sent via smp_send_reschedule()
To be able to trace invocations of smp_send_reschedule(), rename the
arch-specific definitions of it to arch_smp_send_reschedule() and wrap it
into an smp_send_reschedule() that contains a tracepoint.
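
In userspace terms, the rename-and-wrap pattern looks roughly like the sketch below. The counters and model_trace_ipi() are stand-ins for the real tracepoint, and the generic wrapper is written as a plain function here, which is an assumption of the sketch.

```c
#include <assert.h>

int trace_hits;		/* stand-in for a tracepoint firing */
int arch_ipis_sent;	/* stand-in for the arch actually sending the IPI */

/* What each architecture previously called smp_send_reschedule(). */
void arch_smp_send_reschedule(int cpu)
{
	assert(cpu >= 0);
	arch_ipis_sent++;
}

void model_trace_ipi(int cpu)
{
	(void)cpu;
	trace_hits++;
}

/* Generic wrapper: every caller now goes through the trace hook first. */
void smp_send_reschedule(int cpu)
{
	model_trace_ipi(cpu);
	arch_smp_send_reschedule(cpu);
}
```

Existing callers keep using the old name, so only the arch definitions and the include lists need to change.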
Changes to include the declaration of the tracepoint were driven by the
following coccinelle script:
@func_use@
@@
smp_send_reschedule(...);

@include@
@@
#include <trace/events/ipi.h>

@no_include depends on func_use && !include@
@@
  #include <...>
+
+ #include <trace/events/ipi.h>
[csky bits]
[riscv bits]
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/20230307143558.294354-6-vschneid@redhat.com
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2225079
commit eed3013faa401aae662398709410a59bb0646e32
Author: Sean Christopherson <seanjc@google.com>
Date: Tue Jul 11 16:01:25 2023 -0700
KVM: Grab a reference to KVM for VM and vCPU stats file descriptors
Grab a reference to KVM prior to installing VM and vCPU stats file
descriptors to ensure the underlying VM and vCPU objects are not freed
until the last reference to any and all stats fds are dropped.
Note, the stats paths manually invoke fd_install() and so don't need to
grab a reference before creating the file.
Fixes: ce55c04945 ("KVM: stats: Support binary stats retrieval for a VCPU")
Fixes: fcfe1baedd ("KVM: stats: Support binary stats retrieval for a VM")
Reported-by: Zheng Zhang <zheng.zhang@email.ucr.edu>
Closes: https://lore.kernel.org/all/CAC_GQSr3xzZaeZt85k_RCBd5kfiOve8qXo7a81Cq53LuVQ5r=Q@mail.gmail.com
Cc: stable@vger.kernel.org
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Message-Id: <20230711230131.648752-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2225079
commit 5f643e460ab1298a32b7d0db104bfcab9d6165c0
Author: Michal Luczaj <mhal@rbox.co>
Date: Mon Jun 5 13:44:19 2023 +0200
KVM: Clean up kvm_vm_ioctl_create_vcpu()
Since c9d601548603 ("KVM: allow KVM_BUG/KVM_BUG_ON to handle 64-bit cond")
'cond' is internally converted to boolean, so caller's explicit conversion
from void* is unnecessary.
Remove the double bang.
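
A small sketch of why the caller-side "!!" becomes redundant once the macro normalises its condition. MODEL_BUG_ON() and model_bug_on() are hypothetical names, not the kernel macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vm_flag_model { bool bugged; };

bool model_bug_on(bool cond, struct vm_flag_model *vm)
{
	assert(vm != NULL);
	if (cond)
		vm->bugged = true;
	return cond;
}

/* The macro converts 'cond' itself, so pointers and 64-bit values work. */
#define MODEL_BUG_ON(cond, vm) model_bug_on(!!(cond), (vm))
```

For example, MODEL_BUG_ON(ptr, &vm) and MODEL_BUG_ON(0x100000000ULL, &vm) both evaluate as true without any conversion at the call site, whereas a macro that narrowed the condition to a 32-bit int would lose the second case.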
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
base-commit: 76a17bf03a268bc342e08c05d8ddbe607d294eb4
Link: https://lore.kernel.org/r/20230605114852.288964-1-mhal@rbox.co
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2225079
commit 52882b9c7a761b2b4e44717d6fbd1ed94c601b7f
Author: Alexey Kardashevskiy <aik@ozlabs.ru>
Date: Wed May 4 17:48:07 2022 +1000
KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent
When introduced, IRQFD resampling worked on POWER8 with XICS. However
KVM on POWER9 has never implemented it - the compatibility mode code
("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native
XIVE mode does not handle INTx in KVM at all.
This moved the capability support advertising to platforms and stops
advertising it on XIVE, i.e. POWER9 and later.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Anup Patel <anup@brainfault.org>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20220504074807.3616813-1-aik@ozlabs.ru>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2225079
commit 42a90008f890afc41837dfeec1f0b1e7bcecf94a
Author: David Woodhouse <dwmw@amazon.co.uk>
Date: Wed Jan 11 18:06:50 2023 +0000
KVM: Ensure lockdep knows about kvm->lock vs. vcpu->mutex ordering rule
Documentation/virt/kvm/locking.rst tells us that kvm->lock is taken outside
vcpu->mutex. But that doesn't actually happen very often; it's only in
some esoteric cases like migration with AMD SEV. This means that lockdep
usually doesn't notice, and doesn't do its job of keeping us honest.
Ensure that lockdep *always* knows about the ordering of these two locks,
by briefly taking vcpu->mutex in kvm_vm_ioctl_create_vcpu() while kvm->lock
is held.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20230111180651.14394-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2203922
Wrap the vcpu->pid dereference in the debugfs hook vcpu_get_pid() with
proper RCU read (un)lock. Unlike the code in kvm_vcpu_ioctl(),
vcpu_get_pid() is not a simple access; the pid pointer is passed to
pid_nr() and fully dereferenced if the pointer is non-NULL.
Failure to acquire RCU could result in use-after-free of the old pid if
a different task invokes KVM_RUN and puts the last reference to the old
vcpu->pid between vcpu_get_pid() reading the pointer and dereferencing it
in pid_nr().
Fixes: e36de87d34a7 ("KVM: debugfs: expose pid of vcpu threads")
Link: https://lore.kernel.org/r/20230211010719.982919-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
(cherry picked from commit 76021e96d781e1fe8de02ebe52f3eb276716b6b0)
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2203922
Fix stale comments that were left behind when install_new_memslots() was
replaced by kvm_swap_active_memslots() as part of the scalable memslots
rework.
Fixes: a54d806688fe ("KVM: Keep memslots in tree-based structures instead of array-based ones")
Signed-off-by: Jun Miao <jun.miao@intel.com>
Link: https://lore.kernel.org/r/20230223052851.1054799-1-jun.miao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
(cherry picked from commit b0d237087c674c43df76c1a0bc2737592f3038f4)
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2203922
In kvm_vm_ioctl_create_vcpu(), add vcpu to vcpu_array iff it's safe to
access vcpu via kvm_get_vcpu() and kvm_for_each_vcpu(), i.e. when there's
no failure path requiring vcpu removal and destruction. Such order is
important because vcpu_array accessors may end up referencing vcpu at
vcpu_array[0] even before online_vcpus is set to 1.
When online_vcpus=0, any call to kvm_get_vcpu() goes through
array_index_nospec() and ends with an attempt to xa_load(vcpu_array, 0):
	int num_vcpus = atomic_read(&kvm->online_vcpus);
	i = array_index_nospec(i, num_vcpus);
	return xa_load(&kvm->vcpu_array, i);
Similarly, when online_vcpus=0, a kvm_for_each_vcpu() does not iterate over
an "empty" range, but actually [0, ULONG_MAX]:
	xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
			  (atomic_read(&kvm->online_vcpus) - 1))
In both cases, the online_vcpus=0 edge case, even though it leads to
unnecessary calls to the XArray API, should not be an issue; requesting
unpopulated indexes/ranges is handled by xa_load() and xa_for_each_range().
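
The two edge cases reduce to simple integer arithmetic, shown here as a functional model. model_index_nospec() only mimics the clamping result of array_index_nospec() and has none of its speculation-safety properties; both function names are invented for the sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Functional model only: out-of-bounds indexes (including size 0) become 0. */
unsigned long model_index_nospec(unsigned long index, unsigned long size)
{
	return index < size ? index : 0;
}

/* Upper bound used by the iteration: (online_vcpus - 1) as an index. */
unsigned long model_iter_last(int online_vcpus)
{
	assert(online_vcpus >= 0);
	/* With zero vCPUs, -1 converts to the largest unsigned long. */
	return (unsigned long)(online_vcpus - 1);
}
```

So with online_vcpus=0, a lookup lands on index 0 and an iteration spans [0, ULONG_MAX], which is how a freshly inserted first vCPU becomes visible early.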
However, this means that when the first vCPU is created and inserted in
vcpu_array *and* before online_vcpus is incremented, code calling
kvm_get_vcpu()/kvm_for_each_vcpu() already has access to that first vCPU.
This should not pose a problem assuming that once a vcpu is stored in
vcpu_array, it will remain there, but that's not the case:
kvm_vm_ioctl_create_vcpu() first inserts to vcpu_array, then requests a
file descriptor. If create_vcpu_fd() fails, newly inserted vcpu is removed
from the vcpu_array, then destroyed:
	vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
	r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
	kvm_get_kvm(kvm);
	r = create_vcpu_fd(vcpu);
	if (r < 0) {
		xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
		kvm_put_kvm_no_destroy(kvm);
		goto unlock_vcpu_destroy;
	}
	atomic_inc(&kvm->online_vcpus);
This results in a possible race condition when a reference to a vcpu is
acquired (via kvm_get_vcpu() or kvm_for_each_vcpu()) moments before said
vcpu is destroyed.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20230510140410.1093987-2-mhal@rbox.co>
Cc: stable@vger.kernel.org
Fixes: c5b077549136 ("KVM: Convert the kvm->vcpus array to a xarray", 2021-12-08)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit afb2acb2e3a32e4d56f7fbd819769b98ed1b7520)
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2203922
Reject hardware enabling, i.e. VM creation, if a restart/shutdown has
been initiated to avoid re-enabling hardware between kvm_reboot() and
machine_{halt,power_off,restart}(). The restart case is especially
problematic (for x86) as enabling VMX (or clearing GIF in KVM_RUN on
SVM) blocks INIT, which results in the restart/reboot hanging as BIOS
is unable to wake and rendezvous with APs.
Note, this bug, and the original issue that motivated the addition of
kvm_reboot(), is effectively limited to a forced reboot, e.g. `reboot -f`.
In a "normal" reboot, userspace will gracefully tear down userspace before
triggering the kernel reboot (modulo bugs, errors, etc), i.e. any process
that might do ioctl(KVM_CREATE_VM) is long gone.
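
A toy model of the check, assuming the enable path consults a global system state and fails with an error such as -EBUSY once a restart or shutdown has begun. The enum, struct, and function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum model_system_state { MODEL_RUNNING, MODEL_RESTART, MODEL_HALT };

struct host_model {
	enum model_system_state state;
	int hw_enabled_users;
};

/* Called on VM creation: refuse once a restart/shutdown is under way. */
int model_hardware_enable_all(struct host_model *host)
{
	assert(host != NULL);
	if (host->state != MODEL_RUNNING)
		return -EBUSY;
	host->hw_enabled_users++;
	return 0;
}

void model_shutdown(struct host_model *host, enum model_system_state next)
{
	host->state = next;		/* the state is updated first ... */
	host->hw_enabled_users = 0;	/* ... then virtualization is disabled */
}
```

Once the state has left "running", a late VM creation attempt cannot turn virtualization back on behind the shutdown path's back.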
Fixes: 8e1c18157d ("KVM: VMX: Disable VMX when system shutdown")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20230512233127.804012-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit e0ceec221f62deb5d6c32c0327030028d3db5f27)
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2203922
Use syscore_ops.shutdown to disable hardware virtualization during a
reboot instead of using the dedicated reboot_notifier so that KVM disables
virtualization _after_ system_state has been updated. This will allow
fixing a race in KVM's handling of a forced reboot where KVM can end up
enabling hardware virtualization between kernel_restart_prepare() and
machine_restart().
Rename KVM's hook to match the syscore op to avoid any possible confusion
from wiring up a "reboot" helper to a "shutdown" hook (neither "shutdown"
nor "reboot" is completely accurate as the hook handles both).
Opportunistically rewrite kvm_shutdown()'s comment to make it less VMX
specific, and to explain why kvm_rebooting exists.
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Zenghui Yu <yuzenghui@huawei.com>
Cc: kvmarm@lists.linux.dev
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>
Cc: Anup Patel <anup@brainfault.org>
Cc: Atish Patra <atishp@atishpatra.org>
Cc: kvm-riscv@lists.infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20230512233127.804012-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 6735150b69978a9f73e3d1bab719e81a5dfafa83)
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2203922
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2217329
We run into a guest hang in edk2 firmware when KSM is kept running on
the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash
device (TYPE_PFLASH_CFI01) during the operation of sector erasing or
buffered write. The status is returned by reading the memory region of
the pflash device and the read request should have been forwarded to QEMU
and emulated by it. Unfortunately, the read request is covered by an
illegal stage2 mapping when the guest hang issue occurs. The read request
is completed with QEMU bypassed and wrong status is fetched. The edk2
firmware runs into an infinite loop with the wrong status.
The illegal stage2 mapping is populated due to same page sharing by KSM
at (C) even though the associated memory slot has been marked as invalid at (B)
when the memory slot is requested to be deleted. It's notable that the
active and inactive memory slots can't be swapped when we're in the middle
of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
is elevated, and kvm_swap_active_memslots() will busy loop until it reaches
zero again. Besides, the swapping from the active to the inactive memory
slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().
  CPU-A                                 CPU-B
  -----                                 -----
  ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
  kvm_vm_ioctl_set_memory_region
  kvm_set_memory_region
  __kvm_set_memory_region
  kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
    kvm_invalidate_memslot
      kvm_copy_memslot
      kvm_replace_memslot
      kvm_swap_active_memslots        (A)
      kvm_arch_flush_shadow_memslot   (B)
                                        same page sharing by KSM
                                        kvm_mmu_notifier_invalidate_range_start
                                              :
                                        kvm_mmu_notifier_change_pte
                                          kvm_handle_hva_range
                                          __kvm_handle_hva_range
                                          kvm_set_spte_gfn            (C)
                                              :
                                        kvm_mmu_notifier_invalidate_range_end
Fix the issue by skipping the invalid memory slot at (C) to avoid the
illegal stage2 mapping so that the read request for the pflash's status
is forwarded to QEMU and emulated by it. In this way, the correct pflash's
status can be returned from QEMU to break the infinite loop in the edk2
firmware.
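
The shape of the fix can be sketched as a guard in front of the per-slot handler. The flag value and all names below are assumptions for illustration, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MEMSLOT_INVALID (1u << 16)	/* hypothetical bit position */

struct model_memslot {
	unsigned int flags;
	int mappings;		/* how many stage2 mappings were installed */
};

/* Stand-in for the arch handler that installs the new mapping, step (C). */
bool model_set_spte(struct model_memslot *slot)
{
	slot->mappings++;
	return true;
}

/* Wrapper used on the change_pte path: skip slots that are being deleted. */
bool model_change_spte(struct model_memslot *slot)
{
	assert(slot != NULL);
	if (slot->flags & MODEL_MEMSLOT_INVALID)
		return false;
	return model_set_spte(slot);
}
```

A slot already marked invalid at (B) never receives a new mapping, so later guest reads fault and are forwarded to userspace for emulation.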
We tried a git-bisect and the first problematic commit is cd4c718352
("KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With that
commit, clean_dcache_guest_page() is called after the memory slots are
iterated in kvm_mmu_notifier_change_pte(); before it, the call happened
before the iteration. This change enlarges the racy window between
kvm_mmu_notifier_change_pte() and memory slot removal enough that we can
reproduce the issue in a practical test case. However, the issue has
existed since commit d5d8184d35 ("KVM: ARM: Memory virtualization setup").
Cc: stable@vger.kernel.org # v3.9+
Fixes: d5d8184d35 ("KVM: ARM: Memory virtualization setup")
Reported-by: Shuai Hu <hshuai@redhat.com>
Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Message-Id: <20230615054259.14911-1-gshan@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 2230f9e1171a2e9731422a14d1bbc313c0b719d1)
Signed-off-by: Eric Auger <eric.auger@redhat.com>