Commit Graph

1477 Commits

Author SHA1 Message Date
Jim Mattson e2ffe85b6d KVM: x86: Introduce KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM
Add KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM to allow L1 to set
FREEZE_IN_SMM in vmcs12's GUEST_IA32_DEBUGCTL field, as permitted
prior to commit 6b1dd26544 ("KVM: VMX: Preserve host's
DEBUGCTLMSR_FREEZE_IN_SMM while running the guest").  Enable the quirk
by default for backwards compatibility (like all quirks); userspace
can disable it via KVM_CAP_DISABLE_QUIRKS2 for consistency with the
constraints on WRMSR(IA32_DEBUGCTL).

Note that the quirk only bypasses the consistency check.  The vmcs02 bit is
still owned by the host, and PMCs are not frozen during virtualized SMM.
In particular, if a host administrator decides that PMCs should not be
frozen during physical SMM, then L1 has no say in the matter.
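
The two properties described above (the quirk only relaxes the consistency check, and the vmcs02 bit stays host-owned) can be sketched as a small standalone model. This is not KVM's code: the function names and the wrmsr_allowed parameter are hypothetical, and only the bit position (IA32_DEBUGCTL bit 14) is architectural.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IA32_DEBUGCTL bit 14 (DEBUGCTLMSR_FREEZE_IN_SMM in Linux). */
#define DEBUGCTL_FREEZE_IN_SMM (UINT64_C(1) << 14)

/*
 * Hypothetical model of the nested VM-Enter consistency check.
 * wrmsr_allowed: bits the guest could legally set via WRMSR(IA32_DEBUGCTL).
 */
bool vmcs12_debugctl_is_valid(uint64_t debugctl, uint64_t wrmsr_allowed,
			      bool quirk_enabled)
{
	uint64_t allowed = wrmsr_allowed;

	/* The quirk only widens the check; it grants no control over the bit. */
	if (quirk_enabled)
		allowed |= DEBUGCTL_FREEZE_IN_SMM;

	return (debugctl & ~allowed) == 0;
}

/* The FREEZE_IN_SMM bit loaded into vmcs02 always mirrors the host's value. */
uint64_t vmcs02_debugctl(uint64_t vmcs12_debugctl, uint64_t host_debugctl)
{
	return (vmcs12_debugctl & ~DEBUGCTL_FREEZE_IN_SMM) |
	       (host_debugctl & DEBUGCTL_FREEZE_IN_SMM);
}
```

With the quirk disabled, the model rejects FREEZE_IN_SMM exactly as it would reject any other bit the guest cannot set via WRMSR.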

Fixes: 095686e6fc ("KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter")
Cc: stable@vger.kernel.org
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260205231537.1278753-1-jmattson@google.com
[sean: tag for stable@, clean-up and fix goofs in the comment and docs]
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Rename quirk. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2026-03-11 18:41:11 +01:00
Paolo Bonzini bf2c3138ae Merge tag 'kvm-x86-pmu-6.20' of https://github.com/kvm-x86/linux into HEAD
KVM mediated PMU support for 6.20

Add support for mediated PMUs, where KVM gives the guest full ownership of PMU
hardware (context switched around the fastpath run loop) and allows direct
access to data MSRs and PMCs (restricted by the vPMU model), but intercepts
access to control registers, e.g. to enforce event filtering and to prevent the
guest from profiling sensitive host state.

To keep overall complexity reasonable, mediated PMU usage is all or nothing
for a given instance of KVM (controlled via module param).  The mediated PMU
is disabled by default, partly to maintain backwards compatibility for existing
setups, partly because there are tradeoffs when running with a mediated PMU that
may be non-starters for some use cases, e.g. the host loses the ability to
profile guests with mediated PMUs, the fastpath run loop is also a blind spot,
entry/exit transitions are more expensive, etc.

Versus the emulated PMU, where KVM is "just another perf user", the mediated
PMU delivers more accurate profiling and monitoring (no risk of contention and
thus dropped events), with significantly less overhead (fewer exits and faster
emulation/programming of event selectors).  E.g. when running Specint-2017 on
a single-socket Sapphire Rapids with 56 cores and no-SMT, and using perf from
within the guest:

  Perf command:
  a. basic-sampling: perf record -F 1000 -e 6-instructions  -a --overwrite
  b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite

  Guest performance overhead:
  ---------------------------------------------------------------------------
  | Test case          | emulated vPMU | all passthrough | passthrough with |
  |                    |               |                 | event filters    |
  ---------------------------------------------------------------------------
  | basic-sampling     |   33.62%      |    4.24%        |   6.21%          |
  ---------------------------------------------------------------------------
  | multiplex-sampling |   79.32%      |    7.34%        |   10.45%         |
  ---------------------------------------------------------------------------
2026-02-11 12:45:40 -05:00
Paolo Bonzini 1b13885edf Merge tag 'kvm-x86-apic-6.20' of https://github.com/kvm-x86/linux into HEAD
KVM x86 APIC-ish changes for 6.20

 - Fix a benign bug where KVM could use the wrong memslots (ignored SMM) when
   creating a vCPU-specific mapping of guest memory.

 - Clean up KVM's handling of marking mapped vCPU pages dirty.

 - Drop a pile of *ancient* sanity checks hidden behind KVM's unused
   ASSERT() macro, most of which could be trivially triggered by the guest
   and/or user, and all of which were useless.

 - Fold "struct dest_map" into its sole user, "struct rtc_status", to make it
   more obvious what the weird parameter is used for, and to allow burying the
   RTC shenanigans behind CONFIG_KVM_IOAPIC=y.

 - Bury all of ioapic.h and KVM_IRQCHIP_KERNEL behind CONFIG_KVM_IOAPIC=y.

 - Add a regression test for recent APICv update fixes.

 - Rework KVM's handling of VMCS updates while L2 is active to temporarily
   switch to vmcs01 instead of deferring the update until the next nested
   VM-Exit.  The deferred updates approach directly contributed to several
   bugs, was proving to be a maintenance burden due to the difficulty in
   auditing the correctness of deferred updates, and was polluting
   "struct nested_vmx" with a growing pile of booleans.

 - Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv()
   to consolidate the updates, and to co-locate SVI updates with the updates
   for KVM's own cache of ISR information.

 - Drop a dead function declaration.
2026-02-11 12:45:32 -05:00
Paolo Bonzini 9e03b7caf4 KVM x86 misc changes for 6.20
Merge tag 'kvm-x86-misc-6.20' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.20

 - Disallow changing the virtual CPU model if L2 is active, for all the same
   reasons KVM disallows changing the model after the first KVM_RUN.

 - Fix a bug where KVM would incorrectly reject host accesses to PV MSRs that
   were advertised as supported to userspace when running with
   KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled.

 - Fix a bug where KVM would attempt to read protect guest state (CR3) when
   configuring an async #PF entry.

 - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86
   only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL.  Explicitly allow
   the few exports that are intended for external usage.

 - Ignore -EBUSY when checking nested events after a vCPU exits blocking as
   the WARN is user-triggerable, and because exiting to userspace on -EBUSY
   does more harm than good in pretty much every situation.

 - Throw in the towel and drop the WARN on INIT/SIPI being blocked when vCPU is
   in Wait-For-SIPI, as playing whack-a-mole with syzkaller turned out to be an
   unwinnable game.

 - Add support for new Intel instructions that don't require anything beyond
   enumerating feature flags to userspace.

 - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2.

 - Add WARNs to guard against modifying KVM's CPU caps outside of the intended
   setup flow, as nested VMX in particular is sensitive to unexpected changes
   in KVM's golden configuration.

 - Add a quirk to allow userspace to opt-in to actually suppress EOI broadcasts
   when the suppression feature is enabled by the guest (currently limited to
   split IRQCHIP, i.e. userspace I/O APIC).  Sadly, simply fixing KVM to honor
   Suppress EOI Broadcasts isn't an option as some userspaces have come to rely
   on KVM's buggy behavior (KVM advertises Suppress EOI Broadcast irrespective
   of whether or not userspace I/O APIC supports Directed EOIs).

 - Minor cleanups.
2026-02-09 18:53:47 +01:00
Khushit Shah 6517dfbcc9 KVM: x86: Add x2APIC "features" to control EOI broadcast suppression
Add two flags for KVM_CAP_X2APIC_API to allow userspace to control support
for Suppress EOI Broadcasts when using a split IRQCHIP (I/O APIC emulated
by userspace), which KVM completely mishandles. When x2APIC support was
first added, KVM incorrectly advertised and "enabled" Suppress EOI
Broadcast, without fully supporting the I/O APIC side of the equation,
i.e. without adding directed EOI to KVM's in-kernel I/O APIC.

That flaw was carried over to split IRQCHIP support, i.e. KVM advertised
support for Suppress EOI Broadcasts irrespective of whether or not the
userspace I/O APIC implementation supported directed EOIs. Even worse,
KVM didn't actually suppress EOI broadcasts, i.e. userspace VMMs without
support for directed EOI came to rely on the "spurious" broadcasts.

KVM "fixed" the in-kernel I/O APIC implementation by completely disabling
support for Suppress EOI Broadcasts in commit 0bcc3fb95b ("KVM: lapic:
stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use"), but
didn't do anything to remedy userspace I/O APIC implementations.

KVM's bogus handling of Suppress EOI Broadcast is problematic when the
guest relies on interrupts being masked in the I/O APIC until well after
the initial local APIC EOI. E.g. Windows with Credential Guard enabled
handles interrupts in the following order:
  1. Interrupt for L2 arrives.
  2. L1 APIC EOIs the interrupt.
  3. L1 resumes L2 and injects the interrupt.
  4. L2 EOIs after servicing.
  5. L1 performs the I/O APIC EOI.

Because KVM EOIs the I/O APIC at step #2, the guest can get an interrupt
storm, e.g. if the IRQ line is still asserted and userspace reacts to the
EOI by re-injecting the IRQ, because the guest doesn't de-assert the line
until step #4, and doesn't expect the interrupt to be re-enabled until
step #5.

Unfortunately, simply "fixing" the bug isn't an option, as KVM has no way
of knowing if the userspace I/O APIC supports directed EOIs, i.e.
suppressing EOI broadcasts would result in interrupts being stuck masked
in the userspace I/O APIC due to step #5 being ignored by userspace. And
fully disabling support for Suppress EOI Broadcast is also undesirable, as
picking up the fix would require a guest reboot, *and* more importantly
would change the virtual CPU model exposed to the guest without any buy-in
from userspace.

Add KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST and
KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST flags to allow userspace to
explicitly enable or disable support for Suppress EOI Broadcasts. This
gives userspace control over the virtual CPU model exposed to the guest,
as KVM should never have enabled support for Suppress EOI Broadcast without
userspace opt-in. Not setting either flag will result in legacy quirky
behavior for backward compatibility.
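
As a rough illustration of the three resulting behaviors for a split IRQCHIP, the sketch below models what the guest sees and what happens on a local APIC EOI. The enum and function names are hypothetical; only the behavior per mode is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names mirroring the three userspace choices described above. */
enum eoi_policy {
	EOI_POLICY_LEGACY,	/* neither flag set: quirky historical behavior */
	EOI_POLICY_ENABLE,	/* KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST */
	EOI_POLICY_DISABLE,	/* KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST */
};

/* Is Suppress EOI Broadcast (Directed EOI) advertised to the guest? */
bool directed_eoi_advertised(enum eoi_policy policy)
{
	/* Legacy mode keeps advertising the feature, as KVM always has. */
	return policy != EOI_POLICY_DISABLE;
}

/* Is a local APIC EOI forwarded to the userspace I/O APIC? */
bool lapic_eoi_is_broadcast(enum eoi_policy policy, bool guest_set_suppress)
{
	/* Only the explicit opt-in honors the guest's suppression request. */
	if (policy == EOI_POLICY_ENABLE && guest_set_suppress)
		return false;

	return true;
}
```

In the legacy mode the feature is advertised yet every EOI is still broadcast, which is the combination that VMMs without directed EOI support came to rely on.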

Disallow fully enabling SUPPRESS_EOI_BROADCAST when using an in-kernel
I/O APIC, as KVM's history/support is just as tragic.  E.g. it's not clear
that commit c806a6ad35 ("KVM: x86: call irq notifiers with directed EOI")
was entirely correct, i.e. it may have simply papered over the lack of
Directed EOI emulation in the I/O APIC.

Note, Suppress EOI Broadcasts is defined only in Intel's SDM, not in AMD's
APM. But the bit is writable on some AMD CPUs, e.g. Turin, and KVM's ABI
is to support Directed EOI (KVM's name) irrespective of guest CPU vendor.

Fixes: 7543a635aa ("KVM: x86: Add KVM exit for IOAPIC EOIs")
Closes: https://lore.kernel.org/kvm/7D497EF1-607D-4D37-98E7-DAF95F099342@nutanix.com
Cc: stable@vger.kernel.org
Suggested-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Khushit Shah <khushit.shah@nutanix.com>
Link: https://patch.msgid.link/20260123125657.3384063-1-khushit.shah@nutanix.com
[sean: clean up minor formatting goofs and fix a comment typo]
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-30 13:28:35 -08:00
Zhao Liu 062768f426 KVM: x86: Advertise AVX10_VNNI_INT CPUID to userspace
Define and advertise AVX10_VNNI_INT CPUID to userspace when it's supported
by the host.

AVX10_VNNI_INT (0x24.0x1.ECX[bit 2]) is a discrete feature bit
introduced on Intel Diamond Rapids, which enumerates the support for
EVEX VPDP* instructions for INT8/INT16 [*].

Since this feature has no actual kernel usages, define it as a KVM-only
feature in reverse_cpuid.h.

Advertise new CPUID subleaf 0x24.0x1 with AVX10_VNNI_INT bit to
userspace for guest use. It's safe since no additional enabling work
is needed in the host kernel.
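
For illustration, a consumer of the advertised leaf could test the bit as sketched below. The struct and helper are hypothetical; only the location, CPUID 0x24.0x1.ECX[bit 2], comes from this message. The register values would come from wherever the caller obtains CPUID data, e.g. the entries returned by KVM_GET_SUPPORTED_CPUID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the output of one CPUID leaf/subleaf. */
struct cpuid_regs {
	uint32_t eax, ebx, ecx, edx;
};

/* Per the message: AVX10_VNNI_INT is CPUID 0x24.0x1.ECX[bit 2]. */
#define AVX10_VNNI_INT_BIT 2

/* regs must hold the output of CPUID leaf 0x24, subleaf 0x1. */
bool has_avx10_vnni_int(const struct cpuid_regs *regs)
{
	assert(regs);
	return (regs->ecx >> AVX10_VNNI_INT_BIT) & 1u;
}
```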

[*]: Intel Advanced Vector Extensions 10.2 Architecture Specification
     (rev 5.0).

Tested-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251120050720.931449-5-zhao1.liu@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-23 10:00:02 -08:00
Zhao Liu 58cbaf64e6 KVM: x86: Advertise AMX CPUIDs in subleaf 0x1E.0x1 to userspace
Define and advertise AMX CPUIDs (0x1E.0x1) to userspace when the leaf is
supported by the host.

Intel Diamond Rapids adds new AMX instructions to support new formats
and memory operations [*], and introduces the CPUID subleaf 0x1E.0x1
to centralize the discrete AMX feature bits within EAX.

Since these AMX features have no actual kernel usages, define them as
KVM-only features in reverse_cpuid.h.

In addition to the new features, CPUID 0x1E.0x1.EAX[bits 0-3] are
aliased positions of existing AMX feature bits distributed across the
0x7 leaves. To avoid duplicate feature names, name these aliases with an
*_ALIAS suffix, and define them in reverse_cpuid.h as KVM-only features
as well.

Advertise new CPUID subleaf 0x1E.0x1 with its AMX CPUID feature bits to
userspace for guest use. It's safe since no additional enabling work
is needed in the host kernel.

[*]: Intel Architecture Instruction Set Extensions and Future Features
     (rev.059).

Tested-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251120050720.931449-3-zhao1.liu@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-23 09:59:57 -08:00
Sean Christopherson fd09d259c1 KVM: x86: Hide KVM_IRQCHIP_KERNEL behind CONFIG_KVM_IOAPIC=y
Enumerate KVM_IRQCHIP_KERNEL if and only if support for an in-kernel I/O
APIC is enabled, as all usage is likewise guarded by CONFIG_KVM_IOAPIC=y.

Link: https://patch.msgid.link/20251206004311.479939-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-12 09:31:42 -08:00
Amit Shah db5e824964 KVM: SVM: Virtualize and advertise support for ERAPS
AMD CPUs with the Enhanced Return Address Predictor Security (ERAPS)
feature (available on Zen5+) obviate the need for FILL_RETURN_BUFFER
sequences right after VMEXITs.  ERAPS adds guest/host tags to entries in
the RSB (a.k.a. RAP).  This helps with speculation protection across the
VM boundary, and it also preserves host and guest entries in the RSB that
can improve software performance (which would otherwise be flushed due to
the FILL_RETURN_BUFFER sequences).

Importantly, ERAPS also improves cross-domain security by clearing the RAP
in certain situations.  Specifically, the RAP is cleared in response to
actions that are typically tied to software context switching between
tasks.  Per the APM:

  The ERAPS feature eliminates the need to execute CALL instructions to
  clear the return address predictor in most cases. On processors that
  support ERAPS, return addresses from CALL instructions executed in host
  mode are not used in guest mode, and vice versa. Additionally, the
  return address predictor is cleared in all cases when the TLB is
  implicitly invalidated and in the following cases:

  • MOV CR3 instruction
  • INVPCID other than single address invalidation (operation type 0)

ERAPS also allows CPUs to extend the size of the RSB/RAP from the older
standard (of 32 entries) to a new size, enumerated in CPUID leaf
0x80000021:EBX bits 23:16 (64 entries in Zen5 CPUs).

In hardware, ERAPS is always-on; when running in host context, the CPU
uses the full RSB/RAP size without any software changes necessary.
However, when running in guest context, the CPU utilizes the full size of
the RSB/RAP if and only if the new ALLOW_LARGER_RAP flag is set in the
VMCB; if the flag is not set, the CPU limits itself to the historical size
of 32 entries.

Requiring software to opt-in for guest usage of RAPs larger than 32 entries
allows hypervisors, i.e. KVM, to emulate the aforementioned conditions in
which the RAP is cleared as well as the guest/host split.  E.g. if the CPU
unconditionally used the full RAP for guests, failure to clear the RAP on
transitions between L1 and L2, or on emulated guest TLB flushes, would
expose the guest to RAP-based attacks as a guest without support for ERAPS
wouldn't know that its FILL_RETURN_BUFFER sequence is insufficient.

Address the ~two broad categories of ERAPS emulation, and advertise
ERAPS support to userspace, along with the RAP size enumerated in CPUID.

1. Architectural RAP clearing: as above, CPUs with ERAPS clear RAP entries
   on several conditions, including CR3 updates.  To handle scenarios
   where a relevant operation is handled in common code (emulation of
   INVPCID and to a lesser extent MOV CR3), piggyback VCPU_EXREG_CR3 and
   create an alias, VCPU_EXREG_ERAPS.  SVM doesn't utilize CR3 dirty
   tracking, and so for all intents and purposes VCPU_EXREG_CR3 is unused.
   Aliasing VCPU_EXREG_ERAPS ensures that any flow that writes CR3 will
   also clear the guest's RAP, and allows common x86 to mark ERAPS vCPUs
   as needing a RAP clear without having to add a new request (or other
   mechanism).

2. Nested guests: the ERAPS feature adds host/guest tagging to entries
   in the RSB, but does not distinguish between the guest ASIDs.  To
   prevent the case of an L2 guest poisoning the RSB to attack the L1
   guest, the CPU exposes a new VMCB bit (CLEAR_RAP).  The next
   VMRUN with a VMCB that has this bit set causes the CPU to flush the
   RSB before entering the guest context.  Set the bit in VMCB01 after a
   nested #VMEXIT to ensure the next time the L1 guest runs, its RSB
   contents aren't polluted by the L2's contents.  Similarly, before
   entry into a nested guest, set the bit for VMCB02, so that the L1
   guest's RSB contents are not leaked/used in the L2 context.
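
The nested-guest handling in item 2 can be sketched as a small state model. All names are hypothetical, and treating the CLEAR_RAP request as one-shot (consumed by the next VMRUN with that VMCB) is a modeling assumption, not something this message spells out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one pending CLEAR_RAP request per VMCB. */
struct nested_rap_state {
	bool vmcb01_clear_rap;	/* VMCB used to run L1 */
	bool vmcb02_clear_rap;	/* VMCB used to run L2 */
};

/* Before entering L2: L1's RAP entries must not be usable by L2. */
void on_nested_vmrun(struct nested_rap_state *s)
{
	assert(s);
	s->vmcb02_clear_rap = true;
}

/* After a nested #VMEXIT: L2's RAP entries must not be usable by L1. */
void on_nested_vmexit(struct nested_rap_state *s)
{
	assert(s);
	s->vmcb01_clear_rap = true;
}

/*
 * Model of VMRUN: report whether the RAP is flushed on this entry and
 * consume the request (assumed to be one-shot in this model).
 */
bool vmrun_flushes_rap(struct nested_rap_state *s, bool running_l2)
{
	bool *request;
	bool flush;

	assert(s);
	request = running_l2 ? &s->vmcb02_clear_rap : &s->vmcb01_clear_rap;
	flush = *request;
	*request = false;
	return flush;
}
```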

Enable ALLOW_LARGER_RAP (and emulate RAP clears) if and only if ERAPS is
exposed to the guest.  Enabling ALLOW_LARGER_RAP unconditionally wouldn't
cause any functional issues, but ignoring userspace's (and L1's) desires
would put KVM into a grey area, which is especially undesirable due to the
potential security implications.  E.g. if a use case wants to have L1 do
manual RAP clearing even when ERAPS is present in hardware, enabling
ALLOW_LARGER_RAP could result in L1 leaving stale entries in the RAP.
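
The "if and only if" policy above, together with the 32-entry fallback described earlier, can be summarized in a few lines. The struct and function names are hypothetical, and hw_rap_entries stands in for whatever size the hardware provides (64 on Zen5 per this message).

```c
#include <assert.h>
#include <stdbool.h>

/* Historical RSB/RAP size used when ALLOW_LARGER_RAP is clear. */
#define LEGACY_RAP_ENTRIES 32u

/* Hypothetical per-vCPU policy derived from the guest's CPUID. */
struct eraps_vmcb_policy {
	bool allow_larger_rap;		/* models the ALLOW_LARGER_RAP VMCB flag */
	bool emulate_rap_clears;	/* KVM emulates architectural RAP clears */
};

/* Both behaviors are enabled if and only if ERAPS is exposed to the guest. */
struct eraps_vmcb_policy eraps_policy(bool guest_has_eraps)
{
	struct eraps_vmcb_policy p = {
		.allow_larger_rap = guest_has_eraps,
		.emulate_rap_clears = guest_has_eraps,
	};

	return p;
}

/* Number of RAP entries the CPU uses while running the guest. */
unsigned int guest_rap_entries(const struct eraps_vmcb_policy *p,
			       unsigned int hw_rap_entries)
{
	assert(p);
	return p->allow_larger_rap ? hw_rap_entries : LEGACY_RAP_ENTRIES;
}
```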

ERAPS is documented in AMD APM Vol 2 (Pub 24593), in revisions 3.43 and
later.

Signed-off-by: Amit Shah <amit.shah@amd.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Amit Shah <amit.shah@amd.com>
Link: https://patch.msgid.link/aR913X8EqO6meCqa@google.com
2026-01-08 12:12:12 -08:00
Mingwei Zhang 02918f0077 KVM: x86/pmu: Introduce eventsel_hw to prepare for pmu event filtering
Introduce eventsel_hw and fixed_ctr_ctrl_hw to store the actual HW value in
PMU event selector MSRs. The mediated PMU checks events before allowing the
event values to be written to the PMU MSRs. However, to match the HW behavior,
when a PMU event check fails, KVM should still allow the guest to read the
value back.

This essentially requires an extra variable to separate the guest requested
value from actual PMU MSR value. Note this only applies to event selectors.
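
A minimal sketch of the split between the two values follows. The struct and helpers are hypothetical, and clearing the enable bit (bit 22 of the event select) for a disallowed event is an assumption about how a filtered event might be kept out of hardware; this message does not specify that detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Enable bit (bit 22) of an x86 performance event select MSR. */
#define EVENTSEL_ENABLE (UINT64_C(1) << 22)

struct pmc_model {
	uint64_t eventsel;	/* value the guest wrote; returned on reads */
	uint64_t eventsel_hw;	/* value actually loaded into hardware */
};

/* Record the guest's value, but only let allowed events reach hardware. */
void pmc_write_eventsel(struct pmc_model *pmc, uint64_t data, bool event_allowed)
{
	assert(pmc);
	pmc->eventsel = data;
	/* Assumption: a filtered event is neutered by clearing its enable bit. */
	pmc->eventsel_hw = event_allowed ? data : (data & ~EVENTSEL_ENABLE);
}

/* Reads always observe what the guest wrote, matching hardware behavior. */
uint64_t pmc_read_eventsel(const struct pmc_model *pmc)
{
	assert(pmc);
	return pmc->eventsel;
}
```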

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08 11:52:09 -08:00
Dapeng Mi 3e51822b2f KVM: x86/pmu: Start stubbing in mediated PMU support
Introduce enable_mediated_pmu as a global variable, with the intent of
exposing it to userspace as a vendor module parameter, to control and reflect
mediated vPMU support.  Wire up the perf plumbing to create+release a
mediated PMU, but defer exposing the parameter to userspace until KVM
support for mediated PMUs is fully landed.

To (a) minimize compatibility issues, (b) give userspace a chance to
opt out of the restrictive side-effects of perf_create_mediated_pmu(),
and (c) avoid adding new dependencies between enabling an in-kernel
irqchip and a mediated vPMU, defer "creating" a mediated PMU in perf
until the first vCPU is created.

Regarding userspace compatibility, an alternative solution would be to
make the mediated PMU fully opt-in, e.g. to avoid unexpected failure due
to perf_create_mediated_pmu() failing.  Ironically, that approach creates
an even bigger compatibility issue, as turning on enable_mediated_pmu
would silently break VMMs that don't utilize KVM_CAP_PMU_CAPABILITY (well,
silently until the guest tried to access PMU assets).

Regarding an in-kernel irqchip, create a mediated PMU if and only if the
VM has an in-kernel local APIC, as the mediated PMU will take a hard
dependency on forwarding PMIs to the guest without bouncing through host
userspace.  Silently "drop" the PMU instead of rejecting KVM_CREATE_VCPU,
as KVM's existing vPMU support doesn't function correctly if the local
APIC is emulated by userspace, e.g. PMIs will never be delivered.  I.e.
it's far, far more likely that rejecting KVM_CREATE_VCPU would cause
problems, e.g. for tests or userspace daemons that just want to probe
basic KVM functionality.

Note!  Deliberately make mediated PMU creation "sticky", i.e. don't unwind
it on failure to create a vCPU.  Practically speaking, there's no harm to
having a VM with a mediated PMU and no vCPUs.  To avoid an "impossible" VM
setup, reject KVM_CAP_PMU_CAPABILITY if a mediated PMU has been created,
i.e. don't let userspace disable PMU support after failed vCPU creation
(with PMU support enabled).
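
The creation rules above can be sketched as a simplified state machine. Everything here is a hypothetical model: the struct, field, and function names are invented for illustration, the -1 return is a placeholder error, and failure of perf_create_mediated_pmu() itself is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical VM state; fields do not correspond to real KVM fields. */
struct vm_model {
	bool mediated_pmu_param;	/* models the enable_mediated_pmu param */
	bool enable_pmu;		/* PMU not disabled by userspace */
	bool in_kernel_lapic;
	bool has_mediated_pmu;
	unsigned int created_vcpus;
};

/* Returns 0 on success, -1 if vCPU creation fails partway through. */
int vm_create_vcpu(struct vm_model *vm, bool vcpu_setup_fails)
{
	assert(vm);

	/*
	 * "Create" the mediated PMU on the first attempt to create a vCPU.
	 * Without an in-kernel local APIC the PMU is silently dropped
	 * instead of failing vCPU creation.
	 */
	if (vm->mediated_pmu_param && vm->enable_pmu && vm->in_kernel_lapic)
		vm->has_mediated_pmu = true;

	/* Sticky: a failed vCPU creation does not unwind the mediated PMU. */
	if (vcpu_setup_fails)
		return -1;

	vm->created_vcpus++;
	return 0;
}

/* Models KVM_CAP_PMU_CAPABILITY: rejected once a mediated PMU exists. */
int vm_set_pmu_capability(struct vm_model *vm, bool enable)
{
	assert(vm);

	if (vm->has_mediated_pmu)
		return -1;

	vm->enable_pmu = enable;
	return 0;
}
```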

Defer vendor specific requirements and constraints to the future.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Co-developed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/20251206001720.468579-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08 11:52:04 -08:00
Paolo Bonzini 679fcce002 KVM SVM changes for 6.19:
Merge tag 'kvm-x86-svm-6.19' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.19:

 - Fix a few missing "VMCB dirty" bugs.

 - Fix the worst of KVM's lack of EFER.LMSLE emulation.

 - Add AVIC support for addressing 4k vCPUs in x2AVIC mode.

 - Fix incorrect handling of selective CR0 writes when checking intercepts
   during emulation of L2 instructions.

 - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on
   VMRUN and #VMEXIT.

 - Fix a bug where KVM could corrupt the guest code stream when re-injecting a soft
   interrupt if the guest patched the underlying code after the VM-Exit, e.g.
   when Linux patches code with a temporary INT3.

 - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to
   userspace, and extend KVM "support" to all policy bits that don't require
   any actual support from KVM.
2025-11-26 09:48:39 +01:00
Paolo Bonzini de8e8ebb1a KVM TDX changes for 6.19:
Merge tag 'kvm-x86-tdx-6.19' of https://github.com/kvm-x86/linux into HEAD

KVM TDX changes for 6.19:

 - Overhaul the TDX code to address systemic races where KVM (acting on behalf
   of userspace) could inadvertently trigger lock contention in the TDX-Module,
   which KVM was either working around in weird, ugly ways, or was simply
   oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever
   selftests).

 - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if
   creating said vCPU failed partway through.

 - Fix a few sparse warnings (bad annotation, 0 != NULL).

 - Use struct_size() to simplify copying capabilities to userspace.
2025-11-26 09:36:37 +01:00
Brendan Jackman 38ee66cb18 KVM: x86: Unify L1TF flushing under per-CPU variable
Currently the need to flush L1D for L1TF is tracked by two bits: one
per-CPU and one per-vCPU.

The per-vCPU bit is always set when the vCPU shows up on a core, so
there is no interesting state that's truly per-vCPU. Indeed, this is a
requirement, since L1D is a part of the physical CPU.

So simplify this by combining the two bits.

The vCPU bit was being written from preemption-enabled regions.  To play
nice with those cases, wrap all calls from KVM and use a raw write so that
requesting a flush with preemption enabled doesn't trigger what would
effectively be DEBUG_PREEMPT false positives.  Preemption doesn't need to
be disabled, as kvm_arch_vcpu_load() will mark the new CPU as needing a
flush if the vCPU task is migrated, or if userspace runs the vCPU on a
different task.
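
The unified tracking can be pictured with the toy model below, where a single flag per physical CPU replaces the per-CPU plus per-vCPU pair. The names and the fixed CPU count are hypothetical, and the model only covers the "flush if requested" case.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* One flag per physical CPU; there is no per-vCPU state. */
static bool l1d_flush_pending[MODEL_NR_CPUS];

/* A vCPU showing up on a CPU always marks that CPU as needing a flush. */
void model_vcpu_load(unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	l1d_flush_pending[cpu] = true;
}

/* Before VM-Enter: report whether to flush L1D and consume the request. */
bool model_l1d_flush_before_entry(unsigned int cpu)
{
	bool flush;

	assert(cpu < MODEL_NR_CPUS);
	flush = l1d_flush_pending[cpu];
	l1d_flush_pending[cpu] = false;
	return flush;
}
```

Because loading a vCPU on a new CPU sets that CPU's flag, a migrated vCPU still gets its flush even if an earlier request was recorded on the old CPU.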

Signed-off-by: Brendan Jackman <jackmanb@google.com>
[sean: put raw write in KVM instead of in a hardirq.h variant]
Link: https://patch.msgid.link/20251113233746.1703361-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-18 16:22:45 -08:00
Lei Chen 446fcce2a5 Revert "x86: kvm: rate-limit global clock updates"
This reverts commit 7e44e4495a.

Commit 7e44e4495a ("x86: kvm: rate-limit global clock updates")
was intended to use kvmclock_update_work to sync NTP corrections
across all vCPUs' kvmclocks, and is based on commit 0061d53daf
("KVM: x86: limit difference between kvmclock updates").

Since kvmclock has been switched to mono raw, this commit can be
reverted.

Signed-off-by: Lei Chen <lei.chen@smartx.com>
Link: https://patch.msgid.link/20250819152027.1687487-3-lei.chen@smartx.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-17 07:50:24 -08:00
Lei Chen 43ddbf16ed Revert "x86: kvm: introduce periodic global clock updates"
This reverts commit 332967a3ea.

Commit 332967a3ea ("x86: kvm: introduce periodic global clock
updates") introduced a work item that runs at a 300s interval to sync
NTP corrections across all vCPUs.

Since commit 53fafdbb8b ("KVM: x86: switch KVMCLOCK base to
monotonic raw clock"), kvmclock has been based on the monotonic raw
clock, so NTP corrections no longer need to be taken into consideration.

Signed-off-by: Lei Chen <lei.chen@smartx.com>
Link: https://patch.msgid.link/20250819152027.1687487-2-lei.chen@smartx.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-17 07:50:23 -08:00
Omar Sandoval 4da3768e18 KVM: SVM: Don't skip unrelated instruction if INT3/INTO is replaced
When re-injecting a soft interrupt from an INT3, INTO, or (select) INTn
instruction, discard the exception and retry the instruction if the code
stream is changed (e.g. by a different vCPU) between when the CPU
executes the instruction and when KVM decodes the instruction to get the
next RIP.

As effectively predicted by commit 6ef88d6e36 ("KVM: SVM: Re-inject
INT3/INTO instead of retrying the instruction"), failure to verify that
the correct INTn instruction was decoded can effectively clobber guest
state due to decoding the wrong instruction and thus specifying the
wrong next RIP.

The bug most often manifests as "Oops: int3" panics on static branch
checks in Linux guests.  Enabling or disabling a static branch in Linux
uses the kernel's "text poke" code patching mechanism.  To modify code
while other CPUs may be executing that code, Linux (temporarily)
replaces the first byte of the original instruction with an int3 (opcode
0xcc), then patches in the new code stream except for the first byte,
and finally replaces the int3 with the first byte of the new code
stream.  If a CPU hits the int3, i.e. executes the code while it's being
modified, then the guest kernel must look up the RIP to determine how to
handle the #BP, e.g. by emulating the new instruction.  If the RIP is
incorrect, then this lookup fails and the guest kernel panics.

The bug reproduces almost instantly by hacking the guest kernel to
repeatedly check a static branch[1] while running a drgn script[2] on
the host to constantly swap out the memory containing the guest's TSS.

[1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a
[2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b

Fixes: 6ef88d6e36 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
Cc: stable@vger.kernel.org
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Link: https://patch.msgid.link/1cc6dcdf36e3add7ee7c8d90ad58414eeb6c3d34.1762278762.git.osandov@fb.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-13 13:03:19 -08:00
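
The check described in the commit above can be sketched as follows: before skipping the instruction, confirm that the bytes at the guest RIP still encode the software interrupt that was being re-injected. The function and enum names are hypothetical, and the sketch ignores prefixes and the rest of the real decoder; only the opcodes (0xcc for INT3, 0xce for INTO, 0xcd imm8 for INT n) are architectural facts.

```c
#include <stddef.h>
#include <stdint.h>

enum toy_soft_int { TOY_INT3, TOY_INTO, TOY_INTN };

/*
 * Return the instruction length if the bytes at the guest RIP still encode
 * the expected software interrupt, or 0 if the code stream changed, in which
 * case the caller should retry the instruction instead of skipping it.
 */
static size_t toy_soft_int_len(const uint8_t *insn, size_t avail,
			       enum toy_soft_int type, uint8_t vector)
{
	if (avail < 1)
		return 0;

	switch (type) {
	case TOY_INT3:
		return insn[0] == 0xcc ? 1 : 0;		/* INT3 */
	case TOY_INTO:
		return insn[0] == 0xce ? 1 : 0;		/* INTO */
	case TOY_INTN:
		if (avail < 2 || insn[0] != 0xcd)	/* INT imm8 */
			return 0;
		return insn[1] == vector ? 2 : 0;
	}
	return 0;
}
```

With a text poke in flight, the first byte may already have been replaced (for example by the first byte of a JMP), so the INT3 case returns 0 and the next RIP is never computed from the wrong instruction.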
Sean Christopherson c0711f8c61 KVM: TDX: Explicitly set user-return MSRs that *may* be clobbered by the TDX-Module
Set all user-return MSRs to their post-TD-exit value when preparing to run
a TDX vCPU to ensure the value that KVM expects to be loaded after running
the vCPU is indeed the value that's loaded in hardware.  If the TDX-Module
doesn't actually enter the guest, i.e. doesn't do VM-Enter, then it won't
"restore" VMM state, i.e. won't clobber user-return MSRs to their expected
post-run values, in which case simply updating KVM's "cached" value will
effectively corrupt the cache due to hardware still holding the original
value.

In theory, KVM could conditionally update the current user-return value if
and only if tdh_vp_enter() succeeds, but in practice "success" doesn't
guarantee the TDX-Module actually entered the guest, e.g. if the TDX-Module
synthesizes an EPT Violation because it suspects a zero-step attack.

Force-load the expected values instead of trying to decipher whether or
not the TDX-Module restored/clobbered MSRs, as the risk doesn't justify
the benefits.  Effectively avoiding four WRMSRs once per run loop (even if
the vCPU is scheduled out, user-return MSRs only need to be reloaded if
the CPU exits to userspace or runs a non-TDX vCPU) is likely in the noise
when amortized over all entries, given the cost of running a TDX vCPU.
E.g. the cost of the WRMSRs is somewhere between ~300 and ~500 cycles,
whereas the cost of a _single_ roundtrip to/from a TDX guest is thousands
of cycles.

Fixes: e0b4f31a3c ("KVM: TDX: restore user ret MSRs")
Cc: stable@vger.kernel.org
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://patch.msgid.link/20251030191528.3380553-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-07 10:59:45 -08:00
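
A toy model of the cache-versus-hardware mismatch described in the commit above. The struct and function names are hypothetical, and a plain struct field stands in for the MSR; the sketch only shows why updating the cached value alone is unsafe when the guest may not actually be entered.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_msr {
	uint64_t hw;		/* value currently loaded in "hardware" */
	uint64_t cached;	/* value the host believes is loaded */
};

/* Stand-in for a TDX entry: the MSR is clobbered only if the guest ran. */
static void toy_vp_enter(struct toy_msr *msr, uint64_t post_exit,
			 bool entered_guest)
{
	if (entered_guest)
		msr->hw = post_exit;
}

/* Old approach as described: assume the clobber happened, update the cache. */
static void toy_run_cache_only(struct toy_msr *msr, uint64_t post_exit,
			       bool entered_guest)
{
	toy_vp_enter(msr, post_exit, entered_guest);
	msr->cached = post_exit;
}

/* New approach: load the expected value first so cache and hardware agree. */
static void toy_run_force_load(struct toy_msr *msr, uint64_t post_exit,
			       bool entered_guest)
{
	msr->hw = post_exit;		/* models the WRMSR */
	msr->cached = post_exit;
	toy_vp_enter(msr, post_exit, entered_guest);
}
```

In the cache-only variant, a run that never enters the guest leaves `cached` pointing at a value that hardware does not hold, which is the corruption the commit describes.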
Sean Christopherson 94428e3ba3 KVM: TDX: Convert INIT_MEM_REGION and INIT_VCPU to "unlocked" vCPU ioctl
Handle the KVM_TDX_INIT_MEM_REGION and KVM_TDX_INIT_VCPU vCPU sub-ioctls
in the unlocked variant, i.e. outside of vcpu->mutex, in anticipation of
taking kvm->lock along with all other vCPU mutexes, at which point the
sub-ioctls _must_ start without vcpu->mutex held.

No functional change intended.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05 11:17:30 -08:00
Sean Christopherson b9d5cf6de0 KVM: TDX: WARN if mirror SPTE doesn't have full RWX when creating S-EPT mapping
Pass in the mirror_spte to kvm_x86_ops.set_external_spte() to provide
symmetry with .remove_external_spte(), and assert in TDX that the mirror
SPTE is shadow-present with full RWX permissions (the TDX-Module doesn't
allow the hypervisor to control protections).

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05 11:05:51 -08:00
Sean Christopherson 7139c86065 KVM: x86/mmu: Drop the return code from kvm_x86_ops.remove_external_spte()
Drop the return code from kvm_x86_ops.remove_external_spte(), a.k.a.
tdx_sept_remove_private_spte(), as KVM simply does a KVM_BUG_ON() failure,
and that KVM_BUG_ON() is redundant since all error paths in TDX also do a
KVM_BUG_ON().

Opportunistically pass the spte instead of the pfn, as the API is clearly
about removing an spte.

Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-05 11:05:50 -08:00
Sean Christopherson 65a70164ab KVM: x86: Add a helper to dedup reporting of unhandled VM-Exits
Add and use a helper, kvm_prepare_unexpected_reason_exit(), to dedup the
code that fills the exit reason and CPU when KVM encounters a VM-Exit that
KVM doesn't know how to handle.

Reviewed-by: yaoyuan@linux.alibaba.com
Reviewed-by: Yao Yuan <yaoyuan@linux.alibaba.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030185004.3372256-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-11-04 09:14:47 -08:00
Sean Christopherson d273b52b6f KVM: x86: Move kvm_intr_is_single_vcpu() to lapic.c
Move kvm_intr_is_single_vcpu() to lapic.c, drop its export, and make its
"fast" helper local to lapic.c.  kvm_intr_is_single_vcpu() is only usable
if the local APIC is in-kernel, i.e. it most definitely belongs in the
local APIC code.

No functional change intended.

Fixes: cf04ec393e ("KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU")
Link: https://lore.kernel.org/r/20250919003303.1355064-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-09-30 13:40:02 -04:00
Paolo Bonzini 12abeb81c8 KVM x86 CET virtualization support for 6.18
Merge tag 'kvm-x86-cet-6.18' of https://github.com/kvm-x86/linux into HEAD

KVM x86 CET virtualization support for 6.18

Add support for virtualizing Control-flow Enforcement Technology (CET) on
Intel (Shadow Stacks and Indirect Branch Tracking) and AMD (Shadow Stacks).

CET is comprised of two distinct features, Shadow Stacks (SHSTK) and Indirect
Branch Tracking (IBT), that can be utilized by software to help provide
Control-flow integrity (CFI).  SHSTK defends against backward-edge attacks
(a.k.a. Return-oriented programming (ROP)), while IBT defends against
forward-edge attacks (a.k.a. CALL/JMP-oriented programming (COP/JOP)).

Attackers commonly use ROP and COP/JOP methodologies to redirect the control-
flow to unauthorized targets in order to execute small snippets of code,
a.k.a. gadgets, of the attacker's choice.  By chaining together several gadgets,
an attacker can perform arbitrary operations and circumvent the system's
defenses.

SHSTK defends against backward-edge attacks, which execute gadgets by modifying
the stack to branch to the attacker's target via RET, by providing a second
stack that is used exclusively to track control transfer operations.  The
shadow stack is separate from the data/normal stack, and can be enabled
independently in user and kernel mode.

When SHSTK is enabled, CALL instructions push the return address on both the
data and shadow stack. RET then pops the return address from both stacks and
compares the addresses.  If the return addresses from the two stacks do not
match, the CPU generates a Control Protection (#CP) exception.

IBT defends against forward-edge attacks, which branch to gadgets by executing
indirect CALL and JMP instructions with attacker controlled register or memory
state, by requiring the target of indirect branches to start with a special
marker instruction, ENDBRANCH.  If an indirect branch is executed and the next
instruction is not an ENDBRANCH, the CPU generates a #CP.  Note, ENDBRANCH
behaves as a NOP if IBT is disabled or unsupported.

From a virtualization perspective, CET presents several problems.  While SHSTK
and IBT have two layers of enabling, a global control in the form of a CR4 bit,
and a per-feature control in user and kernel (supervisor) MSRs (U_CET and S_CET
respectively), the {S,U}_CET MSRs can be context switched via XSAVES/XRSTORS.
Practically speaking, intercepting and emulating XSAVES/XRSTORS is not a viable
option due to complexity, and outright disallowing use of XSTATE to context
switch SHSTK/IBT state would render the features unusable to most guests.

To limit the overall complexity without sacrificing performance or usability,
simply ignore the potential virtualization hole, but ensure that all paths in
KVM treat SHSTK/IBT as usable by the guest if the feature is supported in
hardware, and the guest has access to at least one of SHSTK or IBT.  I.e. allow
userspace to advertise one of SHSTK or IBT if both are supported in hardware,
even though doing so would allow a misbehaving guest to use the unadvertised
feature.

Fully emulating SHSTK and IBT would also require significant complexity, e.g.
to track and update branch state for IBT, and shadow stack state for SHSTK.
Given that emulating large swaths of the guest code stream isn't necessary on
modern CPUs, punt on emulating instructions that meaningfully impact or consume
SHSTK or IBT.  However, instead of doing nothing, explicitly reject emulation
of such instructions so that KVM's emulator can't be abused to circumvent CET.
Disable support for SHSTK and IBT if KVM is configured such that emulation of
arbitrary guest instructions may be required, specifically if Unrestricted
Guest (Intel only) is disabled, or if KVM will emulate a guest.MAXPHYADDR that
is smaller than host.MAXPHYADDR.

Lastly, disable SHSTK support if shadow paging is enabled, as the protections
for the shadow stack are novel (shadow stacks require Writable=0,Dirty=1, so
that they can't be directly modified by software), i.e. would require
non-trivial support in the Shadow MMU.

Note, AMD CPUs currently only support SHSTK.  Explicitly disable IBT support
so that KVM doesn't over-advertise if AMD CPUs add IBT, and virtualizing IBT
in SVM requires KVM modifications.
2025-09-30 13:37:14 -04:00
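
The CALL/RET and ENDBRANCH behavior described in the merge message above can be modeled with a toy CPU. This is a hedged sketch, not how hardware or KVM implements CET: the struct, the fixed stack depth, and the function names are all hypothetical, and a `false` return stands in for a #CP exception. The only architectural detail used is the ENDBR64 encoding (F3 0F 1E FA).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_STACK_DEPTH 16

struct toy_cpu {
	uint64_t data_stack[TOY_STACK_DEPTH];
	uint64_t shadow_stack[TOY_STACK_DEPTH];
	size_t data_sp;		/* number of entries on the data stack */
	size_t shadow_sp;	/* number of entries on the shadow stack */
};

/* CALL: push the return address on both stacks; false if the toy stack is full. */
static bool toy_call(struct toy_cpu *cpu, uint64_t ret_addr)
{
	if (cpu->data_sp == TOY_STACK_DEPTH || cpu->shadow_sp == TOY_STACK_DEPTH)
		return false;
	cpu->data_stack[cpu->data_sp++] = ret_addr;
	cpu->shadow_stack[cpu->shadow_sp++] = ret_addr;
	return true;
}

/*
 * RET: pop from both stacks and compare.  Returns true and sets *target on a
 * match; false models a #CP (or an empty toy stack).
 */
static bool toy_ret(struct toy_cpu *cpu, uint64_t *target)
{
	uint64_t data, shadow;

	if (cpu->data_sp == 0 || cpu->shadow_sp == 0)
		return false;
	data = cpu->data_stack[--cpu->data_sp];
	shadow = cpu->shadow_stack[--cpu->shadow_sp];
	if (data != shadow)
		return false;
	*target = data;
	return true;
}

/* IBT: the target of an indirect branch must begin with ENDBR64. */
static bool toy_ibt_target_ok(const uint8_t *target, size_t len)
{
	static const uint8_t endbr64[] = { 0xf3, 0x0f, 0x1e, 0xfa };

	return len >= sizeof(endbr64) &&
	       memcmp(target, endbr64, sizeof(endbr64)) == 0;
}
```

Overwriting the return address on the data stack, as a ROP chain would, makes `toy_ret()` fail because the shadow copy no longer matches.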
Paolo Bonzini d05ca6b793 KVM x86 changes for 6.18
Merge tag 'kvm-x86-misc-6.18' of https://github.com/kvm-x86/linux into HEAD

KVM x86 changes for 6.18

 - Don't (re)check L1 intercepts when completing userspace I/O to fix a flaw
   where a misbehaving userspace (a.k.a. syzkaller) could swizzle L1's
   intercepts and trigger a variety of WARNs in KVM.

 - Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the MSR is
   supposed to exist for v2 PMUs.

 - Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs.

 - Clean up KVM's vector hashing code for delivering lowest priority IRQs.

 - Clean up the fastpath handler code to only handle IPIs and WRMSRs that are
   actually "fast", as opposed to handling those that KVM _hopes_ are fast, and
   in the process of doing so add fastpath support for TSC_DEADLINE writes on
   AMD CPUs.

 - Clean up a pile of PMU code in anticipation of adding support for mediated
   vPMUs.

 - Add support for the immediate forms of RDMSR and WRMSRNS, sans full
   emulator support (KVM should never need to emulate the MSRs outside of
   forced emulation and other contrived testing scenarios).

 - Clean up the MSR APIs in preparation for CET and FRED virtualization, as
   well as mediated vPMU support.

 - Reject a fully in-kernel IRQCHIP if EOIs are protected, i.e. for TDX VMs,
   as KVM can't faithfully emulate an I/O APIC for such guests.

 - Rework KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS in preparation
   for mediated vPMU support, as KVM will need to recalculate MSR intercepts in
   response to PMU refreshes for guests with mediated vPMUs.

 - Misc cleanups and minor fixes.
2025-09-30 13:36:41 -04:00
Paolo Bonzini a104e0a305 KVM SVM changes for 6.18
Merge tag 'kvm-x86-svm-6.18' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.18

 - Require a minimum GHCB version of 2 when starting SEV-SNP guests via
   KVM_SEV_INIT2 so that invalid GHCB versions result in immediate errors
   instead of latent guest failures.

 - Add support for Secure TSC for SEV-SNP guests, which prevents the untrusted
   host from tampering with the guest's TSC frequency, while still allowing the
   VMM to configure the guest's TSC frequency prior to launch.

 - Mitigate the potential for TOCTOU bugs when accessing GHCB fields by
   wrapping all accesses via READ_ONCE().

 - Validate the XCR0 provided by the guest (via the GHCB) to avoid tracking a
   bogus XCR0 value in KVM's software model.

 - Save an SEV guest's policy if and only if LAUNCH_START fully succeeds to
   avoid leaving behind stale state (thankfully not consumed in KVM).

 - Explicitly reject non-positive effective lengths during SNP's LAUNCH_UPDATE
   instead of subtly relying on guest_memfd to do the "heavy" lifting.

 - Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the host's
   desired TSC_AUX, to fix a bug where KVM could clobber a different vCPU's
   TSC_AUX due to hardware not matching the value cached in the user-return MSR
   infrastructure.

 - Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is supported,
   and clean up the AVIC initialization code along the way.
2025-09-30 13:34:12 -04:00
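
The TOCTOU mitigation mentioned in the list above (wrapping GHCB accesses in READ_ONCE()) follows a common pattern: read a guest-writable field exactly once, validate the snapshot, and use only the snapshot afterwards. A minimal userspace sketch, assuming a hypothetical shared page layout that is not the real GHCB, with a volatile load standing in for the kernel's READ_ONCE():

```c
#include <stdint.h>

#define TOY_BUF_MAX 64u

/* Hypothetical page shared with an untrusted guest; not the GHCB layout. */
struct toy_shared_page {
	uint32_t len;
	uint8_t data[TOY_BUF_MAX];
};

/* Force a single load of the field, in the spirit of READ_ONCE(). */
static uint32_t toy_read_once_u32(const uint32_t *p)
{
	return *(const volatile uint32_t *)p;
}

/*
 * Return the validated length, or 0 if it is out of range.  Callers use the
 * returned snapshot and never re-read page->len, so the guest cannot change
 * the value between the check and the use.
 */
static uint32_t toy_snapshot_len(const struct toy_shared_page *page)
{
	uint32_t len = toy_read_once_u32(&page->len);

	if (len > TOY_BUF_MAX)
		return 0;
	return len;
}
```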
Paolo Bonzini 5b0d0d8542 KVM x86 MMU changes for 6.18
Merge tag 'kvm-x86-mmu-6.18' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.18

 - Recover possible NX huge pages within the TDP MMU under read lock to
   reduce guest jitter when restoring NX huge pages.

 - Return -EAGAIN during prefault if userspace concurrently deletes/moves the
   relevant memslot to fix an issue where prefaulting could deadlock with the
   memslot update.

 - Don't retry in TDX's anti-zero-step mitigation if the target memslot is
   invalid, i.e. is being deleted or moved, to fix a deadlock scenario similar
   to the aforementioned prefaulting case.
2025-09-30 13:32:27 -04:00
Yang Weijiang b3744c59eb KVM: x86: Allow setting CR4.CET if IBT or SHSTK is supported
Drop X86_CR4_CET from CR4_RESERVED_BITS and instead mark CET as reserved
if and only if IBT *and* SHSTK are unsupported, i.e. allow CR4.CET to be
set if IBT or SHSTK is supported.  This creates a virtualization hole if
the CPU supports both IBT and SHSTK, but the kernel or vCPU model only
supports one of the features.  However, it's entirely legal for a CPU to
have only one of IBT or SHSTK, i.e. the hole is a flaw in the architecture,
not in KVM.

More importantly, so long as KVM is careful to initialize and context
switch both IBT and SHSTK state (when supported in hardware) if either
feature is exposed to the guest, a misbehaving guest can only harm itself.
E.g. VMX initializes host CET VMCS fields based solely on hardware
capabilities.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 09:17:48 -07:00
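
The rule in the commit above (CR4.CET is reserved if and only if both IBT and SHSTK are unsupported) can be sketched as a small predicate. The struct and helper names are hypothetical; only the bit position of CR4.CET (bit 23) is architectural.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_X86_CR4_CET (1ULL << 23)	/* CR4.CET is bit 23 */

struct toy_caps {
	bool has_ibt;
	bool has_shstk;
};

/* CET-related reserved bits in CR4 for the given (toy) capabilities. */
static uint64_t toy_cr4_cet_reserved(const struct toy_caps *caps)
{
	/* Reserved only when neither IBT nor SHSTK is supported. */
	return (!caps->has_ibt && !caps->has_shstk) ? TOY_X86_CR4_CET : 0;
}

/* A CR4 value is acceptable here if it sets no reserved CET bit. */
static bool toy_is_valid_cr4(const struct toy_caps *caps, uint64_t cr4)
{
	return !(cr4 & toy_cr4_cet_reserved(caps));
}
```

Note that a guest with only one of the two features may still set CR4.CET, which is the virtualization hole the commit message accepts.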
Sean Christopherson 296599346c KVM: x86/mmu: WARN on attempt to check permissions for Shadow Stack #PF
Add PFERR_SS_MASK, a.k.a. Shadow Stack access, and WARN if KVM attempts to
check permissions for a Shadow Stack access as KVM hasn't been taught to
understand the magic Writable=0,Dirty=1 combination that is required for
Shadow Stack accesses, and likely will never learn.  There are no plans to
support Shadow Stacks with the Shadow MMU, and the emulator rejects all
instructions that affect Shadow Stacks, i.e. it should be impossible for
KVM to observe a #PF due to a shadow stack access.

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 09:16:53 -07:00
Chao Gao 338543cbe0 KVM: x86: Check XSS validity against guest CPUIDs
Maintain per-guest valid XSS bits and check XSS validity against them
rather than against KVM capabilities. This is to prevent bits that are
supported by KVM but not supported for a guest from being set.

Opportunistically return KVM_MSR_RET_UNSUPPORTED on IA32_XSS MSR accesses
if guest CPUID doesn't enumerate X86_FEATURE_XSAVES. Since
KVM_MSR_RET_UNSUPPORTED takes care of host_initiated cases, drop the
host_initiated check.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 09:00:45 -07:00
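
A hedged sketch of the per-guest XSS check described in the commit above. The struct, enum, and function names are hypothetical stand-ins (the enum only loosely mirrors the idea of KVM_MSR_RET_UNSUPPORTED); the CET user and supervisor state components at XSS bits 11 and 12 are used purely as example features.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_XFEATURE_MASK_CET_USER	(1ULL << 11)
#define TOY_XFEATURE_MASK_CET_KERNEL	(1ULL << 12)

enum toy_msr_ret { TOY_MSR_OK, TOY_MSR_FAULT, TOY_MSR_UNSUPPORTED };

struct toy_xss_vcpu {
	bool guest_has_xsaves;		/* guest CPUID enumerates XSAVES */
	uint64_t guest_supported_xss;	/* XSS bits valid for this guest */
	uint64_t xss;			/* current guest IA32_XSS value */
};

/* Write IA32_XSS, checking against the guest's own valid bits. */
static enum toy_msr_ret toy_set_xss(struct toy_xss_vcpu *vcpu, uint64_t data)
{
	if (!vcpu->guest_has_xsaves)
		return TOY_MSR_UNSUPPORTED;

	/* Reject bits that KVM may support but this guest does not. */
	if (data & ~vcpu->guest_supported_xss)
		return TOY_MSR_FAULT;

	vcpu->xss = data;
	return TOY_MSR_OK;
}
```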
Sean Christopherson 5dca3808b2 KVM: x86: Merge 'svm' into 'cet' to pick up GHCB dependencies
Merge the queue of SVM changes for 6.18 to pick up the KVM-defined GHCB
helpers so that kvm_ghcb_get_xss() can be used to virtualize CET for
SEV-ES+ guests.
2025-09-23 08:59:49 -07:00
Hou Wenlong 9bc3663507 KVM: x86: Add helper to retrieve current value of user return MSR
In the user return MSR support, the cached value is always the hardware
value of the specific MSR. Therefore, add a helper to retrieve the
cached value, which can replace the need for RDMSR, for example, to
allow SEV-ES guests to restore the correct host hardware value without
using RDMSR.

Cc: stable@vger.kernel.org
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
[sean: drop "cache" from the name, make it a one-liner, tag for stable]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250923153738.1875174-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 08:55:20 -07:00
Sean Christopherson 4135a9a8cc KVM: SEV: Validate XCR0 provided by guest in GHCB
Use __kvm_set_xcr() to propagate XCR0 changes from the GHCB to KVM's
software model in order to validate the new XCR0 against KVM's view of
the supported XCR0.  Allowing garbage is thankfully mostly benign, as
kvm_load_{guest,host}_xsave_state() bail early for vCPUs with protected
state, xstate_required_size() will simply provide garbage back to the
guest, and attempting to save/restore the bad value via KVM_{G,S}ET_XCRS
will only harm the guest (setting XCR0 will fail).

However, allowing the guest to put junk into a field that KVM assumes is
valid is a CVE waiting to happen.  And as a bonus, using the proper API
eliminates the ugly open coding of setting arch.cpuid_dynamic_bits_dirty.

Simply ignore bad values, as either the guest managed to get an
unsupported value into hardware, or the guest is misbehaving and providing
pure garbage.  In either case, KVM can't fix the broken guest.

Note, using __kvm_set_xcr() also avoids recomputing dynamic CPUID bits
if XCR0 isn't actually changing (relative to KVM's previous snapshot).

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Fixes: 291bd20d5d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23 08:55:19 -07:00
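The "validate, and simply ignore bad values" behavior described in the entry above can be sketched as follows. This is not the body of __kvm_set_xcr(); the function names are hypothetical, and the validity rules shown are limited to the architectural basics (x87 state must be enabled, AVX state requires SSE state, no bits outside the supported set).

```c
#include <stdbool.h>
#include <stdint.h>

#define XFEATURE_FP  (1ULL << 0) /* x87 state, must always be set in XCR0 */
#define XFEATURE_SSE (1ULL << 1)
#define XFEATURE_YMM (1ULL << 2) /* AVX state, requires SSE state */

static bool xcr0_is_valid(uint64_t xcr0, uint64_t supported)
{
	if (!(xcr0 & XFEATURE_FP))
		return false;
	if ((xcr0 & XFEATURE_YMM) && !(xcr0 & XFEATURE_SSE))
		return false;
	return !(xcr0 & ~supported);
}

/*
 * Propagate a guest-provided XCR0 into the software model.  Returns true
 * only if the model changed, so the caller can skip recomputing dependent
 * state when the value is unchanged or was rejected.
 */
static bool sync_xcr0_from_guest(uint64_t *model_xcr0, uint64_t guest_xcr0,
				 uint64_t supported)
{
	if (guest_xcr0 == *model_xcr0)
		return false;
	if (!xcr0_is_valid(guest_xcr0, supported))
		return false; /* ignore garbage, keep the old snapshot */

	*model_xcr0 = guest_xcr0;
	return true;
}
```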
Sean Christopherson 6057497336 KVM: x86: Rework KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS
Rework the MSR_FILTER_CHANGED request into a more generic RECALC_INTERCEPTS
request, and expand the responsibilities of vendor code to recalculate all
intercepts that vary based on userspace input, e.g. instruction intercepts
that are tied to guest CPUID.

Providing a generic recalc request will allow the upcoming mediated PMU
support to trigger a recalc when PMU features, e.g. PERF_CAPABILITIES, are
set by userspace, without having to make multiple calls to/from PMU code.
As a bonus, using a request will effectively coalesce recalcs, e.g. will
reduce the number of recalcs for normal usage from 3+ to 1 (vCPU create,
set CPUID, set PERF_CAPABILITIES (Intel only), set filter).

The downside is that MSR filter changes that are done in isolation will do
a small amount of unnecessary work, but that's already a relatively slow
path, and the cost of recalculating instruction intercepts is negligible.

Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18 12:57:18 -07:00
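The coalescing effect mentioned in the entry above follows from how a pending-request bit behaves: raising the same request several times before it is serviced results in a single recalculation. A minimal sketch, assuming a simple bitmask of pending requests; the names are hypothetical and much simpler than KVM's actual request machinery.

```c
#include <stdint.h>

#define REQ_RECALC_INTERCEPTS (1u << 0) /* hypothetical request bit */

struct vcpu_sketch {
	uint32_t requests;         /* pending request bits */
	unsigned int recalc_count; /* how many recalcs actually ran */
};

/* Raising a request is idempotent: the bit is either set or it is not. */
static void make_request(struct vcpu_sketch *v, uint32_t req)
{
	v->requests |= req;
}

/* Called once before entering the guest; handles each pending bit once. */
static void service_requests(struct vcpu_sketch *v)
{
	if (v->requests & REQ_RECALC_INTERCEPTS) {
		v->requests &= ~REQ_RECALC_INTERCEPTS;
		v->recalc_count++; /* stands in for the vendor recalc hook */
	}
}
```

With this shape, vCPU creation, a CPUID update, and a filter update that all happen before the first guest entry cost one recalculation rather than three.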
Dapeng Mi 06dc910f5e KVM: x86/pmu: Correct typo "_COUTNERS" to "_COUNTERS"
Fix typos. "_COUTNERS" -> "_COUNTERS".

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
Link: https://lore.kernel.org/r/20250718001905.196989-2-dapeng1.mi@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-16 12:55:09 -07:00
Sagi Shahar b3a37bff8d KVM: TDX: Reject fully in-kernel irqchip if EOIs are protected, i.e. for TDX VMs
Reject KVM_CREATE_IRQCHIP if the VM type has protected EOIs, i.e. if KVM
can't intercept EOI and thus can't faithfully emulate level-triggered
interrupts that are routed through the I/O APIC.  For TDX VMs, the
TDX-Module owns the VMX EOI-bitmap and configures all IRQ vectors to have
the CPU accelerate EOIs, i.e. doesn't allow KVM to intercept any EOIs.

KVM already requires a split irqchip[1], but does so during vCPU creation,
which is both too late to allow userspace to fall back to a split irqchip
and a less-than-stellar experience for userspace since an -EINVAL on
KVM_CREATE_VCPU is far harder to debug/triage than failure exactly on
KVM_CREATE_IRQCHIP.  And of course, allowing an action that ultimately
fails is arguably a bug regardless of the impact on userspace.

Link: https://lore.kernel.org/lkml/20250222014757.897978-11-binbin.wu@linux.intel.com [1]
Link: https://lore.kernel.org/lkml/aK3vZ5HuKKeFuuM4@google.com
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sagi Shahar <sagis@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250827011726.2451115-1-sagis@google.com
[sean: massage shortlog+changelog, relocate setting has_protected_eoi]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-16 12:54:15 -07:00
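The "fail at the point of the request" idea from the entry above can be sketched as an early check in the irqchip creation path. The struct and function below are hypothetical; only the has_protected_eoi flag name is taken from the changelog, and the real ioctl handler does considerably more.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified VM state. */
struct vm_sketch {
	bool has_protected_eoi; /* KVM cannot intercept EOIs for this VM */
	bool irqchip_in_kernel; /* full in-kernel irqchip already created */
};

static int create_full_irqchip(struct vm_sketch *vm)
{
	/* Reject up front so userspace can fall back to a split irqchip. */
	if (vm->has_protected_eoi)
		return -EINVAL;

	if (vm->irqchip_in_kernel)
		return -EEXIST;

	vm->irqchip_in_kernel = true;
	return 0;
}
```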
Sean Christopherson b7d97f69ed KVM: x86/mmu: Extend guest_memfd's max mapping level to shared mappings
Rework kvm_mmu_max_mapping_level() to consult guest_memfd for all mappings,
not just private mappings, so that hugepage support plays nice with the
upcoming support for backing non-private memory with guest_memfd.

In addition to getting the max order from guest_memfd for gmem-only
memslots, update TDX's hook to effectively ignore shared mappings, as TDX's
restrictions on page size only apply to Secure EPT mappings.  Do nothing
for SNP, as RMP restrictions apply to both private and shared memory.

Suggested-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20250729225455.670324-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:01 -04:00
Ackerley Tng d6c840adfe KVM: x86/mmu: Rename .private_max_mapping_level() to .gmem_max_mapping_level()
Rename kvm_x86_ops.private_max_mapping_level() to .gmem_max_mapping_level()
in anticipation of extending guest_memfd support to non-private memory.

No functional change intended.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Message-ID: <20250729225455.670324-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Fuad Tabba d1e54dd08f KVM: x86: Enable KVM_GUEST_MEMFD for all 64-bit builds
Enable KVM_GUEST_MEMFD for all KVM x86 64-bit builds, i.e. for "default"
VM types when running on 64-bit KVM.  This will allow using guest_memfd
to back non-private memory for all VM shapes, by supporting mmap() on
guest_memfd.

Opportunistically clean up various conditionals that become tautologies
once x86 selects KVM_GUEST_MEMFD more broadly.  Specifically, because
SW protected VMs, SEV, and TDX are all 64-bit only, private memory no
longer needs to take explicit dependencies on KVM_GUEST_MEMFD, because
it is effectively a prerequisite.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:35:00 -04:00
Fuad Tabba 19a9a1ab5c KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GUEST_MEMFD
Rename the Kconfig option CONFIG_KVM_PRIVATE_MEM to
CONFIG_KVM_GUEST_MEMFD. The original name implied that the feature only
supported "private" memory. However, CONFIG_KVM_PRIVATE_MEM enables
guest_memfd in general, which is not exclusively for private memory.
Subsequent patches in this series will add guest_memfd support for
non-CoCo VMs, whose memory is not private.

Renaming the Kconfig option to CONFIG_KVM_GUEST_MEMFD more accurately
reflects its broader scope as the main Kconfig option for all
guest_memfd-backed memory. This provides clearer semantics for the
option and avoids confusion as new features are introduced.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27 04:34:59 -04:00
Yang Weijiang c2aa58b226 KVM: x86: Add kvm_msr_{read,write}() helpers
Wrap __kvm_{get,set}_msr() into two new helpers for KVM usage and use the
helpers to replace existing usage of the raw functions.
kvm_msr_{read,write}() are KVM-internal helpers, i.e. used when KVM needs
to get/set a MSR value for emulating CPU behavior, i.e., host_initiated ==
%true in the helpers.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-4-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19 11:59:49 -07:00
Yang Weijiang d2dcf25a4c KVM: x86: Rename kvm_{g,s}et_msr()* to show that they emulate guest accesses
Rename
	kvm_{g,s}et_msr_with_filter()
	kvm_{g,s}et_msr()
to
	kvm_emulate_msr_{read,write}
	__kvm_emulate_msr_{read,write}

to make it more obvious that KVM uses these helpers to emulate guest
behaviors, i.e., host_initiated == false in these helpers.

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-2-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19 11:59:48 -07:00
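The naming split between kvm_msr_{read,write}() and kvm_emulate_msr_{read,write}() comes down to which value of host_initiated the wrapper passes to the common code. The sketch below shows that shape for the read side only. All names are hypothetical, and the "guest_visible" check is an assumed stand-in for whatever guest-only restrictions the real code applies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a single MSR. */
struct msr_sketch {
	uint64_t value;
	bool guest_visible; /* guest accesses are allowed */
};

/* Common path: only guest-initiated accesses are subject to the check. */
static int msr_read_common(const struct msr_sketch *msr, uint64_t *data,
			   bool host_initiated)
{
	if (!host_initiated && !msr->guest_visible)
		return 1; /* the caller would turn this into a fault */

	*data = msr->value;
	return 0;
}

/* Emulating a guest RDMSR: host_initiated == false. */
static int sketch_emulate_msr_read(const struct msr_sketch *msr, uint64_t *data)
{
	return msr_read_common(msr, data, false);
}

/* KVM-internal read used while emulating CPU behavior: host_initiated == true. */
static int sketch_msr_read(const struct msr_sketch *msr, uint64_t *data)
{
	return msr_read_common(msr, data, true);
}
```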
Xin Li d90ebf5a06 KVM: x86: Advertise support for the immediate form of MSR instructions
Advertise support for the immediate form of MSR instructions to userspace
if the instructions are supported by the underlying CPU, and KVM is using
VMX, i.e. is running on an Intel-compatible CPU.

For SVM, explicitly clear X86_FEATURE_MSR_IMM to ensure KVM doesn't over-
report support if AMD-compatible CPUs ever implement the immediate forms,
as SVM will likely require explicit enablement in KVM.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20250805202224.1475590-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19 11:59:47 -07:00
Xin Li 885df2d210 KVM: x86: Add support for RDMSR/WRMSRNS w/ immediate on Intel
Add support for the immediate forms of RDMSR and WRMSRNS (currently
Intel-only).  The immediate variants are only valid in 64-bit mode, and
use a single general purpose register for the data (the register is also
encoded in the instruction, i.e. not implicit like regular RDMSR/WRMSR).

The immediate variants are primarily motivated by performance, not code
size: by having the MSR index in an immediate, it is available *much*
earlier in the CPU pipeline, which allows hardware much more leeway about
how a particular MSR is handled.

Intel VMX support for the immediate forms of MSR accesses communicates
exit information to the host as follows:

  1) The immediate form of RDMSR uses VM-Exit Reason 84.

  2) The immediate form of WRMSRNS uses VM-Exit Reason 85.

  3) For both VM-Exit reasons 84 and 85, the Exit Qualification field is
     set to the MSR index that triggered the VM-Exit.

  4) Bits 3 ~ 6 of the VM-Exit Instruction Information field are set to
     the register encoding used by the immediate form of the instruction,
     i.e. the destination register for RDMSR, and the source for WRMSRNS.

  5) The VM-Exit Instruction Length field records the size of the
     immediate form of the MSR instruction.

To deal with userspace RDMSR exits, stash the destination register in a
new kvm_vcpu_arch field, similar to cui_linear_rip, pio, etc.
Alternatively, the register could be saved in kvm_run.msr or re-retrieved
from the VMCS, but the former would require sanitizing the value to ensure
userspace doesn't clobber the value to an out-of-bounds index, and the
latter would require a new one-off kvm_x86_ops hook.

Don't bother adding support for the instructions in KVM's emulator, as the
only way for RDMSR/WRMSR to be encountered is if KVM is emulating large
swaths of code due to invalid guest state, and a vCPU cannot have invalid
guest state while in 64-bit mode.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: minor tweaks, massage and expand changelog]
Link: https://lore.kernel.org/r/20250805202224.1475590-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19 11:59:46 -07:00
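Items 1) through 4) in the entry above can be sketched as a small decoder. The exit reason numbers, the use of the Exit Qualification for the MSR index, and bits 3 to 6 of the Instruction Information field come from the changelog; the struct, macro, and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXIT_REASON_RDMSR_IMM   84 /* immediate form of RDMSR */
#define EXIT_REASON_WRMSRNS_IMM 85 /* immediate form of WRMSRNS */

/* Hypothetical decoded form of an immediate MSR exit. */
struct msr_imm_exit {
	uint32_t msr;     /* MSR index, from the Exit Qualification */
	unsigned int reg; /* GPR encoding, from Instruction Information */
	bool is_write;    /* true for WRMSRNS, false for RDMSR */
};

static bool decode_msr_imm_exit(uint32_t exit_reason, uint64_t exit_qual,
				uint32_t instr_info, struct msr_imm_exit *out)
{
	if (exit_reason != EXIT_REASON_RDMSR_IMM &&
	    exit_reason != EXIT_REASON_WRMSRNS_IMM)
		return false;

	out->msr = (uint32_t)exit_qual;
	out->reg = (instr_info >> 3) & 0xf; /* bits 3..6, four bits wide */
	out->is_write = exit_reason == EXIT_REASON_WRMSRNS_IMM;
	return true;
}
```

For a read exit, the decoded register is the destination that must be remembered until userspace completes the access, which is why the changelog stashes it in per-vCPU state.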
Sean Christopherson 5dfd498bad KVM: x86/pmu: Calculate set of to-be-emulated PMCs at time of WRMSRs
Calculate and track PMCs that are counting instructions/branches retired
when the PMC's event selector (or fixed counter control) is modified
instead of evaluating the event selector on-demand.  Immediately recalc a
PMC's configuration on writes to avoid false negatives/positives when
KVM skips an emulated WRMSR, which is guaranteed to occur before the
main run loop processes KVM_REQ_PMU.

Out of an abundance of caution, and because it's relatively cheap, recalc
reprogrammed PMCs in kvm_pmu_handle_event() as well.  Recalculating in
response to KVM_REQ_PMU _should_ be unnecessary, but for now be paranoid
to avoid introducing easily-avoidable bugs in edge cases.  The code can be
removed in the future if necessary, e.g. in the unlikely event that the
overhead of recalculating to-be-emulated PMCs is noticeable.

Note!  Deliberately don't check the PMU event filters, as doing so could
result in KVM consuming stale information.

Tracking which PMCs are counting branches/instructions will allow grabbing
SRCU in the fastpath VM-Exit handlers if and only if a PMC event might be
triggered (to consult the event filters), and will also allow the upcoming
mediated PMU to do the right thing with respect to counting instructions
(the mediated PMU won't be able to update PMCs in the VM-Exit fastpath).

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19 11:59:38 -07:00
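The write-time tracking described in the entry above can be sketched with two bitmaps that are updated whenever an event selector is written. The struct and function names are hypothetical. The encodings used are the Intel architectural events for instructions retired (event 0xC0, umask 0x00) and branch instructions retired (event 0xC4, umask 0x00) plus the enable bit at position 22; fixed counters, AMD encodings, and event filters are deliberately left out of this sketch.

```c
#include <stdint.h>

#define EVENTSEL_EVENT_UMASK 0xffffu     /* event select [7:0], umask [15:8] */
#define EVENTSEL_ENABLE      (1u << 22)  /* counter enable bit */
#define EVT_INSNS_RETIRED    0x00c0u     /* event 0xC0, umask 0x00 */
#define EVT_BRANCHES_RETIRED 0x00c4u     /* event 0xC4, umask 0x00 */

/* Hypothetical PMU state: one bit per general purpose counter. */
struct pmu_sketch {
	uint64_t pmc_counting_instructions;
	uint64_t pmc_counting_branches;
};

/* Recompute the tracking bits for counter 'idx' at the time of the write. */
static void pmc_eventsel_written(struct pmu_sketch *pmu, unsigned int idx,
				 uint64_t eventsel)
{
	uint64_t bit = 1ULL << idx;
	uint64_t event = eventsel & EVENTSEL_EVENT_UMASK;

	/* Start from a clean slate so stale tracking never survives a write. */
	pmu->pmc_counting_instructions &= ~bit;
	pmu->pmc_counting_branches &= ~bit;

	if (!(eventsel & EVENTSEL_ENABLE))
		return;

	if (event == EVT_INSNS_RETIRED)
		pmu->pmc_counting_instructions |= bit;
	else if (event == EVT_BRANCHES_RETIRED)
		pmu->pmc_counting_branches |= bit;
}
```

A fastpath handler could then test whether either bitmap is non-zero before doing any further work for an emulated instruction.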
Vipin Sharma 6777885605 KVM: x86/mmu: Track possible NX huge pages separately for TDP vs. Shadow MMU
Track possible NX huge pages for the TDP MMU separately from Shadow MMUs
in anticipation of doing recovery for the TDP MMU while holding mmu_lock
for read instead of write.

Use a small structure to hold the list of pages along with the number of
pages/entries in the list, as relying on kvm->stat.nx_lpage_splits to
calculate the number of pages to recover would result in over-zapping
when both TDP and Shadow MMUs are active.

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Co-developed-by: James Houghton <jthoughton@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250707224720.4016504-2-jthoughton@google.com
[sean: rewrite changelog, use #ifdef instead of dummy KVM_TDP_MMU #define]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19 07:39:10 -07:00
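The over-zapping concern in the entry above is easiest to see with numbers. The sketch below keeps one counter per MMU type and derives the recovery target from that counter alone. The type and function names are hypothetical, the list of pages itself is omitted, and the "one in every ratio pages, rounded up" rule is an assumption made for illustration.

```c
enum mmu_type_sketch { MMU_SHADOW, MMU_TDP, MMU_NR_TYPES };

/* Hypothetical per-MMU tracking; the list head is omitted in this sketch. */
struct possible_nx_huge_pages_sketch {
	unsigned long nr_pages; /* entries currently on this MMU's list */
};

struct vm_mmu_sketch {
	struct possible_nx_huge_pages_sketch nx[MMU_NR_TYPES];
};

/* Recover roughly one in every 'ratio' pages, rounded up; 0 disables recovery. */
static unsigned long nx_pages_to_recover(const struct vm_mmu_sketch *mmu,
					 enum mmu_type_sketch type,
					 unsigned int ratio)
{
	unsigned long n = mmu->nx[type].nr_pages;

	return ratio ? (n + ratio - 1) / ratio : 0;
}
```

With 6 shadow pages, 10 TDP pages, and a ratio of 4, the per-MMU targets are 2 and 3. Deriving the target from a single VM-wide count of 16 would give 4 for each MMU, which exceeds both.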
Paolo Bonzini d7f4aac280 Merge tag 'kvm-x86-mmu-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.17

 - Exempt nested EPT from the !USER + CR0.WP logic, as EPT doesn't interact
   with CR0.WP.

 - Move the TDX hardware setup code to tdx.c to better co-locate TDX code
   and eliminate a few global symbols.

 - Dynamically allocate the shadow MMU's hashed page list, and defer
   allocating the hashed list until it's actually needed (the TDP MMU doesn't
   use the list).
2025-07-29 08:36:43 -04:00
Paolo Bonzini 1a14928e2e Merge tag 'kvm-x86-misc-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.17

 - Preserve the host's DEBUGCTL.FREEZE_IN_SMM (Intel only) when running the
   guest.  Failure to honor FREEZE_IN_SMM can bleed host state into the guest.

 - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter (Intel only) to
   prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF.

 - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the
   vCPU's CPUID model.

 - Rework the MSR interception code so that the SVM and VMX APIs are more or
   less identical.

 - Recalculate all MSR intercepts from the "source" on MSR filter changes, and
   drop the dedicated "shadow" bitmaps (and their awful "max" size defines).

 - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the
   nested SVM MSRPM offsets tracker can't handle an MSR.

 - Advertise support for LKGS (Load Kernel GS base), a new instruction that's
   loosely related to FRED, but is supported and enumerated independently.

 - Fix a user-triggerable WARN that syzkaller found by stuffing INIT_RECEIVED,
   a.k.a. WFS, and then putting the vCPU into VMX Root Mode (post-VMXON).  Use
   the same approach KVM uses for dealing with "impossible" emulation when
   running a !URG guest, and simply wait until KVM_RUN to detect that the vCPU
   has architecturally impossible state.

 - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of
   APERF/MPERF reads, so that a "properly" configured VM can "virtualize"
   APERF/MPERF (with many caveats).

 - Reject KVM_SET_TSC_KHZ if vCPUs have been created, as changing the "default"
   frequency is unsupported for VMs with a "secure" TSC, and there's no known
   use case for changing the default frequency for other VM types.
2025-07-29 08:36:43 -04:00
Paolo Bonzini 9de13951d5 Merge tag 'kvm-x86-no_assignment-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM VFIO device assignment cleanups for 6.17

Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking now
that KVM no longer uses assigned_device_count as a bad heuristic for "VM has
an irqbypass producer" or for "VM has access to host MMIO".
2025-07-29 08:36:42 -04:00
Paolo Bonzini f05efcfe07 Merge tag 'kvm-x86-mmio-6.17' of https://github.com/kvm-x86/linux into HEAD
KVM MMIO Stale Data mitigation cleanup for 6.17

Rework KVM's mitigation for the MMIO Stale Data vulnerability to track
whether or not a vCPU has access to (host) MMIO based on the MMU that will be
used when running in the guest.  The current approach doesn't actually detect
whether or not a guest has access to MMIO, and is prone to false negatives (and
to a lesser extent, false positives), as KVM_DEV_VFIO_FILE_ADD is optional, and
obviously only covers VFIO devices.
2025-07-29 08:36:28 -04:00