summaryrefslogtreecommitdiffstats
path: root/arch/x86/kvm/lapic.c
Commit message (Collapse)AuthorAgeFilesLines
* KVM: x86: Update APICv on APIC resetJan H. Schönherr2017-11-021-0/+5
| | | | | | | | | | | | | In kvm_apic_set_state() we update the hardware virtualized APIC after the full APIC state has been overwritten. Do the same, when the full APIC state has been reset in kvm_lapic_reset(). This updates some hardware state that was previously forgotten, as far as I can tell. Also, this allows removing some APIC-related reset code from vmx_vcpu_reset(). Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm,lapic: Justify use of swait_active()Davidlohr Bueso2017-09-151-0/+4
| | | | | | | A comment might serve future readers. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: hyperv: support HV_X64_MSR_TSC_FREQUENCY and HV_X64_MSR_APIC_FREQUENCYLadi Prosek2017-08-071-2/+0Star
| | | | | | | | | | | | | | | | | It has been experimentally confirmed that supporting these two MSRs is one of the necessary conditions for nested Hyper-V to use the TSC page. Modern Windows guests are noticeably slower when they fall back to reading timestamps from the HV_X64_MSR_TIME_REF_COUNT MSR instead of using the TSC page. The newly supported MSRs are advertised with the AccessFrequencyRegs partition privilege flag and CPUID.40000003H:EDX[8] "Support for determining timer frequencies is available" (both outside of the scope of this KVM patch). Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: LAPIC: Fix reentrancy issues with preempt notifiersWanpeng Li2017-07-261-3/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Preempt can occur in the preemption timer expiration handler: CPU0 CPU1 preemption timer vmexit handle_preemption_timer(vCPU0) kvm_lapic_expired_hv_timer hv_timer_is_use == true sched_out sched_in kvm_arch_vcpu_load kvm_lapic_restart_hv_timer restart_apic_timer start_hv_timer already-expired timer or sw timer triggerd in the window start_sw_timer cancel_hv_timer /* back in kvm_lapic_expired_hv_timer */ cancel_hv_timer WARN_ON(!apic->lapic_timer.hv_timer_in_use); ==> Oops This can be reproduced if CONFIG_PREEMPT is enabled. ------------[ cut here ]------------ WARNING: CPU: 4 PID: 2972 at /home/kernel/linux/arch/x86/kvm//lapic.c:1563 kvm_lapic_expired_hv_timer+0x9e/0xb0 [kvm] CPU: 4 PID: 2972 Comm: qemu-system-x86 Tainted: G OE 4.13.0-rc2+ #16 RIP: 0010:kvm_lapic_expired_hv_timer+0x9e/0xb0 [kvm] Call Trace: handle_preemption_timer+0xe/0x20 [kvm_intel] vmx_handle_exit+0xb8/0xd70 [kvm_intel] kvm_arch_vcpu_ioctl_run+0xdd1/0x1be0 [kvm] ? kvm_arch_vcpu_load+0x47/0x230 [kvm] ? kvm_arch_vcpu_load+0x62/0x230 [kvm] kvm_vcpu_ioctl+0x340/0x700 [kvm] ? kvm_vcpu_ioctl+0x340/0x700 [kvm] ? __fget+0xfc/0x210 do_vfs_ioctl+0xa4/0x6a0 ? __fget+0x11d/0x210 SyS_ioctl+0x79/0x90 do_syscall_64+0x81/0x220 entry_SYSCALL64_slow_path+0x25/0x25 ------------[ cut here ]------------ WARNING: CPU: 4 PID: 2972 at /home/kernel/linux/arch/x86/kvm//lapic.c:1498 cancel_hv_timer.isra.40+0x4f/0x60 [kvm] CPU: 4 PID: 2972 Comm: qemu-system-x86 Tainted: G W OE 4.13.0-rc2+ #16 RIP: 0010:cancel_hv_timer.isra.40+0x4f/0x60 [kvm] Call Trace: kvm_lapic_expired_hv_timer+0x3e/0xb0 [kvm] handle_preemption_timer+0xe/0x20 [kvm_intel] vmx_handle_exit+0xb8/0xd70 [kvm_intel] kvm_arch_vcpu_ioctl_run+0xdd1/0x1be0 [kvm] ? kvm_arch_vcpu_load+0x47/0x230 [kvm] ? kvm_arch_vcpu_load+0x62/0x230 [kvm] kvm_vcpu_ioctl+0x340/0x700 [kvm] ? kvm_vcpu_ioctl+0x340/0x700 [kvm] ? __fget+0xfc/0x210 do_vfs_ioctl+0xa4/0x6a0 ? __fget+0x11d/0x210 SyS_ioctl+0x79/0x90 do_syscall_64+0x81/0x220 entry_SYSCALL64_slow_path+0x25/0x25 This patch fixes it by making the caller of cancel_hv_timer, start_hv_timer and start_sw_timer be in preemption-disabled regions, which trivially avoid any reentrancy issue with preempt notifier. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> [Add more WARNs. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: LAPIC: Fix lapic timer injection delayWanpeng Li2017-06-291-1/+4
| | | | | | | | | | | | | | | | | | | If the TSC deadline timer is programmed really close to the deadline or even in the past, the computation in vmx_set_hv_timer will program the absolute target tsc value to vmcs preemption timer field w/ delta == 0, then plays a vmentry and an upcoming vmx preemption timer fire vmexit dance, the lapic timer injection is delayed due to this duration. Actually the lapic timer which is emulated by hrtimer can handle this correctly. This patch fixes it by firing the lapic timer and injecting a timer interrupt immediately during the next vmentry if the TSC deadline timer is programmed really close to the deadline or even in the past. This saves ~300 cycles on the tsc_deadline_timer test of apic.flat. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: lapic: reorganize restart_apic_timerPaolo Bonzini2017-06-291-45/+40Star
| | | | | | | | Move the code to cancel the hv timer into the caller, just before it starts the hrtimer. Check availability of the hv timer in start_hv_timer. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: lapic: reorganize start_hv_timerPaolo Bonzini2017-06-291-13/+27
| | | | | | | There are many cases in which the hv timer must be canceled. Split out a new function to avoid duplication. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: Fix virtual wire modeJan H. Schönherr2017-05-261-1/+2
| | | | | | | | | | | | | | | | | Intel SDM says, that at most one LAPIC should be configured with ExtINT delivery. KVM configures all LAPICs this way. This causes pic_unlock() to kick the first available vCPU from the internal KVM data structures. If this vCPU is not the BSP, but some not-yet-booted AP, the BSP may never realize that there is an interrupt. Fix that by enabling ExtINT delivery only for the BSP. This allows booting a Linux guest without a TSC in the above situation. Otherwise the BSP gets stuck in calibrate_delay_converge(). Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: X86: Fix preempt the preemption timer cancelWanpeng Li2017-05-261-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Preemption can occur during cancel preemption timer, and there will be inconsistent status in lapic, vmx and vmcs field. CPU0 CPU1 preemption timer vmexit handle_preemption_timer(vCPU0) kvm_lapic_expired_hv_timer vmx_cancel_hv_timer vmx->hv_deadline_tsc = -1 vmcs_clear_bits /* hv_timer_in_use still true */ sched_out sched_in kvm_arch_vcpu_load vmx_set_hv_timer write vmx->hv_deadline_tsc vmcs_set_bits /* back in kvm_lapic_expired_hv_timer */ hv_timer_in_use = false ... vmx_vcpu_run vmx_arm_hv_run write preemption timer deadline spurious preemption timer vmexit handle_preemption_timer(vCPU0) kvm_lapic_expired_hv_timer WARN_ON(!apic->lapic_timer.hv_timer_in_use); This can be reproduced sporadically during boot of L2 on a preemptible L1, causing a splat on L1. WARNING: CPU: 3 PID: 1952 at arch/x86/kvm/lapic.c:1529 kvm_lapic_expired_hv_timer+0xb5/0xd0 [kvm] CPU: 3 PID: 1952 Comm: qemu-system-x86 Not tainted 4.12.0-rc1+ #24 RIP: 0010:kvm_lapic_expired_hv_timer+0xb5/0xd0 [kvm] Call Trace: handle_preemption_timer+0xe/0x20 [kvm_intel] vmx_handle_exit+0xc9/0x15f0 [kvm_intel] ? lock_acquire+0xdb/0x250 ? lock_acquire+0xdb/0x250 ? kvm_arch_vcpu_ioctl_run+0xdf3/0x1ce0 [kvm] kvm_arch_vcpu_ioctl_run+0xe55/0x1ce0 [kvm] kvm_vcpu_ioctl+0x384/0x7b0 [kvm] ? kvm_vcpu_ioctl+0x384/0x7b0 [kvm] ? __fget+0xf3/0x210 do_vfs_ioctl+0xa4/0x700 ? __fget+0x114/0x210 SyS_ioctl+0x79/0x90 do_syscall_64+0x8f/0x750 ? trace_hardirqs_on_thunk+0x1a/0x1c entry_SYSCALL64_slow_path+0x25/0x25 This patch fixes it by disabling preemption while cancelling preemption timer. This way cancel_hv_timer is atomic with respect to kvm_arch_vcpu_load. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Merge branch 'akpm' (patches from Andrew)Linus Torvalds2017-05-091-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge more updates from Andrew Morton: - the rest of MM - various misc things - procfs updates - lib/ updates - checkpatch updates - kdump/kexec updates - add kvmalloc helpers, use them - time helper updates for Y2038 issues. We're almost ready to remove current_fs_time() but that awaits a btrfs merge. - add tracepoints to DAX * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits) drivers/staging/ccree/ssi_hash.c: fix build with gcc-4.4.4 selftests/vm: add a test for virtual address range mapping dax: add tracepoint to dax_insert_mapping() dax: add tracepoint to dax_writeback_one() dax: add tracepoints to dax_writeback_mapping_range() dax: add tracepoints to dax_load_hole() dax: add tracepoints to dax_pfn_mkwrite() dax: add tracepoints to dax_iomap_pte_fault() mtd: nand: nandsim: convert to memalloc_noreclaim_*() treewide: convert PF_MEMALLOC manipulations to new helpers mm: introduce memalloc_noreclaim_{save,restore} mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC mm/huge_memory.c: deposit a pgtable for DAX PMD faults when required mm/huge_memory.c: use zap_deposited_table() more time: delete CURRENT_TIME_SEC and CURRENT_TIME gfs2: replace CURRENT_TIME with current_time apparmorfs: replace CURRENT_TIME with current_time() lustre: replace CURRENT_TIME macro fs: ubifs: replace CURRENT_TIME_SEC with current_time fs: ufs: use ktime_get_real_ts64() for birthtime ...
| * mm: introduce kv[mz]alloc helpersMichal Hocko2017-05-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "kvmalloc", v5. There are many open coded kmalloc with vmalloc fallback instances in the tree. Most of them are not careful enough or simply do not care about the underlying semantic of the kmalloc/page allocator which means that a) some vmalloc fallbacks are basically unreachable because the kmalloc part will keep retrying until it succeeds b) the page allocator can invoke a really disruptive steps like the OOM killer to move forward which doesn't sound appropriate when we consider that the vmalloc fallback is available. As it can be seen implementing kvmalloc requires quite an intimate knowledge if the page allocator and the memory reclaim internals which strongly suggests that a helper should be implemented in the memory subsystem proper. Most callers, I could find, have been converted to use the helper instead. This is patch 6. There are some more relying on __GFP_REPEAT in the networking stack which I have converted as well and Eric Dumazet was not opposed [2] to convert them as well. [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com This patch (of 9): Using kmalloc with the vmalloc fallback for larger allocations is a common pattern in the kernel code. Yet we do not have any common helper for that and so users have invented their own helpers. Some of them are really creative when doing so. Let's just add kv[mz]alloc and make sure it is implemented properly. This implementation makes sure to not make a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also to not warn about allocation failures. This also rules out the OOM killer as the vmalloc is a more approapriate fallback than a disruptive user visible action. This patch also changes some existing users and removes helpers which are specific for them. In some cases this is not possible (e.g. ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and require GFP_NO{FS,IO} context which is not vmalloc compatible in general (note that the page table allocation is GFP_KERNEL). Those need to be fixed separately. While we are at it, document that __vmalloc{_node} about unsupported gfp mask because there seems to be a lot of confusion out there. kvmalloc_node will warn about GFP_KERNEL incompatible (which are not superset) flags to catch new abusers. Existing ones would have to die slowly. [sfr@canb.auug.org.au: f2fs fixup] Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Andreas Dilger <adilger@dilger.ca> [ext4 part] Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Revert "KVM: Support vCPU-based gfn->hva cache"Paolo Bonzini2017-05-031-10/+12
|/ | | | | | | | | | | | | | | | | | This reverts commit bbd6411513aa8ef3ea02abab61318daf87c1af1e. I've been sitting on this revert for too long and it unfortunately missed 4.11. It's also the reason why I haven't merged ring-based dirty tracking for 4.12. Using kvm_vcpu_memslots in kvm_gfn_to_hva_cache_init and kvm_vcpu_write_guest_offset_cached means that the MSR value can now be used to access SMRAM, simply by making it point to an SMRAM physical address. This is problematic because it lets the guest OS overwrite memory that it shouldn't be able to touch. Cc: stable@vger.kernel.org Fixes: bbd6411513aa8ef3ea02abab61318daf87c1af1e Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: Support vCPU-based gfn->hva cacheCao, Lei2017-02-161-12/+10Star
| | | | | | | | | | | | | Provide versions of struct gfn_to_hva_cache functions that take vcpu as a parameter instead of struct kvm. The existing functions are not needed anymore, so delete them. This allows dirty pages to be logged in the vcpu dirty ring, instead of the global dirty ring, for ring-based dirty memory tracking. Signed-off-by: Lei Cao <lei.cao@stratus.com> Message-Id: <CY1PR08MB19929BD2AC47A291FD680E83F04F0@CY1PR08MB1992.namprd08.prod.outlook.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: x86: do not use KVM_REQ_EVENT for APICv interrupt injectionPaolo Bonzini2017-02-151-7/+4Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since bf9f6ac8d749 ("KVM: Update Posted-Interrupts Descriptor when vCPU is blocked", 2015-09-18) the posted interrupt descriptor is checked unconditionally for PIR.ON. Therefore we don't need KVM_REQ_EVENT to trigger the scan and, if NMIs or SMIs are not involved, we can avoid the complicated event injection path. Calling kvm_vcpu_kick if PIR.ON=1 is also useless, though it has been there since APICv was introduced. However, without the KVM_REQ_EVENT safety net KVM needs to be much more careful about races between vmx_deliver_posted_interrupt and vcpu_enter_guest. First, the IPI for posted interrupts may be issued between setting vcpu->mode = IN_GUEST_MODE and disabling interrupts. If that happens, kvm_trigger_posted_interrupt returns true, but smp_kvm_posted_intr_ipi doesn't do anything about it. The guest is entered with PIR.ON, but the posted interrupt IPI has not been sent and the interrupt is only delivered to the guest on the next vmentry (if any). To fix this, disable interrupts before setting vcpu->mode. This ensures that the IPI is delayed until the guest enters non-root mode; it is then trapped by the processor causing the interrupt to be injected. Second, the IPI may be issued between kvm_x86_ops->sync_pir_to_irr(vcpu) and vcpu->mode = IN_GUEST_MODE. In this case, kvm_vcpu_kick is called but it (correctly) doesn't do anything because it sees vcpu->mode == OUTSIDE_GUEST_MODE. Again, the guest is entered with PIR.ON but no posted interrupt IPI is pending; this time, the fix for this is to move the RVI update after IN_GUEST_MODE. Both issues were mostly masked by the liberal usage of KVM_REQ_EVENT, though the second could actually happen with VT-d posted interrupts. In both race scenarios KVM_REQ_EVENT would cancel guest entry, resulting in another vmentry which would inject the interrupt. This saves about 300 cycles on the self_ipi_* tests of vmexit.flat. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: do not scan IRR twice on APICv vmentryPaolo Bonzini2017-02-151-3/+5
| | | | | | | | | | | Calls to apic_find_highest_irr are scanning IRR twice, once in vmx_sync_pir_from_irr and once in apic_search_irr. Change sync_pir_from_irr to get the new maximum IRR from kvm_apic_update_irr; now that it does the computation, it can also do the RVI write. In order to avoid complications in svm.c, make the callback optional. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: vmx: move sync_pir_to_irr from apic_find_highest_irr to callersPaolo Bonzini2017-02-151-3/+4
| | | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: preparatory changes for APICv cleanupsPaolo Bonzini2017-02-151-8/+17
| | | | | | | Add return value to __kvm_apic_update_irr/kvm_apic_update_irr. Move vmx_sync_pir_to_irr around. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: vmx: clear pending interrupts on KVM_SET_LAPICPaolo Bonzini2017-02-151-2/+1Star
| | | | | | | | Pending interrupts might be in the PI descriptor when the LAPIC is restored from an external state; we do not want them to be injected. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Merge branch 'x86/cpufeature' of ↵Radim Krčmář2017-01-171-0/+6
|\ | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next For AVX512_VPOPCNTDQ.
| * KVM: x86: flush pending lapic jump label updates on module unloadDavid Matlack2017-01-121-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled). These are implemented with delayed_work structs which can still be pending when the KVM module is unloaded. We've seen this cause kernel panics when the kvm_intel module is quickly reloaded. Use the new static_key_deferred_flush() API to flush pending updates on module unload. Signed-off-by: David Matlack <dmatlack@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: lapic: do not scan IRR when delivering an interruptPaolo Bonzini2017-01-091-4/+15
| | | | | | | | | | | | | | | | On interrupt delivery the PPR can only grow (except for auto-EOI), so it is impossible that non-auto-EOI interrupt delivery results in KVM_REQ_EVENT. We can therefore use __apic_update_ppr. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: lapic: do not set KVM_REQ_EVENT unnecessarily on PPR updatePaolo Bonzini2017-01-091-1/+2
| | | | | | | | | | | | | | | | On PPR update, we set KVM_REQ_EVENT unconditionally anytime PPR is lowered. But we can take into account IRR here already. Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: lapic: remove unnecessary KVM_REQ_EVENT on PPR updatePaolo Bonzini2017-01-091-12/+24
| | | | | | | | | | | | | | | | | | | | | | PPR needs to be updated whenever on every IRR read because we may have missed TPR writes that _increased_ PPR. However, these writes need not generate KVM_REQ_EVENT, because either KVM_REQ_EVENT has been set already in __apic_accept_irq, or we are going to process the interrupt right away. Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: vmx: speed up TPR below threshold vmexitsPaolo Bonzini2017-01-091-0/+6
| | | | | | | | | | | | | | | | Since we're already in VCPU context, all we have to do here is recompute the PPR value. That will in turn generate a KVM_REQ_EVENT if necessary. Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: allow hotplug of VCPU with APIC ID over 0xffRadim Krčmář2017-01-091-5/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | LAPIC after reset is in xAPIC mode, which poses a problem for hotplug of VCPUs with high APIC ID, because reset VCPU is waiting for INIT/SIPI, but there is no way to uniquely address it using xAPIC. From many possible options, we chose the one that also works on real hardware: accepting interrupts addressed to LAPIC's x2APIC ID even in xAPIC mode. KVM intentionally differs from real hardware, because real hardware (Knights Landing) does just "x2apic_id & 0xff" to decide whether to accept the interrupt in xAPIC mode and it can deliver one interrupt to more than one physical destination, e.g. 0x123 to 0x123 and 0x23. Fixes: 682f732ecf73 ("KVM: x86: bump MAX_VCPUS to 288") Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: make interrupt delivery fast and slow path behave the sameRadim Krčmář2017-01-091-12/+8Star
| | | | | | | | | | | | | | | | | | | | Slow path tried to prevent IPIs from x2APIC VCPUs from being delivered to xAPIC VCPUs and vice-versa. Make slow path behave like fast path, which never distinguished that. Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: x86: replace kvm_apic_id with kvm_{x,x2}apic_idRadim Krčmář2017-01-091-9/+22
|/ | | | | | | | | | | | There were three calls sites: - recalculate_apic_map and kvm_apic_match_physical_addr, where it would only complicate implementation of x2APIC hotplug; - in apic_debug, where it was still somewhat preserved, but keeping the old function just for apic_debug was not worth it Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* ktime: Cleanup ktime_set() usageThomas Gleixner2016-12-251-2/+2
| | | | | | | | | | ktime_set(S,N) was required for the timespec storage type and is still useful for situations where a Seconds and Nanoseconds part of a time value needs to be converted. For anything where the Seconds argument is 0, this is pointless and can be replaced with a simple assignment. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2016-12-141-73/+139
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Paolo Bonzini: "Small release, the most interesting stuff is x86 nested virt improvements. x86: - userspace can now hide nested VMX features from guests - nested VMX can now run Hyper-V in a guest - support for AVX512_4VNNIW and AVX512_FMAPS in KVM - infrastructure support for virtual Intel GPUs. PPC: - support for KVM guests on POWER9 - improved support for interrupt polling - optimizations and cleanups. s390: - two small optimizations, more stuff is in flight and will be in 4.11. ARM: - support for the GICv3 ITS on 32bit platforms" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits) arm64: KVM: pmu: Reset PMSELR_EL0.SEL to a sane value before entering the guest KVM: arm/arm64: timer: Check for properly initialized timer on init KVM: arm/arm64: vgic-v2: Limit ITARGETSR bits to number of VCPUs KVM: x86: Handle the kthread worker using the new API KVM: nVMX: invvpid handling improvements KVM: nVMX: check host CR3 on vmentry and vmexit KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry KVM: nVMX: propagate errors from prepare_vmcs02 KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation KVM: nVMX: support restore of VMX capability MSRs KVM: nVMX: generate non-true VMX MSRs based on true versions KVM: x86: Do not clear RFLAGS.TF when a singlestep trap occurs. KVM: x86: Add kvm_skip_emulated_instruction and use it. KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12 KVM: VMX: Reorder some skip_emulated_instruction calls KVM: x86: Add a return value to kvm_emulate_cpuid KVM: PPC: Book3S: Move prototypes for KVM functions into kvm_ppc.h ...
| * kvm: x86: CPUID.01H:EDX.APIC[bit 9] should mirror IA32_APIC_BASE[11]Jim Mattson2016-11-221-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | From the Intel SDM, volume 3, section 10.4.3, "Enabling or Disabling the Local APIC," When IA32_APIC_BASE[11] is 0, the processor is functionally equivalent to an IA-32 processor without an on-chip APIC. The CPUID feature flag for the APIC (see Section 10.4.2, "Presence of the Local APIC") is also set to 0. Signed-off-by: Jim Mattson <jmattson@google.com> [Changed subject tag from nVMX to x86.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * kvm: x86: avoid atomic operations on APICv vmentryPaolo Bonzini2016-11-031-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On some benchmarks (e.g. netperf with ioeventfd disabled), APICv posted interrupts turn out to be slower than interrupt injection via KVM_REQ_EVENT. This patch optimizes a bit the IRR update, avoiding expensive atomic operations in the common case where PI.ON=0 at vmentry or the PIR vector is mostly zero. This saves at least 20 cycles (1%) per vmexit, as measured by kvm-unit-tests' inl_from_qemu test (20 runs): | enable_apicv=1 | enable_apicv=0 | mean stdev | mean stdev ----------|-----------------|------------------ before | 5826 32.65 | 5765 47.09 after | 5809 43.42 | 5777 77.02 Of course, any change in the right column is just placebo effect. :) The savings are bigger if interrupts are frequent. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: use ktime_get instead of seeking the hrtimer_clock_basePaolo Bonzini2016-11-021-4/+4
| | | | | | | | | | | | The base clock for the LAPIC timer is always CLOCK_MONOTONIC. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer supportWanpeng Li2016-11-021-28/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most windows guests still utilize APIC Timer periodic/oneshot mode instead of tsc-deadline mode, and the APIC Timer periodic/oneshot mode are still emulated by high overhead hrtimer on host. This patch converts the expected expire time of the periodic/oneshot mode to guest deadline tsc in order to leverage VMX preemption timer logic for APIC Timer tsc-deadline mode. After each preemption timer vmexit preemption timer is restarted to emulate LVTT current-count register is automatically reloaded from the initial-count register when the count reaches 0. This patch reduces ~5600 cycles for each APIC Timer periodic mode operation virtualization. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> [Squashed with my fixes that were reviewed-by Paolo.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * KVM: LAPIC: rename start/cancel_hv_tscdeadline to start/cancel_hv_timerWanpeng Li2016-11-021-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | Rename start/cancel_hv_tscdeadline to start/cancel_hv_timer since they will handle both APIC Timer periodic/oneshot mode and tsc-deadline mode. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * KVM: LAPIC: introduce kvm_get_lapic_target_expiration_tsc()Wanpeng Li2016-11-021-0/+9
| | | | | | | | | | | | | | | | | | | | | | Introdce kvm_get_lapic_target_expiration_tsc() to get APIC Timer target deadline tsc. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * KVM: LAPIC: guarantee the timer is in tsc-deadline modeWanpeng Li2016-11-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | Check apic_lvtt_tscdeadline() mode directly instead of apic_lvtt_oneshot() and apic_lvtt_period() to guarantee the timer is in tsc-deadline mode when rdmsr MSR_IA32_TSCDEADLINE. Suggested-by: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * KVM: LAPIC: extract start_sw_period() to handle periodic/oneshot modeWanpeng Li2016-11-021-42/+47
| | | | | | | | | | | | | | | | | | | | | | Extract start_sw_period() to handle periodic/oneshot mode, it will be used by later patch. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
* | KVM: x86: fix out-of-bounds access in lapicRadim Krčmář2016-11-241-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cluster xAPIC delivery incorrectly assumed that dest_id <= 0xff. With enabled KVM_X2APIC_API_USE_32BIT_IDS in KVM_CAP_X2APIC_API, a userspace can send an interrupt with dest_id that results in out-of-bounds access. Found by syzkaller: BUG: KASAN: slab-out-of-bounds in kvm_irq_delivery_to_apic_fast+0x11fa/0x1210 at addr ffff88003d9ca750 Read of size 8 by task syz-executor/22923 CPU: 0 PID: 22923 Comm: syz-executor Not tainted 4.9.0-rc4+ #49 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [...] Call Trace: [...] __dump_stack lib/dump_stack.c:15 [...] dump_stack+0xb3/0x118 lib/dump_stack.c:51 [...] kasan_object_err+0x1c/0x70 mm/kasan/report.c:156 [...] print_address_description mm/kasan/report.c:194 [...] kasan_report_error mm/kasan/report.c:283 [...] kasan_report+0x231/0x500 mm/kasan/report.c:303 [...] __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:329 [...] kvm_irq_delivery_to_apic_fast+0x11fa/0x1210 arch/x86/kvm/lapic.c:824 [...] kvm_irq_delivery_to_apic+0x132/0x9a0 arch/x86/kvm/irq_comm.c:72 [...] kvm_set_msi+0x111/0x160 arch/x86/kvm/irq_comm.c:157 [...] kvm_send_userspace_msi+0x201/0x280 arch/x86/kvm/../../../virt/kvm/irqchip.c:74 [...] kvm_vm_ioctl+0xba5/0x1670 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3015 [...] vfs_ioctl fs/ioctl.c:43 [...] do_vfs_ioctl+0x18c/0x1040 fs/ioctl.c:679 [...] SYSC_ioctl fs/ioctl.c:694 [...] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685 [...] entry_SYSCALL_64_fastpath+0x1f/0xc2 Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: stable@vger.kernel.org Fixes: e45115b62f9a ("KVM: x86: use physical LAPIC array for logical x2APIC") Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
* KVM: lapic: don't recalculate apic map table twice when enabling LAPICWanpeng Li2016-08-191-2/+3
| | | | | | | | | | | | APIC map table is recalculated during reset APIC ID to the initial value when enabling LAPIC. This patch move the recalculate_apic_map() to the next branch since we don't need to recalculate apic map twice in current codes. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: lapic: fix access preemption timer stuff even if kernel_irqchip=offWanpeng Li2016-08-041-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BUG: unable to handle kernel NULL pointer dereference at 000000000000008c IP: [<ffffffffc04e0180>] kvm_lapic_hv_timer_in_use+0x10/0x20 [kvm] PGD 0 Oops: 0000 [#1] SMP Call Trace: kvm_arch_vcpu_load+0x86/0x260 [kvm] vcpu_load+0x46/0x60 [kvm] kvm_vcpu_ioctl+0x79/0x7c0 [kvm] ? __lock_is_held+0x54/0x70 do_vfs_ioctl+0x96/0x6a0 ? __fget_light+0x2a/0x90 SyS_ioctl+0x79/0x90 do_syscall_64+0x7c/0x1e0 entry_SYSCALL64_slow_path+0x25/0x25 RIP [<ffffffffc04e0180>] kvm_lapic_hv_timer_in_use+0x10/0x20 [kvm] RSP <ffff8800db1f3d70> CR2: 000000000000008c ---[ end trace a55fb79d2b3b4ee8 ]--- This can be reproduced steadily by kernel_irqchip=off. We should not access preemption timer stuff if lapic is emulated in userspace. This patch fix it by avoiding access preemption timer stuff when kernel_irqchip=off. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2016-08-021-212/+327
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Paolo Bonzini: - ARM: GICv3 ITS emulation and various fixes. Removal of the old VGIC implementation. - s390: support for trapping software breakpoints, nested virtualization (vSIE), the STHYI opcode, initial extensions for CPU model support. - MIPS: support for MIPS64 hosts (32-bit guests only) and lots of cleanups, preliminary to this and the upcoming support for hardware virtualization extensions. - x86: support for execute-only mappings in nested EPT; reduced vmexit latency for TSC deadline timer (by about 30%) on Intel hosts; support for more than 255 vCPUs. - PPC: bugfixes. * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (302 commits) KVM: PPC: Introduce KVM_CAP_PPC_HTM MIPS: Select HAVE_KVM for MIPS64_R{2,6} MIPS: KVM: Reset CP0_PageMask during host TLB flush MIPS: KVM: Fix ptr->int cast via KVM_GUEST_KSEGX() MIPS: KVM: Sign extend MFC0/RDHWR results MIPS: KVM: Fix 64-bit big endian dynamic translation MIPS: KVM: Fail if ebase doesn't fit in CP0_EBase MIPS: KVM: Use 64-bit CP0_EBase when appropriate MIPS: KVM: Set CP0_Status.KX on MIPS64 MIPS: KVM: Make entry code MIPS64 friendly MIPS: KVM: Use kmap instead of CKSEG0ADDR() MIPS: KVM: Use virt_to_phys() to get commpage PFN MIPS: Fix definition of KSEGX() for 64-bit KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD kvm: x86: nVMX: maintain internal copy of current VMCS KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures KVM: arm64: vgic-its: Simplify MAPI error handling KVM: arm64: vgic-its: Make vgic_its_cmd_handle_mapi similar to other handlers KVM: arm64: vgic-its: Turn device_id validation into generic ID validation ...
| * KVM: x86: bump KVM_MAX_VCPU_ID to 1023Radim Krčmář2016-07-141-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | kzalloc was replaced with kvm_kvzalloc to allow non-contiguous areas and rcu had to be modified to cope with it. The practical limit for KVM_MAX_VCPU_ID right now is INT_MAX, but lower value was chosen in case there were bugs. 1023 is sufficient maximum APIC ID for 288 VCPUs. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: add a flag to disable KVM x2apic broadcast quirkRadim Krčmář2016-07-141-13/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | Add KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK as a feature flag to KVM_CAP_X2APIC_API. The quirk made KVM interpret 0xff as a broadcast even in x2APIC mode. The enableable capability is needed in order to support standard x2APIC and remain backward compatible. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> [Expand kvm_apic_mda comment. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: add KVM_CAP_X2APIC_APIRadim Krčmář2016-07-141-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM_CAP_X2APIC_API is a capability for features related to x2APIC enablement. KVM_X2APIC_API_32BIT_FORMAT feature can be enabled to extend APIC ID in get/set ioctl and MSI addresses to 32 bits. Both are needed to support x2APIC. The feature has to be enableable and disabled by default, because get/set ioctl shifted and truncated APIC ID to 8 bits by using a non-standard protocol inspired by xAPIC and the change is not backward-compatible. Changes to MSI addresses follow the format used by interrupt remapping unit. The upper address word, that used to be 0, contains upper 24 bits of the LAPIC address in its upper 24 bits. Lower 8 bits are reserved as 0. Using the upper address word is not backward-compatible either as we didn't check that userspace zeroed the word. Reserved bits are still not explicitly checked, but non-zero data will affect LAPIC addresses, which will cause a bug. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: reset lapic base in kvm_lapic_resetRadim Krčmář2016-07-141-4/+4
| | | | | | | | | | | | | | | | | | LAPIC is reset in xAPIC mode and the surrounding code expects that. KVM never resets after initialization. This patch is just for sanity. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: reset APIC ID when enabling LAPICRadim Krčmář2016-07-141-2/+3
| | | | | | | | | | | | | | | | APIC ID should be set to the initial APIC ID when enabling LAPIC. This only matters if the guest changes APIC ID. No sane OS does that. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: use hardware-compatible format for APIC ID registerRadim Krčmář2016-07-141-15/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | We currently always shift APIC ID as if APIC was in xAPIC mode. x2APIC mode wants to use more bits and storing a hardware-compabible value is the the sanest option. KVM API to set the lapic expects that bottom 8 bits of APIC ID are in top 8 bits of APIC_ID register, so the register needs to be shifted in x2APIC mode. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: dynamic kvm_apic_mapRadim Krčmář2016-07-141-5/+13
| | | | | | | | | | | | | | | | | | | | | | | | x2APIC supports up to 2^32-1 LAPICs, but most guest in coming years will probably has fewer VCPUs. Dynamic size saves memory at the cost of turning one constant into a variable. apic_map mutex had to be moved before allocation to avoid races with cpu hotplug. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: use physical LAPIC array for logical x2APICRadim Krčmář2016-07-141-36/+37
| | | | | | | | | | | | | | | | | | Logical x2APIC IDs map injectively to physical x2APIC IDs, so we can reuse the physical array for them. This allows us to save space by sizing the logical maps according to the needs of xAPIC. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: add kvm_apic_map_get_dest_lapicRadim Krčmář2016-07-141-132/+98Star
| | | | | | | | | | | | | | | | | | | | | | | | kvm_irq_delivery_to_apic_fast and kvm_intr_is_single_vcpu_fast both compute the interrupt destination. Factor the code. 'struct kvm_lapic **dst = NULL' had to be added to silence GCC. GCC might complain about potential NULL access in the future, because it missed conditions that avoided uninitialized uses of dst. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>