summaryrefslogtreecommitdiffstats
path: root/arch/x86
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2019-08-117-28/+48
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A few fixes for x86: - Don't reset the carefully adjusted build flags for the purgatory and remove the unwanted flags instead. The 'reset all' approach led to build fails under certain circumstances. - Unbreak CLANG build of the purgatory by avoiding the builtin memcpy/memset implementations. - Address missing prototype warnings by including the proper header - Fix yet more fall-through issues" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/lib/cpu: Address missing prototypes warning x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGS x86/purgatory: Do not use __builtin_memcpy and __builtin_memset x86: mtrr: cyrix: Mark expected switch fall-through x86/ptrace: Mark expected switch fall-through
| * x86/lib/cpu: Address missing prototypes warningValdis Klētnieks2019-08-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When building with W=1, warnings about missing prototypes are emitted: CC arch/x86/lib/cpu.o arch/x86/lib/cpu.c:5:14: warning: no previous prototype for 'x86_family' [-Wmissing-prototypes] 5 | unsigned int x86_family(unsigned int sig) | ^~~~~~~~~~ arch/x86/lib/cpu.c:18:14: warning: no previous prototype for 'x86_model' [-Wmissing-prototypes] 18 | unsigned int x86_model(unsigned int sig) | ^~~~~~~~~ arch/x86/lib/cpu.c:33:14: warning: no previous prototype for 'x86_stepping' [-Wmissing-prototypes] 33 | unsigned int x86_stepping(unsigned int sig) | ^~~~~~~~~~~~ Add the proper include file so the prototypes are there. Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/42513.1565234837@turing-police
| * x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGSNick Desaulniers2019-08-081-5/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KBUILD_CFLAGS is very carefully built up in the top level Makefile, particularly when cross compiling or using different build tools. Resetting KBUILD_CFLAGS via := assignment is an antipattern. The comment above the reset mentions that -pg is problematic. Other Makefiles use `CFLAGS_REMOVE_file.o = $(CC_FLAGS_FTRACE)` when CONFIG_FUNCTION_TRACER is set. Prefer that pattern to wiping out all of the important KBUILD_CFLAGS then manually having to re-add them. Seems also that __stack_chk_fail references are generated when using CONFIG_STACKPROTECTOR or CONFIG_STACKPROTECTOR_STRONG. Fixes: 8fc5b4d4121c ("purgatory: core purgatory functionality") Reported-by: Vaibhav Rustagi <vaibhavrustagi@google.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vaibhav Rustagi <vaibhavrustagi@google.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190807221539.94583-2-ndesaulniers@google.com
| * x86/purgatory: Do not use __builtin_memcpy and __builtin_memsetNick Desaulniers2019-08-084-23/+17Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implementing memcpy and memset in terms of __builtin_memcpy and __builtin_memset is problematic. GCC at -O2 will replace calls to the builtins with calls to memcpy and memset (but will generate an inline implementation at -Os). Clang will replace the builtins with these calls regardless of optimization level. $ llvm-objdump -dr arch/x86/purgatory/string.o | tail 0000000000000339 memcpy: 339: 48 b8 00 00 00 00 00 00 00 00 movabsq $0, %rax 000000000000033b: R_X86_64_64 memcpy 343: ff e0 jmpq *%rax 0000000000000345 memset: 345: 48 b8 00 00 00 00 00 00 00 00 movabsq $0, %rax 0000000000000347: R_X86_64_64 memset 34f: ff e0 Such code results in infinite recursion at runtime. This is observed when doing kexec. Instead, reuse an implementation from arch/x86/boot/compressed/string.c. This requires to implement a stub function for warn(). Also, Clang may lower memcmp's that compare against 0 to bcmp's, so add a small definition, too. See also: commit 5f074f3e192f ("lib/string.c: implement a basic bcmp") Fixes: 8fc5b4d4121c ("purgatory: core purgatory functionality") Reported-by: Vaibhav Rustagi <vaibhavrustagi@google.com> Debugged-by: Vaibhav Rustagi <vaibhavrustagi@google.com> Debugged-by: Manoj Gupta <manojgupta@google.com> Suggested-by: Alistair Delva <adelva@google.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vaibhav Rustagi <vaibhavrustagi@google.com> Cc: stable@vger.kernel.org Link: https://bugs.chromium.org/p/chromium/issues/detail?id=984056 Link: https://lkml.kernel.org/r/20190807221539.94583-1-ndesaulniers@google.com
| * x86: mtrr: cyrix: Mark expected switch fall-throughGustavo A. R. Silva2019-08-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Mark switch cases where we are expecting to fall through. Fix the following warning (Building: i386_defconfig i386): arch/x86/kernel/cpu/mtrr/cyrix.c:99:6: warning: this statement may fall through [-Wimplicit-fallthrough=] Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20190805201712.GA19927@embeddedor
| * x86/ptrace: Mark expected switch fall-throughGustavo A. R. Silva2019-08-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mark switch cases where we are expecting to fall through. Fix the following warning (Building: allnoconfig i386): arch/x86/kernel/ptrace.c:202:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (unlikely(value == 0)) ^ arch/x86/kernel/ptrace.c:206:2: note: here default: ^~~~~~~ Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20190805195654.GA17831@embeddedor
* | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2019-08-107-49/+44Star
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm fixes from Paolo Bonzini: "Bugfixes (arm and x86) and cleanups" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: selftests: kvm: Adding config fragments KVM: selftests: Update gitignore file for latest changes kvm: remove unnecessary PageReserved check KVM: arm/arm64: vgic: Reevaluate level sensitive interrupts on enable KVM: arm: Don't write junk to CP15 registers on reset KVM: arm64: Don't write junk to sysregs on reset KVM: arm/arm64: Sync ICH_VMCR_EL2 back when about to block x86: kvm: remove useless calls to kvm_para_available KVM: no need to check return value of debugfs_create functions KVM: remove kvm_arch_has_vcpu_debugfs() KVM: Fix leak vCPU's VMCS value into other pCPU KVM: Check preempted_in_kernel for involuntary preemption KVM: LAPIC: Don't need to wakeup vCPU twice afer timer fire arm64: KVM: hyp: debug-sr: Mark expected switch fall-through KVM: arm64: Update kvm_arm_exception_class and esr_class_str for new EC KVM: arm: vgic-v3: Mark expected switch fall-through arm64: KVM: regmap: Fix unexpected switch fall-through KVM: arm/arm64: Introduce kvm_pmu_vcpu_init() to setup PMU counter index
| * Merge tag 'kvmarm-fixes-for-5.3' of ↵Paolo Bonzini2019-08-0968-504/+493Star
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm fixes for 5.3 - A bunch of switch/case fall-through annotation, fixing one actual bug - Fix PMU reset bug - Add missing exception class debug strings
| * | x86: kvm: remove useless calls to kvm_para_availablePaolo Bonzini2019-08-051-8/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most code in arch/x86/kernel/kvm.c is called through x86_hyper_kvm, and thus only runs if KVM has been detected. There is no need to check again for the CPUID base. Cc: Sergio Lopez <slp@redhat.com> Cc: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: no need to check return value of debugfs_create functionsGreg KH2019-08-051-28/+13Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Also, when doing this, change kvm_arch_create_vcpu_debugfs() to return void instead of an integer, as we should not care at all about if this function actually does anything or not. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <x86@kernel.org> Cc: <kvm@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: remove kvm_arch_has_vcpu_debugfs()Paolo Bonzini2019-08-052-5/+2Star
| | | | | | | | | | | | | | | | | | | | | | | | There is no need for this function as all arches have to implement kvm_arch_create_vcpu_debugfs() no matter what. A #define symbol let us actually simplify the code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: Fix leak vCPU's VMCS value into other pCPUWanpeng Li2019-08-054-0/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting in the VMs after stress testing: INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073) Call Trace: flush_tlb_mm_range+0x68/0x140 tlb_flush_mmu.part.75+0x37/0xe0 tlb_finish_mmu+0x55/0x60 zap_page_range+0x142/0x190 SyS_madvise+0x3cd/0x9c0 system_call_fastpath+0x1c/0x21 swait_active() sustains to be true before finish_swait() is called in kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account by kvm_vcpu_on_spin() loop greatly increases the probability condition kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv is enabled the yield-candidate vCPU's VMCS RVI field leaks(by vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current VMCS. This patch fixes it by checking conservatively a subset of events. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Marc Zyngier <Marc.Zyngier@arm.com> Cc: stable@vger.kernel.org Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: LAPIC: Don't need to wakeup vCPU twice afer timer fireWanpeng Li2019-08-051-8/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_set_pending_timer() will take care to wake up the sleeping vCPU which has pending timer, don't need to check this in apic_timer_expired() again. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | Merge branch 'x86/grand-schemozzle' of ↵Linus Torvalds2019-08-065-28/+162
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull pti updates from Thomas Gleixner: "The performance deterioration departement is not proud at all to present yet another set of speculation fences to mitigate the next chapter in the 'what could possibly go wrong' story. The new vulnerability belongs to the Spectre class and affects GS based data accesses and has therefore been dubbed 'Grand Schemozzle' for secret communication purposes. It's officially listed as CVE-2019-1125. Conditional branches in the entry paths which contain a SWAPGS instruction (interrupts and exceptions) can be mis-speculated which results in speculative accesses with a wrong GS base. This can happen on entry from user mode through a mis-speculated branch which takes the entry from kernel mode path and therefore does not execute the SWAPGS instruction. The following speculative accesses are done with user GS base. On entry from kernel mode the mis-speculated branch executes the SWAPGS instruction in the entry from user mode path which has the same effect that the following GS based accesses are done with user GS base. If there is a disclosure gadget available in these code paths the mis-speculated data access can be leaked through the usual side channels. The entry from user mode issue affects all CPUs which have speculative execution. The entry from kernel mode issue affects only Intel CPUs which can speculate through SWAPGS. On CPUs from other vendors SWAPGS has semantics which prevent that. SMAP migitates both problems but only when the CPU is not affected by the Meltdown vulnerability. The mitigation is to issue LFENCE instructions in the entry from kernel mode path for all affected CPUs and on the affected Intel CPUs also in the entry from user mode path unless PTI is enabled because the CR3 write is serializing. The fences are as usual enabled conditionally and can be completely disabled on the kernel command line. The Spectre V1 documentation is updated accordingly. A big "Thank You!" goes to Josh for doing the heavy lifting for this round of hardware misfeature 'repair'. Of course also "Thank You!" to everybody else who contributed in one way or the other" * 'x86/grand-schemozzle' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation: Add swapgs description to the Spectre v1 documentation x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS x86/entry/64: Use JMP instead of JMPQ x86/speculation: Enable Spectre v1 swapgs mitigations x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations
| * \ \ Merge branch master from ↵Thomas Gleixner2019-07-28178-3323/+4490
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Pick up the spectre documentation so the Grand Schemozzle can be added.
| * | | | x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGSThomas Gleixner2019-07-283-30/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Intel provided the following information: On all current Atom processors, instructions that use a segment register value (e.g. a load or store) will not speculatively execute before the last writer of that segment retires. Thus they will not use a speculatively written segment value. That means on ATOMs there is no speculation through SWAPGS, so the SWAPGS entry paths can be excluded from the extra LFENCE if PTI is disabled. Create a separate bug flag for the through SWAPGS speculation and mark all out-of-order ATOMs and AMD/HYGON CPUs as not affected. The in-order ATOMs are excluded from the whole mitigation mess anyway. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
| * | | | x86/entry/64: Use JMP instead of JMPQJosh Poimboeuf2019-07-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Somehow the swapgs mitigation entry code patch ended up with a JMPQ instruction instead of JMP, where only the short jump is needed. Some assembler versions apparently fail to optimize JMPQ into a two-byte JMP when possible, instead always using a 7-byte JMP with relocation. For some reason that makes the entry code explode with a #GP during boot. Change it back to "JMP" as originally intended. Fixes: 18ec54fdd6d1 ("x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations") Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | x86/speculation: Enable Spectre v1 swapgs mitigationsJosh Poimboeuf2019-07-091-9/+106
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous commit added macro calls in the entry code which mitigate the Spectre v1 swapgs issue if the X86_FEATURE_FENCE_SWAPGS_* features are enabled. Enable those features where applicable. The mitigations may be disabled with "nospectre_v1" or "mitigations=off". There are different features which can affect the risk of attack: - When FSGSBASE is enabled, unprivileged users are able to place any value in GS, using the wrgsbase instruction. This means they can write a GS value which points to any value in kernel space, which can be useful with the following gadget in an interrupt/exception/NMI handler: if (coming from user space) swapgs mov %gs:<percpu_offset>, %reg1 // dependent load or store based on the value of %reg // for example: mov %(reg1), %reg2 If an interrupt is coming from user space, and the entry code speculatively skips the swapgs (due to user branch mistraining), it may speculatively execute the GS-based load and a subsequent dependent load or store, exposing the kernel data to an L1 side channel leak. Note that, on Intel, a similar attack exists in the above gadget when coming from kernel space, if the swapgs gets speculatively executed to switch back to the user GS. On AMD, this variant isn't possible because swapgs is serializing with respect to future GS-based accesses. NOTE: The FSGSBASE patch set hasn't been merged yet, so the above case doesn't exist quite yet. - When FSGSBASE is disabled, the issue is mitigated somewhat because unprivileged users must use prctl(ARCH_SET_GS) to set GS, which restricts GS values to user space addresses only. That means the gadget would need an additional step, since the target kernel address needs to be read from user space first. Something like: if (coming from user space) swapgs mov %gs:<percpu_offset>, %reg1 mov (%reg1), %reg2 // dependent load or store based on the value of %reg2 // for example: mov %(reg2), %reg3 It's difficult to audit for this gadget in all the handlers, so while there are no known instances of it, it's entirely possible that it exists somewhere (or could be introduced in the future). Without tooling to analyze all such code paths, consider it vulnerable. Effects of SMAP on the !FSGSBASE case: - If SMAP is enabled, and the CPU reports RDCL_NO (i.e., not susceptible to Meltdown), the kernel is prevented from speculatively reading user space memory, even L1 cached values. This effectively disables the !FSGSBASE attack vector. - If SMAP is enabled, but the CPU *is* susceptible to Meltdown, SMAP still prevents the kernel from speculatively reading user space memory. But it does *not* prevent the kernel from reading the user value from L1, if it has already been cached. This is probably only a small hurdle for an attacker to overcome. Thanks to Dave Hansen for contributing the speculative_smap() function. Thanks to Andrew Cooper for providing the inside scoop on whether swapgs is serializing on AMD. [ tglx: Fixed the USER fence decision and polished the comment as suggested by Dave Hansen ] Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@intel.com>
| * | | | x86/speculation: Prepare entry code for Spectre v1 swapgs mitigationsJosh Poimboeuf2019-07-093-3/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Spectre v1 isn't only about array bounds checks. It can affect any conditional checks. The kernel entry code interrupt, exception, and NMI handlers all have conditional swapgs checks. Those may be problematic in the context of Spectre v1, as kernel code can speculatively run with a user GS. For example: if (coming from user space) swapgs mov %gs:<percpu_offset>, %reg mov (%reg), %reg1 When coming from user space, the CPU can speculatively skip the swapgs, and then do a speculative percpu load using the user GS value. So the user can speculatively force a read of any kernel value. If a gadget exists which uses the percpu value as an address in another load/store, then the contents of the kernel value may become visible via an L1 side channel attack. A similar attack exists when coming from kernel space. The CPU can speculatively do the swapgs, causing the user GS to get used for the rest of the speculative window. The mitigation is similar to a traditional Spectre v1 mitigation, except: a) index masking isn't possible; because the index (percpu offset) isn't user-controlled; and b) an lfence is needed in both the "from user" swapgs path and the "from kernel" non-swapgs path (because of the two attacks described above). The user entry swapgs paths already have SWITCH_TO_KERNEL_CR3, which has a CR3 write when PTI is enabled. Since CR3 writes are serializing, the lfences can be skipped in those cases. On the other hand, the kernel entry swapgs paths don't depend on PTI. To avoid unnecessary lfences for the user entry case, create two separate features for alternative patching: X86_FEATURE_FENCE_SWAPGS_USER X86_FEATURE_FENCE_SWAPGS_KERNEL Use these features in entry code to patch in lfences where needed. The features aren't enabled yet, so there's no functional change. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@intel.com>
* | | | | x86/vdso/32: Use 32bit syscall fallbackThomas Gleixner2019-07-311-0/+36
| |/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The generic VDSO implementation uses the Y2038 safe clock_gettime64() and clock_getres_time64() syscalls as fallback for 32bit VDSO. This breaks seccomp setups because these syscalls might be not (yet) allowed. Implement the 32bit variants which use the legacy syscalls and select the variant in the core library. The 64bit time variants are not removed because they are required for the time64 based vdso accessors. Fixes: 7ac870747988 ("x86/vdso: Switch to generic vDSO implementation") Reported-by: Sean Christopherson <sean.j.christopherson@intel.com> Reported-by: Paul Bolle <pebolle@tiscali.nl> Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20190728131648.879156507@linutronix.de
* | | | Merge tag 'spdx-5.3-rc2' of ↵Linus Torvalds2019-07-284-4/+4
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx Pull SPDX fixes from Greg KH: "Here are some small SPDX fixes for 5.3-rc2 for things that came in during the 5.3-rc1 merge window that we previously missed. Only three small patches here: - two uapi patches to resolve some SPDX tags that were not correct - fix an invalid SPDX tag in the iomap Makefile file All have been properly reviewed on the public mailing lists" * tag 'spdx-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: iomap: fix Invalid License ID treewide: remove SPDX "WITH Linux-syscall-note" from kernel-space headers again treewide: add "WITH Linux-syscall-note" to SPDX tag of uapi headers
| * | | | treewide: add "WITH Linux-syscall-note" to SPDX tag of uapi headersMasahiro Yamada2019-07-254-4/+4
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UAPI headers licensed under GPL are supposed to have exception "WITH Linux-syscall-note" so that they can be included into non-GPL user space application code. The exception note is missing in some UAPI headers. Some of them slipped in by the treewide conversion commit b24413180f56 ("License cleanup: add SPDX GPL-2.0 license identifier to files with no license"). Just run: $ git show --oneline b24413180f56 -- arch/x86/include/uapi/asm/ I believe they are not intentional, and should be fixed too. This patch was generated by the following script: git grep -l --not -e Linux-syscall-note --and -e SPDX-License-Identifier \ -- :arch/*/include/uapi/asm/*.h :include/uapi/ :^*/Kbuild | while read file do sed -i -e '/[[:space:]]OR[[:space:]]/s/\(GPL-[^[:space:]]*\)/(\1 WITH Linux-syscall-note)/g' \ -e '/[[:space:]]or[[:space:]]/s/\(GPL-[^[:space:]]*\)/(\1 WITH Linux-syscall-note)/g' \ -e '/[[:space:]]OR[[:space:]]/!{/[[:space:]]or[[:space:]]/!s/\(GPL-[^[:space:]]*\)/\1 WITH Linux-syscall-note/g}' $file done After this patch is applied, there are 5 UAPI headers that do not contain "WITH Linux-syscall-note". They are kept untouched since this exception applies only to GPL variants. $ git grep --not -e Linux-syscall-note --and -e SPDX-License-Identifier \ -- :arch/*/include/uapi/asm/*.h :include/uapi/ :^*/Kbuild include/uapi/drm/panfrost_drm.h:/* SPDX-License-Identifier: MIT */ include/uapi/linux/batman_adv.h:/* SPDX-License-Identifier: MIT */ include/uapi/linux/qemu_fw_cfg.h:/* SPDX-License-Identifier: BSD-3-Clause */ include/uapi/linux/vbox_err.h:/* SPDX-License-Identifier: MIT */ include/uapi/linux/virtio_iommu.h:/* SPDX-License-Identifier: BSD-3-Clause */ Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | | | Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2019-07-287-23/+75
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A set of x86 fixes and functional updates: - Prevent stale huge I/O TLB mappings on 32bit. A long standing bug which got exposed by KPTI support for 32bit - Prevent bogus access_ok() warnings in arch_stack_walk_user() - Add display quirks for Lenovo devices which have height and width swapped - Add the missing CR2 fixup for 32 bit async pagefaults. Fallout of the CR2 bug fix series. - Unbreak handling of force enabled HPET by moving the 'is HPET counting' check back to the original place. - A more accurate check for running on a hypervisor platform in the MDS mitigation code. Not perfect, but more accurate than the previous one. - Update a stale and confusing comment vs. IRQ stacks" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation/mds: Apply more accurate check on hypervisor platform x86/hpet: Undo the early counter is counting check x86/entry/32: Pass cr2 to do_async_page_fault() x86/irq/64: Update stale comment x86/sysfb_efi: Add quirks for some devices with swapped width and height x86/stacktrace: Prevent access_ok() warnings in arch_stack_walk_user() mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() x86/mm: Sync also unmappings in vmalloc_sync_all() x86/mm: Check for pfn instead of page in vmalloc_sync_one()
| * | | | x86/speculation/mds: Apply more accurate check on hypervisor platformZhenzhong Duan2019-07-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | X86_HYPER_NATIVE isn't accurate for checking if running on native platform, e.g. CONFIG_HYPERVISOR_GUEST isn't set or "nopv" is enabled. Checking the CPU feature bit X86_FEATURE_HYPERVISOR to determine if it's running on native platform is more accurate. This still doesn't cover the platforms on which X86_FEATURE_HYPERVISOR is unsupported, e.g. VMware, but there is nothing which can be done about this scenario. Fixes: 8a4b06d391b0 ("x86/speculation/mds: Add sysfs reporting for MDS") Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1564022349-17338-1-git-send-email-zhenzhong.duan@oracle.com
| * | | | x86/hpet: Undo the early counter is counting checkThomas Gleixner2019-07-251-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rui reported that on a Pentium D machine which has HPET forced enabled because it is not advertised by ACPI, the early counter is counting check leads to a silent boot hang. The reason is that the ordering of checking the counter first and then reconfiguring the HPET fails to work on that machine. As the HPET is not advertised and presumably not initialized by the BIOS the early enable and the following reconfiguration seems to bring it into a broken state. Adding clocksource=jiffies to the command line results in the following clocksource watchdog warning: clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc-early' as unstable because the skew is too large: clocksource: 'hpet' wd_now: 33 wd_last: 33 mask: ffffffff That clearly shows that the HPET is not counting after it got reconfigured and reenabled. If the counter is not working then the HPET timer is not expiring either, which explains the boot hang. Move the counter is counting check after the full configuration again to unbreak these systems. Reported-by: Rui Salvaterra <rsalvaterra@gmail.com> Fixes: 3222daf970f3 ("x86/hpet: Separate counter check out of clocksource register code") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Rui Salvaterra <rsalvaterra@gmail.com> Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907250810530.1791@nanos.tec.linutronix.de
| * | | | x86/entry/32: Pass cr2 to do_async_page_fault()Matt Mullins2019-07-241-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit a0d14b8909de ("x86/mm, tracing: Fix CR2 corruption") added the address parameter to do_async_page_fault(), but does not pass it from the 32-bit entry point. To plumb it through, factor-out common_exception_read_cr2 in the same fashion as common_exception, and uses it from both page_fault and async_page_fault. For a 32-bit KVM guest, this fixes: Run /sbin/init as init process Starting init: /sbin/init exists but couldn't execute it (error -14) Fixes: a0d14b8909de ("x86/mm, tracing: Fix CR2 corruption") Signed-off-by: Matt Mullins <mmullins@fb.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190724042058.24506-1-mmullins@fb.com
| * | | | x86/irq/64: Update stale commentCao jin2019-07-221-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit e6401c130931 ("x86/irq/64: Split the IRQ stack into its own pages") missed to update one piece of comment as it did to its peer in Xen, which will confuse people who still need to read comment. Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190719081635.26528-1-caoj.fnst@cn.fujitsu.com
| * | | | x86/sysfb_efi: Add quirks for some devices with swapped width and heightHans de Goede2019-07-221-0/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some Lenovo 2-in-1s with a detachable keyboard have a portrait screen but advertise a landscape resolution and pitch, resulting in a messed up display if the kernel tries to show anything on the efifb (because of the wrong pitch). Fix this by adding a new DMI match table for devices which need to have their width and height swapped. At first it was tried to use the existing table for overriding some of the efifb parameters, but some of the affected devices have variants with different LCD resolutions which will not work with hardcoded override values. Reference: https://bugzilla.redhat.com/show_bug.cgi?id=1730783 Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190721152418.11644-1-hdegoede@redhat.com
| * | | | x86/stacktrace: Prevent access_ok() warnings in arch_stack_walk_user()Eiichi Tsukata2019-07-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When arch_stack_walk_user() is called from atomic contexts, access_ok() can trigger the following warning if compiled with CONFIG_DEBUG_ATOMIC_SLEEP=y. Reproducer: // CONFIG_DEBUG_ATOMIC_SLEEP=y # cd /sys/kernel/debug/tracing # echo 1 > options/userstacktrace # echo 1 > events/irq/irq_handler_entry/enable WARNING: CPU: 0 PID: 2649 at arch/x86/kernel/stacktrace.c:103 arch_stack_walk_user+0x6e/0xf6 CPU: 0 PID: 2649 Comm: bash Not tainted 5.3.0-rc1+ #99 RIP: 0010:arch_stack_walk_user+0x6e/0xf6 Call Trace: <IRQ> stack_trace_save_user+0x10a/0x16d trace_buffer_unlock_commit_regs+0x185/0x240 trace_event_buffer_commit+0xec/0x330 trace_event_raw_event_irq_handler_entry+0x159/0x1e0 __handle_irq_event_percpu+0x22d/0x440 handle_irq_event_percpu+0x70/0x100 handle_irq_event+0x5a/0x8b handle_edge_irq+0x12f/0x3f0 handle_irq+0x34/0x40 do_IRQ+0xa6/0x1f0 common_interrupt+0xf/0xf </IRQ> Fix it by calling __range_not_ok() directly instead of access_ok() as copy_from_user_nmi() does. This is fine here because the actual copy is inside a pagefault disabled region. Reported-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Eiichi Tsukata <devel@etsukata.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190722083216.16192-2-devel@etsukata.com
| * | | | x86/mm: Sync also unmappings in vmalloc_sync_all()Joerg Roedel2019-07-221-8/+5Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With huge-page ioremap areas the unmappings also need to be synced between all page-tables. Otherwise it can cause data corruption when a region is unmapped and later re-used. Make the vmalloc_sync_one() function ready to sync unmappings and make sure vmalloc_sync_all() iterates over all page-tables even when an unmapped PMD is found. Fixes: 5d72b4fba40ef ('x86, mm: support huge I/O mapping capability I/F') Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20190719184652.11391-3-joro@8bytes.org
| * | | | x86/mm: Check for pfn instead of page in vmalloc_sync_one()Joerg Roedel2019-07-221-1/+1
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Do not require a struct page for the mapped memory location because it might not exist. This can happen when an ioremapped region is mapped with 2MB pages. Fixes: 5d72b4fba40ef ('x86, mm: support huge I/O mapping capability I/F') Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20190719184652.11391-2-joro@8bytes.org
* | | | Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2019-07-282-5/+4Star
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "A pile of perf related fixes: Kernel: - Fix SLOTS PEBS event constraints for Icelake CPUs - Add the missing mask bit to allow counting hardware generated prefetches on L3 for Icelake CPUs - Make the test for hypervisor platforms more accurate (as far as possible) - Handle PMUs correctly which override event->cpu - Yet another missing fallthrough annotation Tools: perf.data: - Fix loading of compressed data split across adjacent records - Fix buffer size setting for processing CPU topology perf.data header. perf stat: - Fix segfault for event group in repeat mode - Always separate "stalled cycles per insn" line, it was being appended to the "instructions" line. perf script: - Fix --max-blocks man page description. - Improve man page description of metrics. - Fix off by one in brstackinsn IPC computation. perf probe: - Avoid calling freeing routine multiple times for same pointer. perf build: - Do not use -Wshadow on gcc < 4.8, avoiding too strict warnings treated as errors, breaking the build" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Mark expected switch fall-throughs perf/core: Fix creating kernel counters for PMUs that override event->cpu perf/x86: Apply more accurate check on hypervisor platform perf/x86/intel: Fix invalid Bit 13 for Icelake MSR_OFFCORE_RSP_x register perf/x86/intel: Fix SLOTS PEBS event constraint perf build: Do not use -Wshadow on gcc < 4.8 perf probe: Avoid calling freeing routine multiple times for same pointer perf probe: Set pev->nargs to zero after freeing pev->args entries perf session: Fix loading of compressed data split across adjacent records perf stat: Always separate stalled cycles per insn perf stat: Fix segfault for event group in repeat mode perf tools: Fix proper buffer size for feature processing perf script: Fix off by one in brstackinsn IPC computation perf script: Improve man page description of metrics perf script: Fix --max-blocks man page description
| * | | | perf/x86/intel: Mark expected switch fall-throughsGustavo A. R. Silva2019-07-251-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warnings: arch/x86/events/intel/core.c: In function ‘intel_pmu_init’: arch/x86/events/intel/core.c:4959:8: warning: this statement may fall through [-Wimplicit-fallthrough=] arch/x86/events/intel/core.c:5008:8: warning: this statement may fall through [-Wimplicit-fallthrough=] Warning level 3 was used: -Wimplicit-fallthrough=3 This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190624161913.GA32270@embeddedor Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | perf/x86: Apply more accurate check on hypervisor platformZhenzhong Duan2019-07-251-2/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | check_msr is used to fix a bug report in guest where KVM doesn't support LBR MSR and cause #GP. The msr check is bypassed on real HW to workaround a false failure, see commit d0e1a507bdc7 ("perf/x86/intel: Disable check_msr for real HW") When running a guest with CONFIG_HYPERVISOR_GUEST not set or "nopv" enabled, current check isn't enough and #GP could trigger. Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1564022366-18293-1-git-send-email-zhenzhong.duan@oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | perf/x86/intel: Fix invalid Bit 13 for Icelake MSR_OFFCORE_RSP_x registerYunying Sun2019-07-251-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Intel SDM states that bit 13 of Icelake's MSR_OFFCORE_RSP_x register is valid, and used for counting hardware generated prefetches of L3 cache. Update the bitmask to allow bit 13. Before: $ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3 Performance counter stats for 'sleep 3': <not supported> cpu/event=0xb7,umask=0x1,config1=0x1bfff/u After: $ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3 Performance counter stats for 'sleep 3': 9,293 cpu/event=0xb7,umask=0x1,config1=0x1bfff/u Signed-off-by: Yunying Sun <yunying.sun@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: alexander.shishkin@linux.intel.com Cc: bp@alien8.de Cc: hpa@zytor.com Cc: jolsa@redhat.com Cc: namhyung@kernel.org Link: https://lkml.kernel.org/r/20190724082932.12833-1-yunying.sun@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | perf/x86/intel: Fix SLOTS PEBS event constraintKan Liang2019-07-251-1/+1
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sampling SLOTS event and ref-cycles event in a group on Icelake gives EINVAL. SLOTS event is the event stands for the fixed counter 3, not fixed counter 2. Wrong mask was set to SLOTS event in intel_icl_pebs_event_constraints[]. Reported-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 6017608936c1 ("perf/x86/intel: Add Icelake support") Link: https://lkml.kernel.org/r/20190723200429.8180-1-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | Merge tag 'Wimplicit-fallthrough-5.3-rc2' of ↵Linus Torvalds2019-07-271-0/+2
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull Wimplicit-fallthrough enablement from Gustavo A. R. Silva: "This marks switch cases where we are expecting to fall through, and globally enables the -Wimplicit-fallthrough option in the main Makefile. Finally, some missing-break fixes that have been tagged for -stable: - drm/amdkfd: Fix missing break in switch statement - drm/amdgpu/gfx10: Fix missing break in switch statement With these changes, we completely get rid of all the fall-through warnings in the kernel" * tag 'Wimplicit-fallthrough-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: Makefile: Globally enable fall-through warning drm/i915: Mark expected switch fall-throughs drm/amd/display: Mark expected switch fall-throughs drm/amdkfd/kfd_mqd_manager_v10: Avoid fall-through warning drm/amdgpu/gfx10: Fix missing break in switch statement drm/amdkfd: Fix missing break in switch statement perf/x86/intel: Mark expected switch fall-throughs mtd: onenand_base: Mark expected switch fall-through afs: fsclient: Mark expected switch fall-throughs afs: yfsclient: Mark expected switch fall-throughs can: mark expected switch fall-throughs firewire: mark expected switch fall-throughs
| * | | | perf/x86/intel: Mark expected switch fall-throughsGustavo A. R. Silva2019-07-261-0/+2
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warnings: arch/x86/events/intel/core.c: In function ‘intel_pmu_init’: arch/x86/events/intel/core.c:4959:8: warning: this statement may fall through [-Wimplicit-fallthrough=] pmem = true; ~~~~~^~~~~~ arch/x86/events/intel/core.c:4960:2: note: here case INTEL_FAM6_SKYLAKE_MOBILE: ^~~~ arch/x86/events/intel/core.c:5008:8: warning: this statement may fall through [-Wimplicit-fallthrough=] pmem = true; ~~~~~^~~~~~ arch/x86/events/intel/core.c:5009:2: note: here case INTEL_FAM6_ICELAKE_MOBILE: ^~~~ Warning level 3 was used: -Wimplicit-fallthrough=3 This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
* | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2019-07-246-12/+43
|\ \ \ \ | |/ / / |/| | / | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM fixes from Paolo Bonzini: "Bugfixes, a pvspinlock optimization, and documentation moving" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: X86: Boost queue head vCPU to mitigate lock waiter preemption Documentation: move Documentation/virtual to Documentation/virt KVM: nVMX: Set cached_vmcs12 and cached_shadow_vmcs12 NULL after free KVM: X86: Dynamically allocate user_fpu KVM: X86: Fix fpu state crash in kvm guest Revert "kvm: x86: Use task structs fpu field for user" KVM: nVMX: Clear pending KVM_REQ_GET_VMCS12_PAGES when leaving nested
| * | KVM: X86: Boost queue head vCPU to mitigate lock waiter preemptionWanpeng Li2019-07-241-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 11752adb (locking/pvqspinlock: Implement hybrid PV queued/unfair locks) introduces hybrid PV queued/unfair locks - queued mode (no starvation) - unfair mode (good performance on not heavily contended lock) The lock waiter goes into the unfair mode especially in VMs with over-commit vCPUs since increaing over-commitment increase the likehood that the queue head vCPU may have been preempted and not actively spinning. However, reschedule queue head vCPU timely to acquire the lock still can get better performance than just depending on lock stealing in over-subscribe scenario. Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM: ebizzy -M vanilla boosting improved 1VM 23520 25040 6% 2VM 8000 13600 70% 3VM 3100 5400 74% The lock holder vCPU yields to the queue head vCPU when unlock, to boost queue head vCPU which is involuntary preemption or the one which is voluntary halt due to fail to acquire the lock after a short spin in the guest. Cc: Waiman Long <longman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | Documentation: move Documentation/virtual to Documentation/virtChristoph Hellwig2019-07-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Renaming docs seems to be en vogue at the moment, so fix on of the grossly misnamed directories. We usually never use "virtual" as a shortcut for virtualization in the kernel, but always virt, as seen in the virt/ top-level directory. Fix up the documentation to match that. Fixes: ed16648eb5b8 ("Move kvm, uml, and lguest subdirectories under a common "virtual" directory, I.E:") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: nVMX: Set cached_vmcs12 and cached_shadow_vmcs12 NULL after freeJan Kiszka2019-07-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | Shall help finding use-after-free bugs earlier. Suggested-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: X86: Dynamically allocate user_fpuWanpeng Li2019-07-224-5/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After reverting commit 240c35a3783a (kvm: x86: Use task structs fpu field for user), struct kvm_vcpu is 19456 bytes on my server, PAGE_ALLOC_COSTLY_ORDER(3) is the order at which allocations are deemed costly to service. In serveless scenario, one host can service hundreds/thoudands firecracker/kata-container instances, howerver, new instance will fail to launch after memory is too fragmented to allocate kvm_vcpu struct on host, this was observed in some cloud provider product environments. This patch dynamically allocates user_fpu, kvm_vcpu is 15168 bytes now on my Skylake server. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: X86: Fix fpu state crash in kvm guestWanpeng Li2019-07-221-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The idea before commit 240c35a37 (which has just been reverted) was that we have the following FPU states: userspace (QEMU) guest --------------------------------------------------------------------------- processor vcpu->arch.guest_fpu >>> KVM_RUN: kvm_load_guest_fpu vcpu->arch.user_fpu processor >>> preempt out vcpu->arch.user_fpu current->thread.fpu >>> preempt in vcpu->arch.user_fpu processor >>> back to userspace >>> kvm_put_guest_fpu processor vcpu->arch.guest_fpu --------------------------------------------------------------------------- With the new lazy model we want to get the state back to the processor when schedule in from current->thread.fpu. Reported-by: Thomas Lambertz <mail@thomaslambertz.de> Reported-by: anthony <antdev66@gmail.com> Tested-by: anthony <antdev66@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Lambertz <mail@thomaslambertz.de> Cc: anthony <antdev66@gmail.com> Cc: stable@vger.kernel.org Fixes: 5f409e20b (x86/fpu: Defer FPU state load until return to userspace) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> [Add a comment in front of the warning. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | Revert "kvm: x86: Use task structs fpu field for user"Paolo Bonzini2019-07-222-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 240c35a3783ab9b3a0afaba0dde7291295680a6b ("kvm: x86: Use task structs fpu field for user", 2018-11-06). The commit is broken and causes QEMU's FPU state to be destroyed when KVM_RUN is preempted. Fixes: 240c35a3783a ("kvm: x86: Use task structs fpu field for user") Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: nVMX: Clear pending KVM_REQ_GET_VMCS12_PAGES when leaving nestedJan Kiszka2019-07-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Letting this pend may cause nested_get_vmcs12_pages to run against an invalid state, corrupting the effective vmcs of L1. This was triggerable in QEMU after a guest corruption in L2, followed by a L1 reset. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Cc: stable@vger.kernel.org Fixes: 7f7f1ba33cf2 ("KVM: x86: do not load vmcs12 pages while still in SMM") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | Merge tag 'dma-mapping-5.3-1' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds2019-07-202-0/+31
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull dma-mapping fixes from Christoph Hellwig: "Fix various regressions: - force unencrypted dma-coherent buffers if encryption bit can't fit into the dma coherent mask (Tom Lendacky) - avoid limiting request size if swiotlb is not used (me) - fix swiotlb handling in dma_direct_sync_sg_for_cpu/device (Fugang Duan)" * tag 'dma-mapping-5.3-1' of git://git.infradead.org/users/hch/dma-mapping: dma-direct: correct the physical addr in dma_direct_sync_sg_for_cpu/device dma-direct: only limit the mapping size if swiotlb could be used dma-mapping: add a dma_addressing_limited helper dma-direct: Force unencrypted DMA under SME for certain DMA masks
| * | | dma-direct: Force unencrypted DMA under SME for certain DMA masksTom Lendacky2019-07-162-0/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a device doesn't support DMA to a physical address that includes the encryption bit (currently bit 47, so 48-bit DMA), then the DMA must occur to unencrypted memory. SWIOTLB is used to satisfy that requirement if an IOMMU is not active (enabled or configured in passthrough mode). However, commit fafadcd16595 ("swiotlb: don't dip into swiotlb pool for coherent allocations") modified the coherent allocation support in SWIOTLB to use the DMA direct coherent allocation support. When an IOMMU is not active, this resulted in dma_alloc_coherent() failing for devices that didn't support DMA addresses that included the encryption bit. Addressing this requires changes to the force_dma_unencrypted() function in kernel/dma/direct.c. Since the function is now non-trivial and SME/SEV specific, update the DMA direct support to add an arch override for the force_dma_unencrypted() function. The arch override is selected when CONFIG_AMD_MEM_ENCRYPT is set. The arch override function resides in the arch/x86/mm/mem_encrypt.c file and forces unencrypted DMA when either SEV is active or SME is active and the device does not support DMA to physical addresses that include the encryption bit. Fixes: fafadcd16595 ("swiotlb: don't dip into swiotlb pool for coherent allocations") Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> [hch: moved the force_dma_unencrypted declaration to dma-mapping.h, fold the s390 fix from Halil Pasic] Signed-off-by: Christoph Hellwig <hch@lst.de>
* | | | Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2019-07-2031-195/+218
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A set of x86 specific fixes and updates: - The CR2 corruption fixes which store CR2 early in the entry code and hand the stored address to the fault handlers. - Revert a forgotten leftover of the dropped FSGSBASE series. - Plug a memory leak in the boot code. - Make the Hyper-V assist functionality robust by zeroing the shadow page. - Remove a useless check for dead processes with LDT - Update paravirt and VMware maintainers entries. - A few cleanup patches addressing various compiler warnings" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry/64: Prevent clobbering of saved CR2 value x86/hyper-v: Zero out the VP ASSIST PAGE on allocation x86, boot: Remove multiple copy of static function sanitize_boot_params() x86/boot/compressed/64: Remove unused variable x86/boot/efi: Remove unused variables x86/mm, tracing: Fix CR2 corruption x86/entry/64: Update comments and sanity tests for create_gap x86/entry/64: Simplify idtentry a little x86/entry/32: Simplify common_exception x86/paravirt: Make read_cr2() CALLEE_SAVE MAINTAINERS: Update PARAVIRT_OPS_INTERFACE and VMWARE_HYPERVISOR_INTERFACE x86/process: Delete useless check for dead process with LDT x86: math-emu: Hide clang warnings for 16-bit overflow x86/e820: Use proper booleans instead of 0/1 x86/apic: Silence -Wtype-limits compiler warnings x86/mm: Free sme_early_buffer after init x86/boot: Fix memory leak in default_get_smp_config() Revert "x86/ptrace: Prevent ptrace from clearing the FS/GS selector" and fix the test
| * | | | x86/entry/64: Prevent clobbering of saved CR2 valueThomas Gleixner2019-07-201-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recent fix for CR2 corruption introduced a new way to reliably corrupt the saved CR2 value. CR2 is saved early in the entry code in RDX, which is the third argument to the fault handling functions. But it missed that between saving and invoking the fault handler enter_from_user_mode() can be called. RDX is a caller saved register so the invoked function can freely clobber it with the obvious consequences. The TRACE_IRQS_OFF call is safe as it calls through the thunk which preserves RDX, but TRACE_IRQS_OFF_DEBUG is not because it also calls into C-code outside of the thunk. Store CR2 in R12 instead which is a callee saved register and move R12 to RDX just before calling the fault handler. Fixes: a0d14b8909de ("x86/mm, tracing: Fix CR2 corruption") Reported-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907201020540.1782@nanos.tec.linutronix.de