summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* panic: avoid deadlocks in re-entrant console driversSergey Senozhatsky2018-12-291-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c7c3f05e341a9a2bd1a92993d4f996cfd6e7348e upstream. From printk()/serial console point of view panic() is special, because it may force CPU to re-enter printk() or/and serial console driver. Therefore, some of serial consoles drivers are re-entrant. E.g. 8250: serial8250_console_write() { if (port->sysrq) locked = 0; else if (oops_in_progress) locked = spin_trylock_irqsave(&port->lock, flags); else spin_lock_irqsave(&port->lock, flags); ... } panic() does set oops_in_progress via bust_spinlocks(1), so in theory we should be able to re-enter serial console driver from panic(): CPU0 <NMI> uart_console_write() serial8250_console_write() // if (oops_in_progress) // spin_trylock_irqsave() call_console_drivers() console_unlock() console_flush_on_panic() bust_spinlocks(1) // oops_in_progress++ panic() <NMI/> spin_lock_irqsave(&port->lock, flags) // spin_lock_irqsave() serial8250_console_write() call_console_drivers() console_unlock() printk() ... However, this does not happen and we deadlock in serial console on port->lock spinlock. And the problem is that console_flush_on_panic() called after bust_spinlocks(0): void panic(const char *fmt, ...) { bust_spinlocks(1); ... bust_spinlocks(0); console_flush_on_panic(); ... } bust_spinlocks(0) decrements oops_in_progress, so oops_in_progress can go back to zero. Thus even re-entrant console drivers will simply spin on port->lock spinlock. Given that port->lock may already be locked either by a stopped CPU, or by the very same CPU we execute panic() on (for instance, NMI panic() on printing CPU) the system deadlocks and does not reboot. Fix this by removing bust_spinlocks(0), so oops_in_progress is always set in panic() now and, thus, re-entrant console drivers will trylock the port->lock instead of spinning on it forever, when we call them from console_flush_on_panic(). Link: http://lkml.kernel.org/r/20181025101036.6823-1-sergey.senozhatsky@gmail.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Daniel Wang <wonderfly@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Cc: Jiri Slaby <jslaby@suse.com> Cc: Peter Feiner <pfeiner@google.com> Cc: linux-serial@vger.kernel.org Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Cure exit raceThomas Gleixner2018-12-291-6/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit da791a667536bf8322042e38ca85d55a78d3c273 upstream. Stefan reported, that the glibc tst-robustpi4 test case fails occasionally. That case creates the following race between sys_exit() and sys_futex_lock_pi(): CPU0 CPU1 sys_exit() sys_futex() do_exit() futex_lock_pi() exit_signals(tsk) No waiters: tsk->flags |= PF_EXITING; *uaddr == 0x00000PID mm_release(tsk) Set waiter bit exit_robust_list(tsk) { *uaddr = 0x80000PID; Set owner died attach_to_pi_owner() { *uaddr = 0xC0000000; tsk = get_task(PID); } if (!tsk->flags & PF_EXITING) { ... attach(); tsk->flags |= PF_EXITPIDONE; } else { if (!(tsk->flags & PF_EXITPIDONE)) return -EAGAIN; return -ESRCH; <--- FAIL } ESRCH is returned all the way to user space, which triggers the glibc test case assert. Returning ESRCH unconditionally is wrong here because the user space value has been changed by the exiting task to 0xC0000000, i.e. the FUTEX_OWNER_DIED bit is set and the futex PID value has been cleared. This is a valid state and the kernel has to handle it, i.e. taking the futex. Cure it by rereading the user space value when PF_EXITING and PF_EXITPIDONE is set in the task which 'owns' the futex. If the value has changed, let the kernel retry the operation, which includes all regular sanity checks and correctly handles the FUTEX_OWNER_DIED case. If it hasn't changed, then return ESRCH as there is no way to distinguish this case from malfunctioning user space. This happens when the exiting task did not have a robust list, the robust list was corrupted or the user space value in the futex was simply bogus. Reported-by: Stefan Liebler <stli@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Darren Hart <dvhart@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Sasha Levin <sashal@kernel.org> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=200467 Link: https://lkml.kernel.org/r/20181210152311.986181245@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* posix-timers: Fix division by zero bugThomas Gleixner2018-12-291-4/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 0e334db6bb4b1fd1e2d72c1f3d8f004313cd9f94 upstream. The signal delivery path of posix-timers can try to rearm the timer even if the interval is zero. That's handled for the common case (hrtimer) but not for alarm timers. In that case the forwarding function raises a division by zero exception. The handling for hrtimer based posix timers is wrong because it marks the timer as active despite the fact that it is stopped. Move the check from common_hrtimer_rearm() to posixtimer_rearm() to cure both issues. Reported-by: syzbot+9d38bedac9cc77b8ad5e@syzkaller.appspotmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: sboyd@kernel.org Cc: stable@vger.kernel.org Cc: syzkaller-bugs@googlegroups.com Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1812171328050.1880@nanos.tec.linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bpf: check pending signals while verifying programsAlexei Starovoitov2018-12-211-0/+3
| | | | | | | | | | | | | | | | | [ Upstream commit c3494801cd1785e2c25f1a5735fa19ddcf9665da ] Malicious user space may try to force the verifier to use as much cpu time and memory as possible. Hence check for pending signals while verifying the program. Note that suspend of sys_bpf(PROG_LOAD) syscall will lead to EAGAIN, since the kernel has to release the resources used for program verification. Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* locking/qspinlock, x86: Provide liveness guaranteePeter Zijlstra2018-12-211-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7aa54be2976550f17c11a1c3e3630002dea39303 upstream. On x86 we cannot do fetch_or() with a single instruction and thus end up using a cmpxchg loop, this reduces determinism. Replace the fetch_or() with a composite operation: tas-pending + load. Using two instructions of course opens a window we previously did not have. Consider the scenario: CPU0 CPU1 CPU2 1) lock trylock -> (0,0,1) 2) lock trylock /* fail */ 3) unlock -> (0,0,0) 4) lock trylock -> (0,0,1) 5) tas-pending -> (0,1,1) load-val <- (0,1,0) from 3 6) clear-pending-set-locked -> (0,0,1) FAIL: _2_ owners where 5) is our new composite operation. When we consider each part of the qspinlock state as a separate variable (as we can when _Q_PENDING_BITS == 8) then the above is entirely possible, because tas-pending will only RmW the pending byte, so the later load is able to observe prior tail and lock state (but not earlier than its own trylock, which operates on the whole word, due to coherence). To avoid this we need 2 things: - the load must come after the tas-pending (obviously, otherwise it can trivially observe prior state). - the tas-pending must be a full word RmW instruction, it cannot be an XCHGB for example, such that we cannot observe other state prior to setting pending. On x86 we can realize this by using "LOCK BTS m32, r32" for tas-pending followed by a regular load. Note that observing later state is not a problem: - if we fail to observe a later unlock, we'll simply spin-wait for that store to become visible. - if we observe a later xchg_tail(), there is no difference from that xchg_tail() having taken place before the tas-pending. Suggested-by: Will Deacon <will.deacon@arm.com> Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: andrea.parri@amarulasolutions.com Cc: longman@redhat.com Fixes: 59fb586b4a07 ("locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath") Link: https://lkml.kernel.org/r/20181003130957.183726335@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> [bigeasy: GEN_BINARY_RMWcc macro redo] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
* locking/qspinlock: Re-order codePeter Zijlstra2018-12-211-29/+27Star
| | | | | | | | | | | | | | | | | | | | commit 53bf57fab7321fb42b703056a4c80fc9d986d170 upstream. Flip the branch condition after atomic_fetch_or_acquire(_Q_PENDING_VAL) such that we loose the indent. This also result in a more natural code flow IMO. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: andrea.parri@amarulasolutions.com Cc: longman@redhat.com Link: https://lkml.kernel.org/r/20181003130257.156322446@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
* tracing: Fix memory leak of instance function hash filtersSteven Rostedt (VMware)2018-12-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | commit 2840f84f74035e5a535959d5f17269c69fa6edc5 upstream. The following commands will cause a memory leak: # cd /sys/kernel/tracing # mkdir instances/foo # echo schedule > instance/foo/set_ftrace_filter # rmdir instances/foo The reason is that the hashes that hold the filters to set_ftrace_filter and set_ftrace_notrace are not freed if they contain any data on the instance and the instance is removed. Found by kmemleak detector. Cc: stable@vger.kernel.org Fixes: 591dffdade9f ("ftrace: Allow for function tracing instance to filter functions") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix memory leak in set_trigger_filter()Steven Rostedt (VMware)2018-12-191-2/+4
| | | | | | | | | | | | | | | | | | | | | | commit 3cec638b3d793b7cacdec5b8072364b41caeb0e1 upstream. When create_event_filter() fails in set_trigger_filter(), the filter may still be allocated and needs to be freed. The caller expects the data->filter to be updated with the new filter, even if the new filter failed (we could add an error message by setting set_str parameter of create_event_filter(), but that's another update). But because the error would just exit, filter was left hanging and nothing could free it. Found by kmemleak detector. Cc: stable@vger.kernel.org Fixes: bac5fb97a173a ("tracing: Add and use generic set_trigger_filter() implementation") Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix memory leak in create_filter()Steven Rostedt (VMware)2018-12-191-1/+4
| | | | | | | | | | | | | | | | | | | commit b61c19209c2c35ea2a2fe502d484703686eba98c upstream. The create_filter() calls create_filter_start() which allocates a "parse_error" descriptor, but fails to call create_filter_finish() that frees it. The op_stack and inverts in predicate_parse() were also not freed. Found by kmemleak detector. Cc: stable@vger.kernel.org Fixes: 80765597bc587 ("tracing: Rewrite filter logic to be simpler and faster") Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched/pelt: Fix warning and clean up IRQ PELT configVincent Guittot2018-12-195-10/+8Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 11d4afd4ff667f9b6178ee8c142c36cb78bd84db upstream. Create a config for enabling irq load tracking in the scheduler. irq load tracking is useful only when irq or paravirtual time is accounted but it's only possible with SMP for now. Also use __maybe_unused to remove the compilation warning in update_rq_clock_task() that has been introduced by: 2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()") Suggested-by: Ingo Molnar <mingo@redhat.com> Reported-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Reported-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@alien8.de Cc: dou_liyang@163.com Fixes: 2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()") Link: http://lkml.kernel.org/r/1537867062-27285-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* bpf: fix off-by-one error in adjust_subprog_startsEdward Cree2018-12-171-1/+1
| | | | | | | | | | | | | | | | | commit afd594240806acc138cf696c09f2f4829d55d02f upstream. When patching in a new sequence for the first insn of a subprog, the start of that subprog does not change (it's the first insn of the sequence), so adjust_subprog_starts should check start <= off (rather than < off). Also added a test to test_verifier.c (it's essentially the syz reproducer). Fixes: cc8b0b92a169 ("bpf: introduce function calls (function boundaries)") Reported-by: syzbot+4fc427c7af994b0948be@syzkaller.appspotmail.com Signed-off-by: Edward Cree <ecree@solarflare.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notraceAnders Roxell2018-12-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 903e8ff86753e6f327bb92166a0665e4ecb8e2e7 ] Since __sanitizer_cov_trace_pc() is marked as notrace, function calls in __sanitizer_cov_trace_pc() shouldn't be traced either. ftrace_graph_caller() gets called for each function that isn't marked 'notrace', like canonicalize_ip(). This is the call trace from a run: [ 139.644550] ftrace_graph_caller+0x1c/0x24 [ 139.648352] canonicalize_ip+0x18/0x28 [ 139.652313] __sanitizer_cov_trace_pc+0x14/0x58 [ 139.656184] sched_clock+0x34/0x1e8 [ 139.659759] trace_clock_local+0x40/0x88 [ 139.663722] ftrace_push_return_trace+0x8c/0x1f0 [ 139.667767] prepare_ftrace_return+0xa8/0x100 [ 139.671709] ftrace_graph_caller+0x1c/0x24 Rework so that check_kcov_mode() and canonicalize_ip() that are called from __sanitizer_cov_trace_pc() are also marked as notrace. Link: http://lkml.kernel.org/r/20181128081239.18317-1-anders.roxell@linaro.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signen-off-by: Anders Roxell <anders.roxell@linaro.org> Co-developed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* bpf: fix check of allowed specifiers in bpf_trace_printkMartynas Pumputis2018-12-171-3/+5
| | | | | | | | | | | | | | [ Upstream commit 1efb6ee3edea57f57f9fb05dba8dcb3f7333f61f ] A format string consisting of "%p" or "%s" followed by an invalid specifier (e.g. "%p%\n" or "%s%") could pass the check which would make format_decode (lib/vsprintf.c) to warn. Fixes: 9c959c863f82 ("tracing: Allow BPF programs to call bpf_trace_printk()") Reported-by: syzbot+1ec5c5ec949c4adaa0c4@syzkaller.appspotmail.com Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* bpf: allocate local storage buffers using GFP_ATOMICRoman Gushchin2018-12-171-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 569a933b03f3c48b392fe67c0086b3a6b9306b5a ] Naresh reported an issue with the non-atomic memory allocation of cgroup local storage buffers: [ 73.047526] BUG: sleeping function called from invalid context at /srv/oe/build/tmp-rpb-glibc/work-shared/intel-corei7-64/kernel-source/mm/slab.h:421 [ 73.060915] in_atomic(): 1, irqs_disabled(): 0, pid: 3157, name: test_cgroup_sto [ 73.068342] INFO: lockdep is turned off. [ 73.072293] CPU: 2 PID: 3157 Comm: test_cgroup_sto Not tainted 4.20.0-rc2-next-20181113 #1 [ 73.080548] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.0b 07/27/2017 [ 73.088018] Call Trace: [ 73.090463] dump_stack+0x70/0xa5 [ 73.093783] ___might_sleep+0x152/0x240 [ 73.097619] __might_sleep+0x4a/0x80 [ 73.101191] __kmalloc_node+0x1cf/0x2f0 [ 73.105031] ? cgroup_storage_update_elem+0x46/0x90 [ 73.109909] cgroup_storage_update_elem+0x46/0x90 cgroup_storage_update_elem() (as well as other update map update callbacks) is called with disabled preemption, so GFP_ATOMIC allocation should be used: e.g. alloc_htab_elem() in hashtab.c. Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* tracing/fgraph: Fix set_graph_function from showing interruptsSteven Rostedt (VMware)2018-12-084-3/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5cf99a0f3161bc3ae2391269d134d6bf7e26f00e upstream. The tracefs file set_graph_function is used to only function graph functions that are listed in that file (or all functions if the file is empty). The way this is implemented is that the function graph tracer looks at every function, and if the current depth is zero and the function matches something in the file then it will trace that function. When other functions are called, the depth will be greater than zero (because the original function will be at depth zero), and all functions will be traced where the depth is greater than zero. The issue is that when a function is first entered, and the handler that checks this logic is called, the depth is set to zero. If an interrupt comes in and a function in the interrupt handler is traced, its depth will be greater than zero and it will automatically be traced, even if the original function was not. But because the logic only looks at depth it may trace interrupts when it should not be. The recent design change of the function graph tracer to fix other bugs caused the depth to be zero while the function graph callback handler is being called for a longer time, widening the race of this happening. This bug was actually there for a longer time, but because the race window was so small it seldom happened. The Fixes tag below is for the commit that widen the race window, because that commit belongs to a series that will also help fix the original bug. Cc: stable@kernel.org Fixes: 39eb456dacb5 ("function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack") Reported-by: Joe Lawrence <joe.lawrence@redhat.com> Tested-by: Joe Lawrence <joe.lawrence@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* uprobes: Fix handle_swbp() vs. unregister() + register() race once moreAndrea Parri2018-12-081-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 09d3f015d1e1b4fee7e9bbdcf54201d239393391 upstream. Commit: 142b18ddc8143 ("uprobes: Fix handle_swbp() vs unregister() + register() race") added the UPROBE_COPY_INSN flag, and corresponding smp_wmb() and smp_rmb() memory barriers, to ensure that handle_swbp() uses fully-initialized uprobes only. However, the smp_rmb() is mis-placed: this barrier should be placed after handle_swbp() has tested for the flag, thus guaranteeing that (program-order) subsequent loads from the uprobe can see the initial stores performed by prepare_uprobe(). Move the smp_rmb() accordingly. Also amend the comments associated to the two memory barriers to indicate their actual locations. Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: stable@kernel.org Fixes: 142b18ddc8143 ("uprobes: Fix handle_swbp() vs unregister() + register() race") Link: http://lkml.kernel.org/r/20181122161031.15179-1-andrea.parri@amarulasolutions.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* function_graph: Reverse the order of pushing the ret_stack and the callbackSteven Rostedt (VMware)2018-12-051-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | commit 7c6ea35ef50810aa12ab26f21cb858d980881576 upstream. The function graph profiler uses the ret_stack to store the "subtime" and reuse it by nested functions and also on the return. But the current logic has the profiler callback called before the ret_stack is updated, and it is just modifying the ret_stack that will later be allocated (it's just lucky that the "subtime" is not touched when it is allocated). This could also cause a crash if we are at the end of the ret_stack when this happens. By reversing the order of the allocating the ret_stack and then calling the callbacks attached to a function being traced, the ret_stack entry is no longer used before it is allocated. Cc: stable@kernel.org Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback") Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* function_graph: Move return callback before update of curr_ret_stackSteven Rostedt (VMware)2018-12-051-9/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 552701dd0fa7c3d448142e87210590ba424694a0 upstream. In the past, curr_ret_stack had two functions. One was to denote the depth of the call graph, the other is to keep track of where on the ret_stack the data is used. Although they may be slightly related, there are two cases where they need to be used differently. The one case is that it keeps the ret_stack data from being corrupted by an interrupt coming in and overwriting the data still in use. The other is just to know where the depth of the stack currently is. The function profiler uses the ret_stack to save a "subtime" variable that is part of the data on the ret_stack. If curr_ret_stack is modified too early, then this variable can be corrupted. The "max_depth" option, when set to 1, will record the first functions going into the kernel. To see all top functions (when dealing with timings), the depth variable needs to be lowered before calling the return hook. But by lowering the curr_ret_stack, it makes the data on the ret_stack still being used by the return hook susceptible to being overwritten. Now that there's two variables to handle both cases (curr_ret_depth), we can move them to the locations where they can handle both cases. Cc: stable@kernel.org Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback") Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* function_graph: Have profiler use curr_ret_stack and not depthSteven Rostedt (VMware)2018-12-051-2/+2
| | | | | | | | | | | | | | | | | | | | commit b1b35f2e218a5b57d03bbc3b0667d5064570dc60 upstream. The profiler uses trace->depth to find its entry on the ret_stack, but the depth may not match the actual location of where its entry is (if an interrupt were to preempt the processing of the profiler for another function, the depth and the curr_ret_stack will be different). Have it use the curr_ret_stack as the index to find its ret_stack entry instead of using the depth variable, as that is no longer guaranteed to be the same. Cc: stable@kernel.org Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback") Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stackSteven Rostedt (VMware)2018-12-052-8/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 39eb456dacb543de90d3bc6a8e0ac5cf51ac475e upstream. Currently, the depth of the ret_stack is determined by curr_ret_stack index. The issue is that there's a race between setting of the curr_ret_stack and calling of the callback attached to the return of the function. Commit 03274a3ffb44 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback") moved the calling of the callback to after the setting of the curr_ret_stack, even stating that it was safe to do so, when in fact, it was the reason there was a barrier() there (yes, I should have commented that barrier()). Not only does the curr_ret_stack keep track of the current call graph depth, it also keeps the ret_stack content from being overwritten by new data. The function profiler, uses the "subtime" variable of ret_stack structure and by moving the curr_ret_stack, it allows for interrupts to use the same structure it was using, corrupting the data, and breaking the profiler. To fix this, there needs to be two variables to handle the call stack depth and the pointer to where the ret_stack is being used, as they need to change at two different locations. Cc: stable@kernel.org Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback") Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* function_graph: Make ftrace_push_return_trace() staticSteven Rostedt (VMware)2018-12-051-1/+1
| | | | | | | | | | | | | | | | | commit d125f3f866df88da5a85df00291f88f0baa89f7c upstream. As all architectures now call function_graph_enter() to do the entry work, no architecture should ever call ftrace_push_return_trace(). Make it static. This is needed to prepare for a fix of a design bug on how the curr_ret_stack is used. Cc: stable@kernel.org Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback") Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* function_graph: Create function_graph_enter() to consolidate architecture codeSteven Rostedt (VMware)2018-12-051-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | commit 8114865ff82e200b383e46821c25cb0625b842b5 upstream. Currently all the architectures do basically the same thing in preparing the function graph tracer on entry to a function. This code can be pulled into a generic location and then this will allow the function graph tracer to be fixed, as well as extended. Create a new function graph helper function_graph_enter() that will call the hook function (ftrace_graph_entry) and the shadow stack operation (ftrace_push_return_trace), and remove the need of the architecture code to manage the shadow stack. This is needed to prepare for a fix of a design bug on how the curr_ret_stack is used. Cc: stable@kernel.org Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback") Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ptrace: Remove unused ptrace_may_access_sched() and MODE_IBRSThomas Gleixner2018-12-051-10/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 46f7ecb1e7359f183f5bbd1e08b90e10e52164f9 upstream The IBPB control code in x86 removed the usage. Remove the functionality which was introduced for this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185005.559149393@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/speculation: Rework SMT state changeThomas Gleixner2018-12-051-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a74cfffb03b73d41e08f84c2e5c87dec0ce3db9f upstream arch_smt_update() is only called when the sysfs SMT control knob is changed. This means that when SMT is enabled in the sysfs control knob the system is considered to have SMT active even if all siblings are offline. To allow finegrained control of the speculation mitigations, the actual SMT state is more interesting than the fact that siblings could be enabled. Rework the code, so arch_smt_update() is invoked from each individual CPU hotplug function, and simplify the update function while at it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185004.521974984@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched/smt: Expose sched_smt_present static keyThomas Gleixner2018-12-051-3/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 321a874a7ef85655e93b3206d0f36b4a6097f948 upstream Make the scheduler's 'sched_smt_present' static key globaly available, so it can be used in the x86 speculation control code. Provide a query function and a stub for the CONFIG_SMP=n case. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched/smt: Make sched_smt_present track topologyPeter Zijlstra (Intel)2018-12-051-8/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c5511d03ec090980732e929c318a7a6374b5550e upstream Currently the 'sched_smt_present' static key is enabled when at CPU bringup SMT topology is observed, but it is never disabled. However there is demand to also disable the key when the topology changes such that there is no SMT present anymore. Implement this by making the key count the number of cores that have SMT enabled. In particular, the SMT topology bits are set before interrrupts are enabled and similarly, are cleared after interrupts are disabled for the last time and the CPU dies. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185004.246110444@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/speculation: Apply IBPB more strictly to avoid cross-process data leakJiri Kosina2018-12-051-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit dbfe2953f63c640463c630746cd5d9de8b2f63ae upstream Currently, IBPB is only issued in cases when switching into a non-dumpable process, the rationale being to protect such 'important and security sensitive' processess (such as GPG) from data leaking into a different userspace process via spectre v2. This is however completely insufficient to provide proper userspace-to-userpace spectrev2 protection, as any process can poison branch buffers before being scheduled out, and the newly scheduled process immediately becomes spectrev2 victim. In order to minimize the performance impact (for usecases that do require spectrev2 protection), issue the barrier only in cases when switching between processess where the victim can't be ptraced by the potential attacker (as in such cases, the attacker doesn't have to bother with branch buffers at all). [ tglx: Split up PTRACE_MODE_NOACCESS_CHK into PTRACE_MODE_SCHED and PTRACE_MODE_IBPB to be able to do ptrace() context tracking reasonably fine-grained ] Fixes: 18bf3c3ea8 ("x86/speculation: Use Indirect Branch Prediction Barrier in context switch") Originally-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "WoodhouseDavid" <dwmw@amazon.co.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: "SchauflerCasey" <casey.schaufler@intel.com> Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251437340.15880@cbobk.fhfr.pm Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/speculation: Enable cross-hyperthread spectre v2 STIBP mitigationJiri Kosina2018-12-051-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 53c613fe6349994f023245519265999eed75957f upstream STIBP is a feature provided by certain Intel ucodes / CPUs. This feature (once enabled) prevents cross-hyperthread control of decisions made by indirect branch predictors. Enable this feature if - the CPU is vulnerable to spectre v2 - the CPU supports SMT and has SMT siblings online - spectre_v2 mitigation autoselection is enabled (default) After some previous discussion, this leaves STIBP on all the time, as wrmsr on crossing kernel boundary is a no-no. This could perhaps later be a bit more optimized (like disabling it in NOHZ, experiment with disabling it in idle, etc) if needed. Note that the synchronization of the mask manipulation via newly added spec_ctrl_mutex is currently not strictly needed, as the only updater is already being serialized by cpu_add_remove_lock, but let's make this a little bit more future-proof. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "WoodhouseDavid" <dwmw@amazon.co.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: "SchauflerCasey" <casey.schaufler@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251438240.15880@cbobk.fhfr.pm Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* rcu: Make need_resched() respond to urgent RCU-QS needsPaul E. McKenney2018-12-011-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 92aa39e9dc77481b90cbef25e547d66cab901496 upstream. The per-CPU rcu_dynticks.rcu_urgent_qs variable communicates an urgent need for an RCU quiescent state from the force-quiescent-state processing within the grace-period kthread to context switches and to cond_resched(). Unfortunately, such urgent needs are not communicated to need_resched(), which is sometimes used to decide when to invoke cond_resched(), for but one example, within the KVM vcpu_run() function. As of v4.15, this can result in synchronize_sched() being delayed by up to ten seconds, which can be problematic, to say nothing of annoying. This commit therefore checks rcu_dynticks.rcu_urgent_qs from within rcu_check_callbacks(), which is invoked from the scheduling-clock interrupt handler. If the current task is not an idle task and is not executing in usermode, a context switch is forced, and either way, the rcu_dynticks.rcu_urgent_qs variable is set to false. If the current task is an idle task, then RCU's dyntick-idle code will detect the quiescent state, so no further action is required. Similarly, if the task is executing in usermode, other code in rcu_check_callbacks() and its called functions will report the corresponding quiescent state. Reported-by: Marius Hillenbrand <mhillenb@amazon.de> Reported-by: David Woodhouse <dwmw2@infradead.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Backported to make patch apply cleanly on older versions. ] Tested-by: Marius Hillenbrand <mhillenb@amazon.de> Cc: <stable@vger.kernel.org> # 4.12.x - 4.19.x Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kdb: Use strscpy with destination buffer sizePrarit Bhargava2018-12-013-12/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c2b94c72d93d0929f48157eef128c4f9d2e603ce ] gcc 8.1.0 warns with: kernel/debug/kdb/kdb_support.c: In function ‘kallsyms_symbol_next’: kernel/debug/kdb/kdb_support.c:239:4: warning: ‘strncpy’ specified bound depends on the length of the source argument [-Wstringop-overflow=] strncpy(prefix_name, name, strlen(name)+1); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ kernel/debug/kdb/kdb_support.c:239:31: note: length computed here Use strscpy() with the destination buffer size, and use ellipses when displaying truncated symbols. v2: Use strscpy() Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Jonathan Toppins <jtoppins@redhat.com> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: kgdb-bugreport@lists.sourceforge.net Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* sched/fair: Fix cpu_util_wake() for 'execl' type workloadsPatrick Bellasi2018-12-011-14/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c469933e772132aad040bd6a2adc8edf9ad6f825 ] A ~10% regression has been reported for UnixBench's execl throughput test by Aaron Lu and Ye Xiaolong: https://lkml.org/lkml/2018/10/30/765 That test is pretty simple, it does a "recursive" execve() syscall on the same binary. Starting from the syscall, this sequence is possible: do_execve() do_execveat_common() __do_execve_file() sched_exec() select_task_rq_fair() <==| Task already enqueued find_idlest_cpu() find_idlest_group() capacity_spare_wake() <==| Functions not called from cpu_util_wake() | the wakeup path which means we can end up calling cpu_util_wake() not only from the "wakeup path", as its name would suggest. Indeed, the task doing an execve() syscall is already enqueued on the CPU we want to get the cpu_util_wake() for. The estimated utilization for a CPU computed in cpu_util_wake() was written under the assumption that function can be called only from the wakeup path. If instead the task is already enqueued, we end up with a utilization which does not remove the current task's contribution from the estimated utilization of the CPU. This will wrongly assume a reduced spare capacity on the current CPU and increase the chances to migrate the task on execve. The regression is tracked down to: commit d519329f72a6 ("sched/fair: Update util_est only on util_avg updates") because in that patch we turn on by default the UTIL_EST sched feature. However, the real issue is introduced by: commit f9be3e5961c5 ("sched/fair: Use util_est in LB and WU paths") Let's fix this by ensuring to always discount the task estimated utilization from the CPU's estimated utilization when the task is also the current one. The same benchmark of the bug report, executed on a dual socket 40 CPUs Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine, reports these "Execl Throughput" figures (higher the better): mainline : 48136.5 lps mainline+fix : 55376.5 lps which correspond to a 15% speedup. Moreover, since {cpu_util,capacity_spare}_wake() are not really only used from the wakeup path, let's remove this ambiguity by using a better matching name: {cpu_util,capacity_spare}_without(). Since we are at that, let's also improve the existing documentation. Reported-by: Aaron Lu <aaron.lu@intel.com> Reported-by: Ye Xiaolong <xiaolong.ye@intel.com> Tested-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Fixes: f9be3e5961c5 (sched/fair: Use util_est in LB and WU paths) Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/ Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* sched/core: Take the hotplug lock in sched_init_smp()Valentin Schneider2018-11-271-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 40fa3780bac2b654edf23f6b13f4e2dd550aea10 ] When running on linux-next (8c60c36d0b8c ("Add linux-next specific files for 20181019")) + CONFIG_PROVE_LOCKING=y on a big.LITTLE system (e.g. Juno or HiKey960), we get the following report: [ 0.748225] Call trace: [ 0.750685] lockdep_assert_cpus_held+0x30/0x40 [ 0.755236] static_key_enable_cpuslocked+0x20/0xc8 [ 0.760137] build_sched_domains+0x1034/0x1108 [ 0.764601] sched_init_domains+0x68/0x90 [ 0.768628] sched_init_smp+0x30/0x80 [ 0.772309] kernel_init_freeable+0x278/0x51c [ 0.776685] kernel_init+0x10/0x108 [ 0.780190] ret_from_fork+0x10/0x18 The static_key in question is 'sched_asym_cpucapacity' introduced by commit: df054e8445a4 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations") In this particular case, we enable it because smp_prepare_cpus() will end up fetching the capacity-dmips-mhz entry from the devicetree, so we already have some asymmetry detected when entering sched_init_smp(). This didn't get detected in tip/sched/core because we were missing: commit cb538267ea1e ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations") Calls to build_sched_domains() post sched_init_smp() will hold the hotplug lock, it just so happens that this very first call is a special case. As stated by a comment in sched_init_smp(), "There's no userspace yet to cause hotplug operations" so this is a harmless warning. However, to both respect the semantics of underlying callees and make lockdep happy, take the hotplug lock in sched_init_smp(). This also satisfies the comment atop sched_init_domains() that says "Callers must hold the hotplug lock". Reported-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: morten.rasmussen@arm.com Cc: quentin.perret@arm.com Link: http://lkml.kernel.org/r/1540301851-3048-1-git-send-email-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* bpf: fix bpf_prog_get_info_by_fd to return 0 func_lens for unprivDaniel Borkmann2018-11-271-0/+1
| | | | | | | | | | | | | | | | | | [ Upstream commit 28c2fae726bf5003cd209b0d5910a642af98316f ] While dbecd7388476 ("bpf: get kernel symbol addresses via syscall") zeroed info.nr_jited_ksyms in bpf_prog_get_info_by_fd() for queries from unprivileged users, commit 815581c11cc2 ("bpf: get JITed image lengths of functions via syscall") forgot about doing so and therefore returns the #elems of the user set up buffer which is incorrect. It also needs to indicate a info.nr_jited_func_lens of zero. Fixes: 815581c11cc2 ("bpf: get JITed image lengths of functions via syscall") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Sandipan Das <sandipan@linux.vnet.ibm.com> Cc: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* Revert "x86/speculation: Enable cross-hyperthread spectre v2 STIBP mitigation"Greg Kroah-Hartman2018-11-231-10/+1Star
| | | | | | | | | | | | | | | | | | | This reverts commit 233b9d7df0e114c7e7c3674559fb0fc41ada3e8f which is commit 53c613fe6349994f023245519265999eed75957f upstream. It's not ready for the stable trees as there are major slowdowns involved with this patch. Reported-by: Jiri Kosina <jkosina@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "WoodhouseDavid" <dwmw@amazon.co.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: "SchauflerCasey" <casey.schaufler@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kdb: print real address of pointers instead of hashed addressesChristophe Leroy2018-11-212-13/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 568fb6f42ac6851320adaea25f8f1b94de14e40a upstream. Since commit ad67b74d2469 ("printk: hash addresses printed with %p"), all pointers printed with %p are printed with hashed addresses instead of real addresses in order to avoid leaking addresses in dmesg and syslog. But this applies to kdb too, with is unfortunate: Entering kdb (current=0x(ptrval), pid 329) due to Keyboard Entry kdb> ps 15 sleeping system daemon (state M) processes suppressed, use 'ps A' to see all. Task Addr Pid Parent [*] cpu State Thread Command 0x(ptrval) 329 328 1 0 R 0x(ptrval) *sh 0x(ptrval) 1 0 0 0 S 0x(ptrval) init 0x(ptrval) 3 2 0 0 D 0x(ptrval) rcu_gp 0x(ptrval) 4 2 0 0 D 0x(ptrval) rcu_par_gp 0x(ptrval) 5 2 0 0 D 0x(ptrval) kworker/0:0 0x(ptrval) 6 2 0 0 D 0x(ptrval) kworker/0:0H 0x(ptrval) 7 2 0 0 D 0x(ptrval) kworker/u2:0 0x(ptrval) 8 2 0 0 D 0x(ptrval) mm_percpu_wq 0x(ptrval) 10 2 0 0 D 0x(ptrval) rcu_preempt The whole purpose of kdb is to debug, and for debugging real addresses need to be known. In addition, data displayed by kdb doesn't go into dmesg. This patch replaces all %p by %px in kdb in order to display real addresses. Fixes: ad67b74d2469 ("printk: hash addresses printed with %p") Cc: <stable@vger.kernel.org> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kdb: use correct pointer when 'btc' calls 'btt'Christophe Leroy2018-11-211-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit dded2e159208a9edc21dd5c5f583afa28d378d39 upstream. On a powerpc 8xx, 'btc' fails as follows: Entering kdb (current=0x(ptrval), pid 282) due to Keyboard Entry kdb> btc btc: cpu status: Currently on cpu 0 Available cpus: 0 kdb_getarea: Bad address 0x0 when booting the kernel with 'debug_boot_weak_hash', it fails as well Entering kdb (current=0xba99ad80, pid 284) due to Keyboard Entry kdb> btc btc: cpu status: Currently on cpu 0 Available cpus: 0 kdb_getarea: Bad address 0xba99ad80 On other platforms, Oopses have been observed too, see https://github.com/linuxppc/linux/issues/139 This is due to btc calling 'btt' with %p pointer as an argument. This patch replaces %p by %px to get the real pointer value as expected by 'btt' Fixes: ad67b74d2469 ("printk: hash addresses printed with %p") Cc: <stable@vger.kernel.org> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing/kprobes: Check the probe on unloaded module correctlyMasami Hiramatsu2018-11-211-13/+26
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 59158ec4aef7d44be51a6f3e7e17fc64c32604eb ] Current kprobe event doesn't checks correctly whether the given event is on unloaded module or not. It just checks the event has ":" in the name. That is not enough because if we define a probe on non-exist symbol on loaded module, it allows to define that (with warning message) To ensure it correctly, this searches the module name on loaded module list and only if there is not, it allows to define it. (this event will be available when the target module is loaded) Link: http://lkml.kernel.org/r/153547309528.26502.8300278470528281328.stgit@devbox Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bpf: wait for running BPF programs when updating map-in-mapDaniel Colascione2018-11-131-0/+13
| | | | | | | | | | | | | | | | | | | | | commit 1ae80cf31938c8f77c37a29bbe29e7f1cd492be8 upstream. The map-in-map frequently serves as a mechanism for atomic snapshotting of state that a BPF program might record. The current implementation is dangerous to use in this way, however, since userspace has no way of knowing when all programs that might have retrieved the "old" value of the map may have completed. This change ensures that map update operations on map-in-map map types always wait for all references to the old map to drop before returning to userspace. Signed-off-by: Daniel Colascione <dancol@google.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Chenbo Feng <fengc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* userns: also map extents in the reverse map to kernel IDsJann Horn2018-11-131-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d2f007dbe7e4c9583eea6eb04d60001e85c6f1bd upstream. The current logic first clones the extent array and sorts both copies, then maps the lower IDs of the forward mapping into the lower namespace, but doesn't map the lower IDs of the reverse mapping. This means that code in a nested user namespace with >5 extents will see incorrect IDs. It also breaks some access checks, like inode_owner_or_capable() and privileged_wrt_inode_uidgid(), so a process can incorrectly appear to be capable relative to an inode. To fix it, we have to make sure that the "lower_first" members of extents in both arrays are translated; and we have to make sure that the reverse map is sorted *after* the translation (since otherwise the translation can break the sorting). This is CVE-2018-18955. Fixes: 6397fac4915a ("userns: bump idmap limits to 340") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Tested-by: Eric W. Biederman <ebiederm@xmission.com> Reviewed-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Return -ENOENT if there is no target synthetic eventMasami Hiramatsu2018-11-131-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | commit 18858511fd8a877303cc34c06efa461b26a0e070 upstream. Return -ENOENT error if there is no target synthetic event. This notices an operation failure to user as below; # echo 'wakeup_latency u64 lat; pid_t pid;' > synthetic_events # echo '!wakeup' >> synthetic_events sh: write error: No such file or directory Link: http://lkml.kernel.org/r/154013449986.25576.9487131386597290172.stgit@devbox Acked-by: Tom Zanussi <zanussi@linux.intel.com> Tested-by: Tom Zanussi <zanussi@linux.intel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Rajvi Jingar <rajvi.jingar@intel.com> Cc: stable@vger.kernel.org Fixes: 4b147936fa50 ('tracing: Add support for 'synthetic' events') Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* genirq: Fix race on spurious interrupt detectionLukas Wunner2018-11-131-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 746a923b863a1065ef77324e1e43f19b1a3eab5c upstream. Commit 1e77d0a1ed74 ("genirq: Sanitize spurious interrupt detection of threaded irqs") made detection of spurious interrupts work for threaded handlers by: a) incrementing a counter every time the thread returns IRQ_HANDLED, and b) checking whether that counter has increased every time the thread is woken. However for oneshot interrupts, the commit unmasks the interrupt before incrementing the counter. If another interrupt occurs right after unmasking but before the counter is incremented, that interrupt is incorrectly considered spurious: time | irq_thread() | irq_thread_fn() | action->thread_fn() | irq_finalize_oneshot() | unmask_threaded_irq() /* interrupt is unmasked */ | | /* interrupt fires, incorrectly deemed spurious */ | | atomic_inc(&desc->threads_handled); /* counter is incremented */ v This is observed with a hi3110 CAN controller receiving data at high volume (from a separate machine sending with "cangen -g 0 -i -x"): The controller signals a huge number of interrupts (hundreds of millions per day) and every second there are about a dozen which are deemed spurious. In theory with high CPU load and the presence of higher priority tasks, the number of incorrectly detected spurious interrupts might increase beyond the 99,900 threshold and cause disablement of the interrupt. In practice it just increments the spurious interrupt count. But that can cause people to waste time investigating it over and over. Fix it by moving the accounting before the invocation of irq_finalize_oneshot(). [ tglx: Folded change log update ] Fixes: 1e77d0a1ed74 ("genirq: Sanitize spurious interrupt detection of threaded irqs") Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mathias Duckeck <m.duckeck@kunbus.de> Cc: Akshay Bhat <akshay.bhat@timesys.com> Cc: Casey Fitzpatrick <casey.fitzpatrick@timesys.com> Cc: stable@vger.kernel.org # v3.16+ Link: https://lkml.kernel.org/r/1dfd8bbd16163940648045495e3e9698e63b50ad.1539867047.git.lukas@wunner.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* printk: Fix panic caused by passing log_buf_len to command lineHe Zhe2018-11-131-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 277fcdb2cfee38ccdbe07e705dbd4896ba0c9930 upstream. log_buf_len_setup does not check input argument before passing it to simple_strtoull. The argument would be a NULL pointer if "log_buf_len", without its value, is set in command line and thus causes the following panic. PANIC: early exception 0xe3 IP 10:ffffffffaaeacd0d error 0 cr2 0x0 [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc4-yocto-standard+ #1 [ 0.000000] RIP: 0010:_parse_integer_fixup_radix+0xd/0x70 ... [ 0.000000] Call Trace: [ 0.000000] simple_strtoull+0x29/0x70 [ 0.000000] memparse+0x26/0x90 [ 0.000000] log_buf_len_setup+0x17/0x22 [ 0.000000] do_early_param+0x57/0x8e [ 0.000000] parse_args+0x208/0x320 [ 0.000000] ? rdinit_setup+0x30/0x30 [ 0.000000] parse_early_options+0x29/0x2d [ 0.000000] ? rdinit_setup+0x30/0x30 [ 0.000000] parse_early_param+0x36/0x4d [ 0.000000] setup_arch+0x336/0x99e [ 0.000000] start_kernel+0x6f/0x4ee [ 0.000000] x86_64_start_reservations+0x24/0x26 [ 0.000000] x86_64_start_kernel+0x6f/0x72 [ 0.000000] secondary_startup_64+0xa4/0xb0 This patch adds a check to prevent the panic. Link: http://lkml.kernel.org/r/1538239553-81805-1-git-send-email-zhe.he@windriver.com Cc: stable@vger.kernel.org Cc: rostedt@goodmis.org Cc: linux-kernel@vger.kernel.org Signed-off-by: He Zhe <zhe.he@windriver.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kbuild: fix kernel/bounds.c 'W=1' warningArnd Bergmann2018-11-131-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6a32c2469c3fbfee8f25bcd20af647326650a6cf upstream. Building any configuration with 'make W=1' produces a warning: kernel/bounds.c:16:6: warning: no previous prototype for 'foo' [-Wmissing-prototypes] When also passing -Werror, this prevents us from building any other files. Nobody ever calls the function, but we can't make it 'static' either since we want the compiler output. Calling it 'main' instead however avoids the warning, because gcc does not insist on having a declaration for main. Link: http://lkml.kernel.org/r/20181005083313.2088252-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reported-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com> Reviewed-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* signal: Guard against negative signal numbers in copy_siginfo_from_user32Eric W. Biederman2018-11-131-1/+1
| | | | | | | | | | | | | | | | | | | commit a36700589b85443e28170be59fa11c8a104130a5 upstream. While fixing an out of bounds array access in known_siginfo_layout reported by the kernel test robot it became apparent that the same bug exists in siginfo_layout and affects copy_siginfo_from_user32. The straight forward fix that makes guards against making this mistake in the future and should keep the code size small is to just take an unsigned signal number instead of a signed signal number, as I did to fix known_siginfo_layout. Cc: stable@vger.kernel.org Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* signal: Always deliver the kernel's SIGKILL and SIGSTOP to a pid namespace initEric W. Biederman2018-11-131-1/+1
| | | | | | | | | | | | | | | | | [ Upstream commit 3597dfe01d12f570bc739da67f857fd222a3ea66 ] Instead of playing whack-a-mole and changing SEND_SIG_PRIV to SEND_SIG_FORCED throughout the kernel to ensure a pid namespace init gets signals sent by the kernel, stop allowing a pid namespace init to ignore SIGKILL or SIGSTOP sent by the kernel. A pid namespace init is only supposed to be able to ignore signals sent from itself and children with SIG_DFL. Fixes: 921cf9f63089 ("signals: protect cinit from unblocked SIG_DFL signals") Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bpf/verifier: fix verifier instabilityAlexei Starovoitov2018-11-131-8/+8
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a9c676bc8fc58d00eea9836fb14ee43c0346416a ] Edward Cree says: In check_mem_access(), for the PTR_TO_CTX case, after check_ctx_access() has supplied a reg_type, the other members of the register state are set appropriately. Previously reg.range was set to 0, but as it is in a union with reg.map_ptr, which is larger, upper bytes of the latter were left in place. This then caused the memcmp() in regsafe() to fail, preventing some branches from being pruned (and occasionally causing the same program to take a varying number of processed insns on repeated verifier runs). Fix the instability by clearing bpf_reg_state in __mark_reg_[un]known() Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Debugged-by: Edward Cree <ecree@solarflare.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()Masami Hiramatsu2018-11-131-7/+20
| | | | | | | | | | | | | | | | | | | | [ Upstream commit 819319fc93461c07b9cdb3064f154bd8cfd48172 ] Make reuse_unused_kprobe() to return error code if it fails to reuse unused kprobe for optprobe instead of calling BUG_ON(). Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S . Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/153666124040.21306.14150398706331307654.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* signal: Introduce COMPAT_SIGMINSTKSZ for use in compat_sys_sigaltstackWill Deacon2018-11-131-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 22839869f21ab3850fbbac9b425ccc4c0023926f ] The sigaltstack(2) system call fails with -ENOMEM if the new alternative signal stack is found to be smaller than SIGMINSTKSZ. On architectures such as arm64, where the native value for SIGMINSTKSZ is larger than the compat value, this can result in an unexpected error being reported to a compat task. See, for example: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=904385 This patch fixes the problem by extending do_sigaltstack to take the minimum signal stack size as an additional parameter, allowing the native and compat system call entry code to pass in their respective values. COMPAT_SIGMINSTKSZ is just defined as SIGMINSTKSZ if it has not been defined by the architecture. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Reported-by: Steve McIntyre <steve.mcintyre@arm.com> Tested-by: Steve McIntyre <93sam@debian.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* locking/lockdep: Fix debug_locks off performance problemWaiman Long2018-11-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9506a7425b094d2f1d9c877ed5a78f416669269b ] It was found that when debug_locks was turned off because of a problem found by the lockdep code, the system performance could drop quite significantly when the lock_stat code was also configured into the kernel. For instance, parallel kernel build time on a 4-socket x86-64 server nearly doubled. Further analysis into the cause of the slowdown traced back to the frequent call to debug_locks_off() from the __lock_acquired() function probably due to some inconsistent lockdep states with debug_locks off. The debug_locks_off() function did an unconditional atomic xchg to write a 0 value into debug_locks which had already been set to 0. This led to severe cacheline contention in the cacheline that held debug_locks. As debug_locks is being referenced in quite a few different places in the kernel, this greatly slow down the system performance. To prevent that trashing of debug_locks cacheline, lock_acquired() and lock_contended() now checks the state of debug_locks before proceeding. The debug_locks_off() function is also modified to check debug_locks before calling __debug_locks_off(). Signed-off-by: Waiman Long <longman@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1539913518-15598-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/speculation: Enable cross-hyperthread spectre v2 STIBP mitigationJiri Kosina2018-11-131-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 53c613fe6349994f023245519265999eed75957f upstream. STIBP is a feature provided by certain Intel ucodes / CPUs. This feature (once enabled) prevents cross-hyperthread control of decisions made by indirect branch predictors. Enable this feature if - the CPU is vulnerable to spectre v2 - the CPU supports SMT and has SMT siblings online - spectre_v2 mitigation autoselection is enabled (default) After some previous discussion, this leaves STIBP on all the time, as wrmsr on crossing kernel boundary is a no-no. This could perhaps later be a bit more optimized (like disabling it in NOHZ, experiment with disabling it in idle, etc) if needed. Note that the synchronization of the mask manipulation via newly added spec_ctrl_mutex is currently not strictly needed, as the only updater is already being serialized by cpu_add_remove_lock, but let's make this a little bit more future-proof. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "WoodhouseDavid" <dwmw@amazon.co.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: "SchauflerCasey" <casey.schaufler@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251438240.15880@cbobk.fhfr.pm Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>