summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* mm: use VM_BUG_ON_MM where possibleSasha Levin2014-10-102-3/+2Star
| | | | | | | | Dump the contents of the relevant struct_mm when we hit the bug condition. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mempolicy: remove the "task" arg of vma_policy_mof() and simplify itOleg Nesterov2014-10-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | 1. vma_policy_mof(task) is simply not safe unless task == current, it can race with do_exit()->mpol_put(). Remove this arg and update its single caller. 2. vma can not be NULL, remove this check and simplify the code. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove noisy remainder of the scan_unevictable interfaceJohannes Weiner2014-10-101-7/+0Star
| | | | | | | | | | | | | | | | The deprecation warnings for the scan_unevictable interface triggers by scripts doing `sysctl -a | grep something else'. This is annoying and not helpful. The interface has been defunct since 264e56d8247e ("mm: disable user interface to manually rescue unevictable pages"), which was in 2011, and there haven't been any reports of usecases for it, only reports that the deprecation warnings are annying. It's unlikely that anybody is using this interface specifically at this point, so remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operationCyrill Gorcunov2014-10-101-1/+189
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During development of c/r we've noticed that in case if we need to support user namespaces we face a problem with capabilities in prctl(PR_SET_MM, ...) call, in particular once new user namespace is created capable(CAP_SYS_RESOURCE) no longer passes. A approach is to eliminate CAP_SYS_RESOURCE check but pass all new values in one bundle, which would allow the kernel to make more intensive test for sanity of values and same time allow us to support checkpoint/restore of user namespaces. Thus a new command PR_SET_MM_MAP introduced. It takes a pointer of prctl_mm_map structure which carries all the members to be updated. prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size) struct prctl_mm_map { __u64 start_code; __u64 end_code; __u64 start_data; __u64 end_data; __u64 start_brk; __u64 brk; __u64 start_stack; __u64 arg_start; __u64 arg_end; __u64 env_start; __u64 env_end; __u64 *auxv; __u32 auxv_size; __u32 exe_fd; }; All members except @exe_fd correspond ones of struct mm_struct. To figure out which available values these members may take here are meanings of the members. - start_code, end_code: represent bounds of executable code area - start_data, end_data: represent bounds of data area - start_brk, brk: used to calculate bounds for brk() syscall - start_stack: used when accounting space needed for command line arguments, environment and shmat() syscall - arg_start, arg_end, env_start, env_end: represent memory area supplied for command line arguments and environment variables - auxv, auxv_size: carries auxiliary vector, Elf format specifics - exe_fd: file descriptor number for executable link (/proc/self/exe) Thus we apply the following requirements to the values 1) Any member except @auxv, @auxv_size, @exe_fd is rather an address in user space thus it must be laying inside [mmap_min_addr, mmap_max_addr) interval. 2) While @[start|end]_code and @[start|end]_data may point to an nonexisting VMAs (say a program maps own new .text and .data segments during execution) the rest of members should belong to VMA which must exist. 3) Addresses must be ordered, ie @start_ member must not be greater or equal to appropriate @end_ member. 4) As in regular Elf loading procedure we require that @start_brk and @brk be greater than @end_data. 5) If RLIMIT_DATA rlimit is set to non-infinity new values should not exceed existing limit. Same applies to RLIMIT_STACK. 6) Auxiliary vector size must not exceed existing one (which is predefined as AT_VECTOR_SIZE and depends on architecture). 7) File descriptor passed in @exe_file should be pointing to executable file (because we use existing prctl_set_mm_exe_file_locked helper it ensures that the file we are going to use as exe link has all required permission granted). Now about where these members are involved inside kernel code: - @start_code and @end_code are used in /proc/$pid/[stat|statm] output; - @start_data and @end_data are used in /proc/$pid/[stat|statm] output, also they are considered if there enough space for brk() syscall result if RLIMIT_DATA is set; - @start_brk shown in /proc/$pid/stat output and accounted in brk() syscall if RLIMIT_DATA is set; also this member is tested to find a symbolic name of mmap event for perf system (we choose if event is generated for "heap" area); one more aplication is selinux -- we test if a process has PROCESS__EXECHEAP permission if trying to make heap area being executable with mprotect() syscall; - @brk is a current value for brk() syscall which lays inside heap area, it's shown in /proc/$pid/stat. When syscall brk() succesfully provides new memory area to a user space upon brk() completion the mm::brk is updated to carry new value; Both @start_brk and @brk are actively used in /proc/$pid/maps and /proc/$pid/smaps output to find a symbolic name "heap" for VMA being scanned; - @start_stack is printed out in /proc/$pid/stat and used to find a symbolic name "stack" for task and threads in /proc/$pid/maps and /proc/$pid/smaps output, and as the same as with @start_brk -- perf system uses it for event naming. Also kernel treat this member as a start address of where to map vDSO pages and to check if there is enough space for shmat() syscall; - @arg_start, @arg_end, @env_start and @env_end are printed out in /proc/$pid/stat. Another access to the data these members represent is to read /proc/$pid/environ or /proc/$pid/cmdline. Any attempt to read these areas kernel tests with access_process_vm helper so a user must have enough rights for this action; - @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly speaking kernel doesn't care much about which exactly data is sitting there because it is solely for userspace; - @exe_fd is referred from /proc/$pid/exe and when generating coredump. We uses prctl_set_mm_exe_file_locked helper to update this member, so exe-file link modification remains one-shot action. Still note that updating exe-file link now doesn't require sys-resource capability anymore, after all there is no much profit in preventing setup own file link (there are a number of ways to execute own code -- ptrace, ld-preload, so that the only reliable way to find which exactly code is executed is to inspect running program memory). Still we require the caller to be at least user-namespace root user. I believe the old interface should be deprecated and ripped off in a couple of kernel releases if no one against. To test if new interface is implemented in the kernel one can pass PR_SET_MM_MAP_SIZE opcode and the kernel returns the size of currently supported struct prctl_mm_map. [akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions] Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Acked-by: Andrew Vagin <avagin@openvz.org> Tested-by: Andrew Vagin <avagin@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Julien Tinnes <jln@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* prctl: PR_SET_MM -- factor out mmap_sem when updating mm::exe_fileCyrill Gorcunov2014-10-101-10/+11
| | | | | | | | | | | | | | | | | | | | | Instead of taking mm->mmap_sem inside prctl_set_mm_exe_file() move it out and rename the helper to prctl_set_mm_exe_file_locked(). This will allow to reuse this function in a next patch. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Julien Tinnes <jln@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: use may_adjust_brk helperCyrill Gorcunov2014-10-101-7/+4Star
| | | | | | | | | | | | | | | | | Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Julien Tinnes <jln@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kernel/kthread.c: partial revert of 81c98869faa5 ("kthread: ensure locality ↵Nishanth Aravamudan2014-10-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | of task_struct allocations") After discussions with Tejun, we don't want to spread the use of cpu_to_mem() (and thus knowledge of allocators/NUMA topology details) into callers, but would rather ensure the callees correctly handle memoryless nodes. With the previous patches ("topology: add support for node_to_mem_node() to determine the fallback node" and "slub: fallback to node_to_mem_node() node if allocating on memoryless node") adding and using node_to_mem_node(), we can safely undo part of the change to the kthread logic from 81c98869faa5. Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Han Pingtian <hanpt@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anton Blanchard <anton@samba.org> Cc: Christoph Lameter <cl@linux.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* softlockup: make detector be aware of task switch of processes hogging cpuchai wen2014-10-101-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | For now, soft lockup detector warns once for each case of process softlockup. But the thread 'watchdog/n' may not always get the cpu at the time slot between the task switch of two processes hogging that cpu to reset soft_watchdog_warn. An example would be two processes hogging the cpu. Process A causes the softlockup warning and is killed manually by a user. Process B immediately becomes the new process hogging the cpu preventing the softlockup code from resetting the soft_watchdog_warn variable. This case is a false negative of "warn only once for a process", as there may be a different process that is going to hog the cpu. Resolve this by saving/checking the task pointer of the hogging process and use that to reset soft_watchdog_warn too. [dzickus@redhat.com: update comment] Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'irq-core-for-linus' of ↵Linus Torvalds2014-10-092-0/+45
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "The irq departement delivers: - a cleanup series to get rid of mindlessly copied code. - another bunch of new pointlessly different interrupt chip drivers. Adding homebrewn irq chips (and timers) to SoCs must provide a value add which is beyond the imagination of mere mortals. - the usual SoC irq controller updates, IOW my second cat herding project" * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) irqchip: gic-v3: Implement CPU PM notifier irqchip: gic-v3: Refactor gic_enable_redist to support both enabling and disabling irqchip: renesas-intc-irqpin: Add minimal runtime PM support irqchip: renesas-intc-irqpin: Add helper variable dev = &pdev->dev irqchip: atmel-aic5: Add sama5d4 support irqchip: atmel-aic5: The sama5d3 has 48 IRQs Documentation: bcm7120-l2: Add Broadcom BCM7120-style L2 binding irqchip: bcm7120-l2: Add Broadcom BCM7120-style Level 2 interrupt controller irqchip: renesas-irqc: Add binding docs for new R-Car Gen2 SoCs irqchip: renesas-irqc: Add DT binding documentation irqchip: renesas-intc-irqpin: Document SoC-specific bindings openrisc: Get rid of handle_IRQ arm64: Get rid of handle_IRQ ARM: omap2: irq: Convert to handle_domain_irq ARM: imx: tzic: Convert to handle_domain_irq ARM: imx: avic: Convert to handle_domain_irq irqchip: or1k-pic: Convert to handle_domain_irq irqchip: atmel-aic5: Convert to handle_domain_irq irqchip: atmel-aic: Convert to handle_domain_irq irqchip: gic-v3: Convert to handle_domain_irq ...
| * genirq: Add irq_domain-aware core IRQ handlerMarc Zyngier2014-09-032-0/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Calling irq_find_mapping from outside a irq_{enter,exit} section is unsafe and produces ugly messages if CONFIG_PROVE_RCU is enabled: If coming from the idle state, the rcu_read_lock call in irq_find_mapping will generate an unpleasant warning: <quote> =============================== [ INFO: suspicious RCU usage. ] 3.16.0-rc1+ #135 Not tainted ------------------------------- include/linux/rcupdate.h:871 rcu_read_lock() used illegally while idle! other info that might help us debug this: RCU used illegally from idle CPU! rcu_scheduler_active = 1, debug_locks = 0 RCU used illegally from extended quiescent state! 1 lock held by swapper/0/0: #0: (rcu_read_lock){......}, at: [<ffffffc00010206c>] irq_find_mapping+0x4c/0x198 </quote> As this issue is fairly widespread and involves at least three different architectures, a possible solution is to add a new handle_domain_irq entry point into the generic IRQ code that the interrupt controller code can call. This new function takes an irq_domain, and calls into irq_find_domain inside the irq_{enter,exit} block. An additional "lookup" parameter is used to allow non-domain architecture code to be replaced by this as well. Interrupt controllers can then be updated to use the new mechanism. This code is sitting behind a new CONFIG_HANDLE_DOMAIN_IRQ, as not all architectures implement set_irq_regs (yes, mn10300, I'm looking at you...). Reported-by: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Link: https://lkml.kernel.org/r/1409047421-27649-2-git-send-email-marc.zyngier@arm.com Signed-off-by: Jason Cooper <jason@lakedaemon.net>
* | Merge branch 'timers-core-for-linus' of ↵Linus Torvalds2014-10-091-0/+8
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Nothing really exciting this time: - a few fixlets in the NOHZ code - a new ARM SoC timer abomination. One should expect that we have enough of them already, but they insist on inventing new ones. - the usual bunch of ARM SoC timer updates. That feels like herding cats" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: arm_arch_timer: Consolidate arch_timer_evtstrm_enable clocksource: arm_arch_timer: Enable counter access for 32-bit ARM clocksource: arm_arch_timer: Change clocksource name if CP15 unavailable clocksource: sirf: Disable counter before re-setting it clocksource: cadence_ttc: Add support for 32bit mode clocksource: tcb_clksrc: Sanitize IRQ request clocksource: arm_arch_timer: Discard unavailable timers correctly clocksource: vf_pit_timer: Support shutdown mode ARM: meson6: clocksource: Add Meson6 timer support ARM: meson: documentation: Add timer documentation clocksource: sh_tmu: Document r8a7779 binding clocksource: sh_mtu2: Document r7s72100 binding clocksource: sh_cmt: Document SoC specific bindings timerfd: Remove an always true check nohz: Avoid tick's double reprogramming in highres mode nohz: Fix spurious periodic tick behaviour in low-res dynticks mode
| * \ Merge branch 'nohz/drop-double-write-v3' of ↵Ingo Molnar2014-08-261-0/+8
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/core Pull nohz fixes from Frederic Weisbecker: " The tick reschedules itself unconditionally. It's relevant in periodic mode but not in dynticks mode where it results in spurious double clock writes and even spurious periodic behaviour for low-res case. This set fixes that: * 1st patch removes low-res periodic tick rescheduling in nohz mode. This fixes spurious periodic behaviour. * 2nd patch does the same for high-res mode. Here there is no such spurious periodic behaviour but it still spares a double clock write in some cases. " Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | nohz: Avoid tick's double reprogramming in highres modeViresh Kumar2014-08-221-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In highres mode, the tick reschedules itself unconditionally to the next jiffies. However while this clock reprogramming is relevant when the tick is in periodic mode, it's not that interesting when we run in dynticks mode because irq exit is likely going to overwrite the next tick to some randomly deferred future. So lets just get rid of this tick self rescheduling in dynticks mode. This way we can avoid some clockevents double write in favourable scenarios like when we stop the tick completely in idle while no other hrtimer is pending. Suggested-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
| | * | nohz: Fix spurious periodic tick behaviour in low-res dynticks modeViresh Kumar2014-08-221-0/+4
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we reach the end of the tick handler, we unconditionally reschedule the next tick to the next jiffy. Then on irq exit, the nohz code overrides that setting if needed and defers the next tick as far away in the future as possible. Now in the best dynticks case, when we actually don't need any tick in the future (ie: expires == KTIME_MAX), low-res and high-res behave differently. What we want in this case is to cancel the next tick programmed by the previous one. That's what we do in high-res mode. OTOH we lack a low-res mode equivalent of hrtimer_cancel() so we simply don't do anything in this case and the next tick remains scheduled to jiffies + 1. As a result, in low-res mode, when the dynticks code determines that no tick is needed in the future, we can recursively get a spurious tick every jiffy because then the next tick is always reprogrammed from the tick handler and is never cancelled. And this can happen indefinetly until some subsystem actually needs a precise tick in the future and only then we eventually overwrite the previous tick handler setting to defer the next tick. We are fixing this by introducing the ONESHOT_STOPPED mode which will let us pause a clockevent when no further interrupt is needed. Meanwhile we can't expect all drivers to support this new mode. So lets reduce much of the symptoms by skipping the nohz-blind tick rescheduling from the tick-handler when the CPU is in dynticks mode. That tick rescheduling wrongly assumed periodicity and the low-res dynticks code can't cancel such decision. This breaks the recursive (and thus the worst) part of the problem. In the worst case now, we'll get only one extra tick due to uncancelled tick scheduled before we entered dynticks mode. This also removes a needless clockevent write on idle ticks. Since those clock write are usually considered to be slow, it's a general win. Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
* | | Merge branch 'timers-nohz-for-linus' of ↵Linus Torvalds2014-10-095-22/+55
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "Main changes: - Fix the deadlock reported by Dave Jones et al - Clean up and fix nohz_full interaction with arch abilities - nohz init code consolidation/cleanup" * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: nohz: nohz full depends on irq work self IPI support nohz: Consolidate nohz full init code arm64: Tell irq work about self IPI support arm: Tell irq work about self IPI support x86: Tell irq work about self IPI support irq_work: Force raised irq work to run on irq work interrupt irq_work: Introduce arch_irq_work_has_interrupt() nohz: Move nohz full init call to tick init
| * | | nohz: nohz full depends on irq work self IPI supportFrederic Weisbecker2014-09-131-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The nohz full functionality depends on IRQ work to trigger its own interrupts. As it's used to restart the tick, we can't rely on the tick fallback for irq work callbacks, ie: we can't use the tick to restart the tick itself. Lets reject the full dynticks initialization if that arch support isn't available. As a side effect, this makes sure that nohz kick is never called from the tick. That otherwise would result in illegal hrtimer self-cancellation and lockup. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
| * | | nohz: Consolidate nohz full init codeFrederic Weisbecker2014-09-131-19/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The supports for CONFIG_NO_HZ_FULL_ALL=y and the nohz_full= kernel parameter both have their own way to do the same thing: allocate full dynticks cpumasks, fill them and initialize some state variables. Lets consolidate that all in the same place. While at it, convert some regular printk message to warnings when fundamental allocations fail. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
| * | | irq_work: Force raised irq work to run on irq work interruptFrederic Weisbecker2014-09-132-3/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The nohz full kick, which restarts the tick when any resource depend on it, can't be executed anywhere given the operation it does on timers. If it is called from the scheduler or timers code, chances are that we run into a deadlock. This is why we run the nohz full kick from an irq work. That way we make sure that the kick runs on a virgin context. However if that's the case when irq work runs in its own dedicated self-ipi, things are different for the big bunch of archs that don't support the self triggered way. In order to support them, irq works are also handled by the timer interrupt as fallback. Now when irq works run on the timer interrupt, the context isn't blank. More precisely, they can run in the context of the hrtimer that runs the tick. But the nohz kick cancels and restarts this hrtimer and cancelling an hrtimer from itself isn't allowed. This is why we run in an endless loop: Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2 CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34 Workqueue: btrfs-endio-write normal_work_helper [btrfs] ffff880244c06c88 000000001b486fe1 ffff880244c06bf0 ffffffff8a7f1e37 ffffffff8ac52a18 ffff880244c06c78 ffffffff8a7ef928 0000000000000010 ffff880244c06c88 ffff880244c06c20 000000001b486fe1 0000000000000000 Call Trace: <NMI[<ffffffff8a7f1e37>] dump_stack+0x4e/0x7a [<ffffffff8a7ef928>] panic+0xd4/0x207 [<ffffffff8a1450e8>] watchdog_overflow_callback+0x118/0x120 [<ffffffff8a186b0e>] __perf_event_overflow+0xae/0x350 [<ffffffff8a184f80>] ? perf_event_task_disable+0xa0/0xa0 [<ffffffff8a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150 [<ffffffff8a187934>] perf_event_overflow+0x14/0x20 [<ffffffff8a020386>] intel_pmu_handle_irq+0x206/0x410 [<ffffffff8a01937b>] perf_event_nmi_handler+0x2b/0x50 [<ffffffff8a007b72>] nmi_handle+0xd2/0x390 [<ffffffff8a007aa5>] ? nmi_handle+0x5/0x390 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0 [<ffffffff8a008062>] default_do_nmi+0x72/0x1c0 [<ffffffff8a008268>] do_nmi+0xb8/0x100 [<ffffffff8a7ff66a>] end_repeat_nmi+0x1e/0x2e [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0 <<EOE><IRQ[<ffffffff8a0ccd2f>] lock_acquired+0xaf/0x450 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50 [<ffffffff8a7fc678>] _raw_spin_lock_irqsave+0x78/0x90 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50 [<ffffffff8a0f74c5>] lock_hrtimer_base.isra.20+0x25/0x50 [<ffffffff8a0f7723>] hrtimer_try_to_cancel+0x33/0x1e0 [<ffffffff8a0f78ea>] hrtimer_cancel+0x1a/0x30 [<ffffffff8a109237>] tick_nohz_restart+0x17/0x90 [<ffffffff8a10a213>] __tick_nohz_full_check+0xc3/0x100 [<ffffffff8a10a25e>] nohz_full_kick_work_func+0xe/0x10 [<ffffffff8a17c884>] irq_work_run_list+0x44/0x70 [<ffffffff8a17c8da>] irq_work_run+0x2a/0x50 [<ffffffff8a0f700b>] update_process_times+0x5b/0x70 [<ffffffff8a109005>] tick_sched_handle.isra.21+0x25/0x60 [<ffffffff8a109b81>] tick_sched_timer+0x41/0x60 [<ffffffff8a0f7aa2>] __run_hrtimer+0x72/0x470 [<ffffffff8a109b40>] ? tick_sched_do_timer+0xb0/0xb0 [<ffffffff8a0f8707>] hrtimer_interrupt+0x117/0x270 [<ffffffff8a034357>] local_apic_timer_interrupt+0x37/0x60 [<ffffffff8a80010f>] smp_apic_timer_interrupt+0x3f/0x50 [<ffffffff8a7fe52f>] apic_timer_interrupt+0x6f/0x80 To fix this we force non-lazy irq works to run on irq work self-IPIs when available. That ability of the arch to trigger irq work self IPIs is available with arch_irq_work_has_interrupt(). Reported-by: Catalin Iacob <iacobcatalin@gmail.com> Reported-by: Dave Jones <davej@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
| * | | nohz: Move nohz full init call to tick initFrederic Weisbecker2014-09-132-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This way we unbloat a bit main.c and more importantly we initialize nohz full after init_IRQ(). This dependency will be needed in further patches because nohz full needs irq work to raise its own IRQ. Information about the support for this ability on ARM64 is obtained on init_IRQ() which initialize the pointer to __smp_call_function. Since tick_init() is called right after init_IRQ(), this is a good place to call tick_nohz_init() and prepare for that dependency. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
* | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2014-10-098-7/+2782
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: "Most notable changes in here: 1) By far the biggest accomplishment, thanks to a large range of contributors, is the addition of multi-send for transmit. This is the result of discussions back in Chicago, and the hard work of several individuals. Now, when the ->ndo_start_xmit() method of a driver sees skb->xmit_more as true, it can choose to defer the doorbell telling the driver to start processing the new TX queue entires. skb->xmit_more means that the generic networking is guaranteed to call the driver immediately with another SKB to send. There is logic added to the qdisc layer to dequeue multiple packets at a time, and the handling mis-predicted offloads in software is now done with no locks held. Finally, pktgen is extended to have a "burst" parameter that can be used to test a multi-send implementation. Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4, virtio_net Adding support is almost trivial, so export more drivers to support this optimization soon. I want to thank, in no particular or implied order, Jesper Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann, David Tat, Hannes Frederic Sowa, and Rusty Russell. 2) PTP and timestamping support in bnx2x, from Michal Kalderon. 3) Allow adjusting the rx_copybreak threshold for a driver via ethtool, and add rx_copybreak support to enic driver. From Govindarajulu Varadarajan. 4) Significant enhancements to the generic PHY layer and the bcm7xxx driver in particular (EEE support, auto power down, etc.) from Florian Fainelli. 5) Allow raw buffers to be used for flow dissection, allowing drivers to determine the optimal "linear pull" size for devices that DMA into pools of pages. The objective is to get exactly the necessary amount of headers into the linear SKB area pre-pulled, but no more. The new interface drivers use is eth_get_headlen(). From WANG Cong, with driver conversions (several had their own by-hand duplicated implementations) by Alexander Duyck and Eric Dumazet. 6) Support checksumming more smoothly and efficiently for encapsulations, and add "foo over UDP" facility. From Tom Herbert. 7) Add Broadcom SF2 switch driver to DSA layer, from Florian Fainelli. 8) eBPF now can load programs via a system call and has an extensive testsuite. Alexei Starovoitov and Daniel Borkmann. 9) Major overhaul of the packet scheduler to use RCU in several major areas such as the classifiers and rate estimators. From John Fastabend. 10) Add driver for Intel FM10000 Ethernet Switch, from Alexander Duyck. 11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric Dumazet. 12) Add Datacenter TCP congestion control algorithm support, From Florian Westphal. 13) Reorganize sk_buff so that __copy_skb_header() is significantly faster. From Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits) netlabel: directly return netlbl_unlabel_genl_init() net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers net: description of dma_cookie cause make xmldocs warning cxgb4: clean up a type issue cxgb4: potential shift wrapping bug i40e: skb->xmit_more support net: fs_enet: Add NAPI TX net: fs_enet: Remove non NAPI RX r8169:add support for RTL8168EP net_sched: copy exts->type in tcf_exts_change() wimax: convert printk to pr_foo() af_unix: remove 0 assignment on static ipv6: Do not warn for informational ICMP messages, regardless of type. Update Intel Ethernet Driver maintainers list bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING tipc: fix bug in multicast congestion handling net: better IFF_XMIT_DST_RELEASE support net/mlx4_en: remove NETDEV_TX_BUSY 3c59x: fix bad split of cpu_to_le32(pci_map_single()) net: bcmgenet: fix Tx ring priority programming ...
| * \ \ \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-10-082-3/+6
| |\ \ \ \
| * \ \ \ \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-10-022-39/+20Star
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/usb/r8152.c net/netfilter/nfnetlink.c Both r8152 and nfnetlink conflicts were simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | bpf: add search pruning optimization to verifierAlexei Starovoitov2014-10-021-0/+146
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | consider C program represented in eBPF: int filter(int arg) { int a, b, c, *ptr; if (arg == 1) ptr = &a; else if (arg == 2) ptr = &b; else ptr = &c; *ptr = 0; return 0; } eBPF verifier has to follow all possible paths through the program to recognize that '*ptr = 0' instruction would be safe to execute in all situations. It's doing it by picking a path towards the end and observes changes to registers and stack at every insn until it reaches bpf_exit. Then it comes back to one of the previous branches and goes towards the end again with potentially different values in registers. When program has a lot of branches, the number of possible combinations of branches is huge, so verifer has a hard limit of walking no more than 32k instructions. This limit can be reached and complex (but valid) programs could be rejected. Therefore it's important to recognize equivalent verifier states to prune this depth first search. Basic idea can be illustrated by the program (where .. are some eBPF insns): 1: .. 2: if (rX == rY) goto 4 3: .. 4: .. 5: .. 6: bpf_exit In the first pass towards bpf_exit the verifier will walk insns: 1, 2, 3, 4, 5, 6 Since insn#2 is a branch the verifier will remember its state in verifier stack to come back to it later. Since insn#4 is marked as 'branch target', the verifier will remember its state in explored_states[4] linked list. Once it reaches insn#6 successfully it will pop the state recorded at insn#2 and will continue. Without search pruning optimization verifier would have to walk 4, 5, 6 again, effectively simulating execution of insns 1, 2, 4, 5, 6 With search pruning it will check whether state at #4 after jumping from #2 is equivalent to one recorded in explored_states[4] during first pass. If there is an equivalent state, verifier can prune the search at #4 and declare this path to be safe as well. In other words two states at #4 are equivalent if execution of 1, 2, 3, 4 insns and 1, 2, 4 insns produces equivalent registers and stack. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | bpf: mini eBPF library, test stubs and verifier testsuiteAlexei Starovoitov2014-09-262-0/+120
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. the library includes a trivial set of BPF syscall wrappers: int bpf_create_map(int key_size, int value_size, int max_entries); int bpf_update_elem(int fd, void *key, void *value); int bpf_lookup_elem(int fd, void *key, void *value); int bpf_delete_elem(int fd, void *key); int bpf_get_next_key(int fd, void *key, void *next_key); int bpf_prog_load(enum bpf_prog_type prog_type, const struct sock_filter_int *insns, int insn_len, const char *license); bpf_prog_load() stores verifier log into global bpf_log_buf[] array and BPF_*() macros to build instructions 2. test stubs configure eBPF infra with 'unspec' map and program types. These are fake types used by user space testsuite only. 3. verifier tests valid and invalid programs and expects predefined error log messages from kernel. 40 tests so far. $ sudo ./test_verifier #0 add+sub+mul OK #1 unreachable OK #2 unreachable2 OK #3 out of range jump OK #4 out of range jump2 OK #5 test1 ld_imm64 OK ... Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | bpf: verifier (add verifier core)Alexei Starovoitov2014-09-261-1/+1074
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds verifier core which simulates execution of every insn and records the state of registers and program stack. Every branch instruction seen during simulation is pushed into state stack. When verifier reaches BPF_EXIT, it pops the state from the stack and continues until it reaches BPF_EXIT again. For program: 1: bpf_mov r1, xxx 2: if (r1 == 0) goto 5 3: bpf_mov r0, 1 4: goto 6 5: bpf_mov r0, 2 6: bpf_exit The verifier will walk insns: 1, 2, 3, 4, 6 then it will pop the state recorded at insn#2 and will continue: 5, 6 This way it walks all possible paths through the program and checks all possible values of registers. While doing so, it checks for: - invalid instructions - uninitialized register access - uninitialized stack access - misaligned stack access - out of range stack access - invalid calling convention - instruction encoding is not using reserved fields Kernel subsystem configures the verifier with two callbacks: - bool (*is_valid_access)(int off, int size, enum bpf_access_type type); that provides information to the verifer which fields of 'ctx' are accessible (remember 'ctx' is the first argument to eBPF program) - const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id); returns argument constraints of kernel helper functions that eBPF program may call, so that verifier can checks that R1-R5 types match the prototype More details in Documentation/networking/filter.txt and in kernel/bpf/verifier.c Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | bpf: verifier (add branch/goto checks)Alexei Starovoitov2014-09-261-0/+189
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | check that control flow graph of eBPF program is a directed acyclic graph check_cfg() does: - detect loops - detect unreachable instructions - check that program terminates with BPF_EXIT insn - check that all branches are within program boundary Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | bpf: handle pseudo BPF_LD_IMM64 insnAlexei Starovoitov2014-09-261-0/+147
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | eBPF programs passed from userspace are using pseudo BPF_LD_IMM64 instructions to refer to process-local map_fd. Scan the program for such instructions and if FDs are valid, convert them to 'struct bpf_map' pointers which will be used by verifier to check access to maps in bpf_map_lookup/update() calls. If program passes verifier, convert pseudo BPF_LD_IMM64 into generic by dropping BPF_PSEUDO_MAP_FD flag. Note that eBPF interpreter is generic and knows nothing about pseudo insns. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | bpf: verifier (add ability to receive verification log)Alexei Starovoitov2014-09-262-1/+236
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | add optional attributes for BPF_PROG_LOAD syscall: union bpf_attr { struct { ... __u32 log_level; /* verbosity level of eBPF verifier */ __u32 log_size; /* size of user buffer */ __aligned_u64 log_buf; /* user supplied 'char *buffer' */ }; }; when log_level > 0 the verifier will return its verification log in the user supplied buffer 'log_buf' which can be used by program author to analyze why verifier rejected given program. 'Understanding eBPF verifier messages' section of Documentation/networking/filter.txt provides several examples of these messages, like the program: BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), BPF_LD_MAP_FD(BPF_REG_1, 0), BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem), BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0), BPF_EXIT_INSN(), will be rejected with the following multi-line message in log_buf: 0: (7a) *(u64 *)(r10 -8) = 0 1: (bf) r2 = r10 2: (07) r2 += -8 3: (b7) r1 = 0 4: (85) call 1 5: (15) if r0 == 0x0 goto pc+1 R0=map_ptr R10=fp 6: (7a) *(u64 *)(r0 +4) = 0 misaligned access off 4 size 8 The format of the output can change at any time as verifier evolves. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | bpf: verifier (add docs)Alexei Starovoitov2014-09-263-2/+135
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | this patch adds all of eBPF verfier documentation and empty bpf_check() The end goal for the verifier is to statically check safety of the program. Verifier will catch: - loops - out of range jumps - unreachable instructions - invalid instructions - uninitialized register access - uninitialized stack access - misaligned stack access - out of range stack access - invalid calling convention More details in Documentation/networking/filter.txt Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | bpf: handle pseudo BPF_CALL insnAlexei Starovoitov2014-09-261-0/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | in native eBPF programs userspace is using pseudo BPF_CALL instructions which encode one of 'enum bpf_func_id' inside insn->imm field. Verifier checks that program using correct function arguments to given func_id. If all checks passed, kernel needs to fixup BPF_CALL->imm fields by replacing func_id with in-kernel function pointer. eBPF interpreter just calls the function. In-kernel eBPF users continue to use generic BPF_CALL. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | bpf: expand BPF syscall with program load/unloadAlexei Starovoitov2014-09-262-14/+180
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | eBPF programs are similar to kernel modules. They are loaded by the user process and automatically unloaded when process exits. Each eBPF program is a safe run-to-completion set of instructions. eBPF verifier statically determines that the program terminates and is safe to execute. The following syscall wrapper can be used to load the program: int bpf_prog_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns, int insn_cnt, const char *license) { union bpf_attr attr = { .prog_type = prog_type, .insns = ptr_to_u64(insns), .insn_cnt = insn_cnt, .license = ptr_to_u64(license), }; return bpf(BPF_PROG_LOAD, &attr, sizeof(attr)); } where 'insns' is an array of eBPF instructions and 'license' is a string that must be GPL compatible to call helper functions marked gpl_only Upon succesful load the syscall returns prog_fd. Use close(prog_fd) to unload the program. User space tests and examples follow in the later patches Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | bpf: add lookup/update/delete/iterate methods to BPF mapsAlexei Starovoitov2014-09-261-0/+235
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'maps' is a generic storage of different types for sharing data between kernel and userspace. The maps are accessed from user space via BPF syscall, which has commands: - create a map with given type and attributes fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size) returns fd or negative error - lookup key in a given map referenced by fd err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size) using attr->map_fd, attr->key, attr->value returns zero and stores found elem into value or negative error - create or update key/value pair in a given map err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size) using attr->map_fd, attr->key, attr->value returns zero or negative error - find and delete element by key in a given map err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size) using attr->map_fd, attr->key - iterate map elements (based on input key return next_key) err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size) using attr->map_fd, attr->key, attr->next_key - close(fd) deletes the map Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | bpf: enable bpf syscall on x64 and i386Alexei Starovoitov2014-09-261-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | done as separate commit to ease conflict resolution Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | bpf: introduce BPF syscall and mapsAlexei Starovoitov2014-09-262-1/+170
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BPF syscall is a multiplexor for a range of different operations on eBPF. This patch introduces syscall with single command to create a map. Next patch adds commands to access maps. 'maps' is a generic storage of different types for sharing data between kernel and userspace. Userspace example: /* this syscall wrapper creates a map with given type and attributes * and returns map_fd on success. * use close(map_fd) to delete the map */ int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size, int max_entries) { union bpf_attr attr = { .map_type = map_type, .key_size = key_size, .value_size = value_size, .max_entries = max_entries }; return bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); } 'union bpf_attr' is backwards compatible with future extensions. More details in Documentation/networking/filter.txt and in manpage Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-09-241-1/+0Star
| |\ \ \ \ \ \
| * \ \ \ \ \ \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-09-236-42/+72
| |\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: arch/mips/net/bpf_jit.c drivers/net/can/flexcan.c Both the flexcan and MIPS bpf_jit conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | net: bpf: only build bpf_jit_binary_{alloc, free}() when jit selectedDaniel Borkmann2014-09-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since BPF JIT depends on the availability of module_alloc() and module_free() helpers (HAVE_BPF_JIT and MODULES), we better build that code only in case we have BPF_JIT in our config enabled, just like with other JIT code. Fixes builds for arm/marzen_defconfig and sh/rsk7269_defconfig. ==================== kernel/built-in.o: In function `bpf_jit_binary_alloc': /home/cwang/linux/kernel/bpf/core.c:144: undefined reference to `module_alloc' kernel/built-in.o: In function `bpf_jit_binary_free': /home/cwang/linux/kernel/bpf/core.c:164: undefined reference to `module_free' make: *** [vmlinux] Error 1 ==================== Reported-by: Fengguang Wu <fengguang.wu@intel.com> Fixes: 738cbe72adc5 ("net: bpf: consolidate JIT binary allocator") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | net: bpf: consolidate JIT binary allocatorDaniel Borkmann2014-09-101-0/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduced in commit 314beb9bcabf ("x86: bpf_jit_comp: secure bpf jit against spraying attacks") and later on replicated in aa2d2c73c21f ("s390/bpf,jit: address randomize and write protect jit code") for s390 architecture, write protection for BPF JIT images got added and a random start address of the JIT code, so that it's not on a page boundary anymore. Since both use a very similar allocator for the BPF binary header, we can consolidate this code into the BPF core as it's mostly JIT independant anyway. This will also allow for future archs that support DEBUG_SET_MODULE_RONX to just reuse instead of reimplementing it. JIT tested on x86_64 and s390x with BPF test suite. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | net: filter: add "load 64-bit immediate" eBPF instructionAlexei Starovoitov2014-09-091-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register. All previous instructions were 8-byte. This is first 16-byte instruction. Two consecutive 'struct bpf_insn' blocks are interpreted as single instruction: insn[0].code = BPF_LD | BPF_DW | BPF_IMM insn[0].dst_reg = destination register insn[0].imm = lower 32-bit insn[1].code = 0 insn[1].imm = upper 32-bit All unused fields must be zero. Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads 32-bit immediate value into a register. x64 JITs it as single 'movabsq %rax, imm64' arm64 may JIT as sequence of four 'movk x0, #imm16, lsl #shift' insn Note that old eBPF programs are binary compatible with new interpreter. It helps eBPF programs load 64-bit constant into a register with one instruction instead of using two registers and 4 instructions: BPF_MOV32_IMM(R1, imm32) BPF_ALU64_IMM(BPF_LSH, R1, 32) BPF_MOV32_IMM(R2, imm32) BPF_ALU64_REG(BPF_OR, R1, R2) User space generated programs will use this instruction to load constants only. To tell kernel that user space needs a pointer the _pseudo_ variant of this instruction may be added later, which will use extra bits of encoding to indicate what type of pointer user space is asking kernel to provide. For example 'off' or 'src_reg' fields can be used for such purpose. src_reg = 1 could mean that user space is asking kernel to validate and load in-kernel map pointer. src_reg = 2 could mean that user space needs readonly data section pointer src_reg = 3 could mean that user space needs a pointer to per-cpu local data All such future pseudo instructions will not be carrying the actual pointer as part of the instruction, but rather will be treated as a request to kernel to provide one. The kernel will verify the request_for_a_pointer, then will drop _pseudo_ marking and will store actual internal pointer inside the instruction, so the end result is the interpreter and JITs never see pseudo BPF_LD_IMM64 insns and only operate on generic BPF_LD_IMM64 that loads 64-bit immediate into a register. User space never operates on direct pointers and verifier can easily recognize request_for_pointer vs other instructions. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-09-0816-143/+326
| |\ \ \ \ \ \ \ \
| * | | | | | | | | net: bpf: make eBPF interpreter images read-onlyDaniel Borkmann2014-09-052-6/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With eBPF getting more extended and exposure to user space is on it's way, hardening the memory range the interpreter uses to steer its command flow seems appropriate. This patch moves the to be interpreted bytecode to read-only pages. In case we execute a corrupted BPF interpreter image for some reason e.g. caused by an attacker which got past a verifier stage, it would not only provide arbitrary read/write memory access but arbitrary function calls as well. After setting up the BPF interpreter image, its contents do not change until destruction time, thus we can setup the image on immutable made pages in order to mitigate modifications to that code. The idea is derived from commit 314beb9bcabf ("x86: bpf_jit_comp: secure bpf jit against spraying attacks"). This is possible because bpf_prog is not part of sk_filter anymore. After setup bpf_prog cannot be altered during its life-time. This prevents any modifications to the entire bpf_prog structure (incl. function/JIT image pointer). Every eBPF program (including classic BPF that are migrated) have to call bpf_prog_select_runtime() to select either interpreter or a JIT image as a last setup step, and they all are being freed via bpf_prog_free(), including non-JIT. Therefore, we can easily integrate this into the eBPF life-time, plus since we directly allocate a bpf_prog, we have no performance penalty. Tested with seccomp and test_bpf testsuite in JIT/non-JIT mode and manual inspection of kernel_page_tables. Brad Spengler proposed the same idea via Twitter during development of this patch. Joint work with Hannes Frederic Sowa. Suggested-by: Brad Spengler <spender@grsecurity.net> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Kees Cook <keescook@chromium.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | | crash_dump: Make is_kdump_kernel() accessible from modulesAmir Vadai2014-08-261-0/+1
| | |_|_|_|_|_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to make is_kdump_kernel() accessible from modules, need to make elfcorehdr_addr exported. This was rejected in the past [1] because reset_devices was prefered in that context (reseting the device in kdump kernel), but now there are some network drivers that need to reduce memory usage when loaded from a kdump kernel. And in that context, is_kdump_kernel() suits better. [1] - https://lkml.org/lkml/2011/1/27/341 CC: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | | | Merge tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2014-10-081-0/+36
|\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - fix an NFSv4.1 state renewal regression - fix open/lock state recovery error handling - fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails - fix statd when reconnection fails - don't wake tasks during connection abort - don't start reboot recovery if lease check fails - fix duplicate proc entries Features: - pNFS block driver fixes and clean ups from Christoph - More code cleanups from Anna - Improve mmap() writeback performance - Replace use of PF_TRANS with a more generic mechanism for avoiding deadlocks in nfs_release_page" * tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (66 commits) NFSv4.1: Fix an NFSv4.1 state renewal regression NFSv4: fix open/lock state recovery error handling NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails NFS: Fabricate fscache server index key correctly SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT NFSv3: Fix missing includes of nfs3_fs.h NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page() NFS: avoid waiting at all in nfs_release_page when congested. NFS: avoid deadlocks with loop-back mounted NFS filesystems. MM: export page_wakeup functions SCHED: add some "wait..on_bit...timeout()" interfaces. NFS: don't use STABLE writes during writeback. NFSv4: use exponential retry on NFS4ERR_DELAY for async requests. rpc: Add -EPERM processing for xs_udp_send_request() rpc: return sent and err from xs_sendpages() lockd: Try to reconnect if statd has moved SUNRPC: Don't wake tasks during connection abort Fixing lease renewal nfs: fix duplicate proc entries pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripe ...
| * | | | | | | | | SCHED: add some "wait..on_bit...timeout()" interfaces.NeilBrown2014-09-251-0/+36
| | |_|_|_|_|_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit c1221321b7c25b53204447cff9949a6d5a7ddddc sched: Allow wait_on_bit_action() functions to support a timeout I suggested that a "wait_on_bit_timeout()" interface would not meet my need. This isn't true - I was just over-engineering. Including a 'private' field in wait_bit_key instead of a focused "timeout" field was just premature generalization. If some other use is ever found, it can be generalized or added later. So this patch renames "private" to "timeout" with a meaning "stop waiting when "jiffies" reaches or passes "timeout", and adds two of the many possible wait..bit..timeout() interfaces: wait_on_page_bit_killable_timeout(), which is the one I want to use, and out_of_line_wait_on_bit_timeout() which is a reasonably general example. Others can be added as needed. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* | | | | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds2014-10-081-0/+12
|\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull crypto update from Herbert Xu: - add multibuffer infrastructure (single_task_running scheduler helper, OKed by Peter on lkml. - add SHA1 multibuffer implementation for AVX2. - reenable "by8" AVX CTR optimisation after fixing counter overflow. - add APM X-Gene SoC RNG support. - SHA256/SHA512 now handles unaligned input correctly. - set lz4 decompressed length correctly. - fix algif socket buffer allocation failure for 64K page machines. - misc fixes * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (47 commits) crypto: sha - Handle unaligned input data in generic sha256 and sha512. Revert "crypto: aesni - disable "by8" AVX CTR optimization" crypto: aesni - remove unused defines in "by8" variant crypto: aesni - fix counter overflow handling in "by8" variant hwrng: printk replacement crypto: qat - Removed unneeded partial state crypto: qat - Fix typo in name of tasklet_struct crypto: caam - Dynamic allocation of addresses for various memory blocks in CAAM. crypto: mcryptd - Fix typos in CRYPTO_MCRYPTD description crypto: algif - avoid excessive use of socket buffer in skcipher arm64: dts: add random number generator dts node to APM X-Gene platform. Documentation: rng: Add X-Gene SoC RNG driver documentation hwrng: xgene - add support for APM X-Gene SoC RNG support crypto: mv_cesa - Add missing #define crypto: testmgr - add test for lz4 and lz4hc crypto: lz4,lz4hc - fix decompression crypto: qat - Use pci_enable_msix_exact() instead of pci_enable_msix() crypto: drbg - fix maximum value checks on 32 bit systems crypto: drbg - fix sparse warning for cpu_to_be[32|64] crypto: sha-mb - sha1_mb_alg_state can be static ...
| * | | | | | | | | sched: Add function single_task_running to let a task check if it is the ↵Tim Chen2014-08-251-0/+12
| | |/ / / / / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | only task running on a cpu This function will help an async task processing batched jobs from workqueue decide if it wants to keep processing on more chunks of batched work that can be delayed, or to accumulate more work for more efficient batched processing later. If no other tasks are running on the cpu, the batching process can take advantgae of the available cpu cycles to a make decision to continue processing the existing accumulated work to minimize delay, otherwise it will yield. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* | | | | | | | | Merge tag 'arm64-upstream' of ↵Linus Torvalds2014-10-081-1/+1
|\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: - eBPF JIT compiler for arm64 - CPU suspend backend for PSCI (firmware interface) with standard idle states defined in DT (generic idle driver to be merged via a different tree) - Support for CONFIG_DEBUG_SET_MODULE_RONX - Support for unmapped cpu-release-addr (outside kernel linear mapping) - set_arch_dma_coherent_ops() implemented and bus notifiers removed - EFI_STUB improvements when base of DRAM is occupied - Typos in KGDB macros - Clean-up to (partially) allow kernel building with LLVM - Other clean-ups (extern keyword, phys_addr_t usage) * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (51 commits) arm64: Remove unneeded extern keyword ARM64: make of_device_ids const arm64: Use phys_addr_t type for physical address aarch64: filter $x from kallsyms arm64: Use DMA_ERROR_CODE to denote failed allocation arm64: Fix typos in KGDB macros arm64: insn: Add return statements after BUG_ON() arm64: debug: don't re-enable debug exceptions on return from el1_dbg Revert "arm64: dmi: Add SMBIOS/DMI support" arm64: Implement set_arch_dma_coherent_ops() to replace bus notifiers of: amba: use of_dma_configure for AMBA devices arm64: dmi: Add SMBIOS/DMI support arm64: Correct ftrace calls to aarch64_insn_gen_branch_imm() arm64:mm: initialize max_mapnr using function set_max_mapnr setup: Move unmask of async interrupts after possible earlycon setup arm64: LLVMLinux: Fix inline arm64 assembly for use with clang arm64: pageattr: Correctly adjust unaligned start addresses net: bpf: arm64: fix module memory leak when JIT image build fails arm64: add PSCI CPU_SUSPEND based cpu_suspend support arm64: kernel: introduce cpu_init_idle CPU operation ...
| * | | | | | | | | aarch64: filter $x from kallsymsKyle McMartin2014-10-021-1/+1
| | |/ / / / / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to ARM, AArch64 is generating $x and $d syms... which isn't terribly helpful when looking at %pF output and the like. Filter those out in kallsyms, modpost and when looking at module symbols. Seems simplest since none of these check EM_ARM anyway, to just add it to the strchr used, rather than trying to make things overly complicated. initcall_debug improves: dmesg_before.txt: initcall $x+0x0/0x154 [sg] returned 0 after 26331 usecs dmesg_after.txt: initcall init_sg+0x0/0x154 [sg] returned 0 after 15461 usecs Signed-off-by: Kyle McMartin <kyle@redhat.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
* | | | | | | | | Merge branch 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-armLinus Torvalds2014-10-081-1/+1
|\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ARM updates from Russell King: "Included in these updates are: - Performance optimisation to avoid writing the control register at every exception. - Use static inline instead of extern inline in ftrace code. - Crypto ARM assembly updates for big endian - Alignment of initrd/.init memory to page sizes when freeing to ensure that we fully free the regions - Add gcov support - A couple of preparatory patches for VDSO support: use _install_special_mapping, and randomize the sigpage placement above stack. - Add L2 ePAPR DT cache properties so that DT can specify the cache geometry. - Preparatory patch for FIQ (NMI) kernel C code for things like spinlock lockup debug. Following on from this are a couple of my patches cleaning up show_regs() and removing an unused (probably since 1.x days) do_unexp_fiq() function. - Use pr_warn() rather than pr_warning(). - A number of cleanups (smp, footbridge, return_address)" * 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (21 commits) ARM: 8167/1: extend the reserved memory for initrd to be page aligned ARM: 8168/1: extend __init_end to a page align address ARM: 8169/1: l2c: parse cache properties from ePAPR definitions ARM: 8160/1: drop warning about return_address not using unwind tables ARM: 8161/1: footbridge: select machine dir based on ARCH_FOOTBRIDGE ARM: 8158/1: LLVMLinux: use static inline in ARM ftrace.h ARM: 8155/1: place sigpage at a random offset above stack ARM: 8154/1: use _install_special_mapping for sigpage ARM: 8153/1: Enable gcov support on the ARM architecture ARM: Avoid writing to control register on every exception ARM: 8152/1: Convert pr_warning to pr_warn ARM: remove unused do_unexp_fiq() function ARM: remove extraneous newline in show_regs() ARM: 8150/3: fiq: Replace default FIQ handler ARM: 8140/1: ep93xx: Enable DEBUG_LL_UART_PL01X ARM: 8139/1: versatile: Enable DEBUG_LL_UART_PL01X ARM: 8138/1: drop ISAR0 workaround for B15 ARM: 8136/1: sa1100: add Micro ASIC platform device ARM: 8131/1: arm/smp: Absorb boot_secondary() ARM: 8126/1: crypto: enable NEON SHA-384/SHA-512 for big endian ...
| | \ \ \ \ \ \ \ \
| | \ \ \ \ \ \ \ \
| | \ \ \ \ \ \ \ \
| | \ \ \ \ \ \ \ \
| *---. \ \ \ \ \ \ \ \ Merge branches 'fiq' (early part), 'fixes', 'l2c' (early part) and 'misc' ↵Russell King2014-10-025-41/+31Star
| |\ \ \ \ \ \ \ \ \ \ \ | | | | |_|_|_|_|_|/ / / | | | |/| | | | | | | / | | | |_|_|_|_|_|_|_|/ | | |/| | | | | | | | into for-next