summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'pm-for-linus' of ↵Linus Torvalds2012-01-172-8/+8
|\ | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm * 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM / Hibernate: Drop the check of swap space size for compressed image PM / shmobile: fix A3SP suspend method PM / Domains: Skip governor functions for CONFIG_PM_RUNTIME unset PM / Domains: Fix build for CONFIG_PM_SLEEP unset PM: Make sysrq-o be available for CONFIG_PM unset
| * PM / Hibernate: Drop the check of swap space size for compressed imageBarry Song2012-01-141-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | For compressed image, the space required is not known until we finish compressing and writing all pages. This patch drops the check, and if swap space is not enough finally, system can still restore to normal after writing swap fails for compressed images. Signed-off-by: Barry Song <Baohua.Song@csr.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
| * PM: Make sysrq-o be available for CONFIG_PM unsetRafael J. Wysocki2012-01-141-2/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit 1eb208aea3179dd2fc0cdeea45ef869d75b4fe70, "PM: Make CONFIG_PM depend on (CONFIG_PM_SLEEP || CONFIG_PM_RUNTIME)", the files under kernel/power are not built unless CONFIG_PM_SLEEP or CONFIG_PM_RUNTIME is set. In particular, this causes kernel/power/poweroff.c to be omitted, even though it should be compiled, because CONFIG_MAGIC_SYSRQ is set. Fix the problem by causing kernel/power/Makefile to be processed for CONFIG_PM unset too. Reported-and-tested-by: Phil Oester <kernel@linuxace.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
* | error: implicit declaration of function 'module_flags_taint'Kevin Winchester2012-01-161-20/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recent changes to kernel/module.c caused the following compile error: kernel/module.c: In function ‘show_taint’: kernel/module.c:1024:2: error: implicit declaration of function ‘module_flags_taint’ [-Werror=implicit-function-declaration] cc1: some warnings being treated as errors Correct this error by moving the definition of module_flags_taint outside of the #ifdef CONFIG_MODULE_UNLOAD section. Signed-off-by: Kevin Winchester <kjwinchester@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'for-3.3/core' of git://git.kernel.dk/linux-blockLinus Torvalds2012-01-151-3/+5
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits) Revert "block: recursive merge requests" block: Stop using macro stubs for the bio data integrity calls blockdev: convert some macros to static inlines fs: remove unneeded plug in mpage_readpages() block: Add BLKROTATIONAL ioctl block: Introduce blk_set_stacking_limits function block: remove WARN_ON_ONCE() in exit_io_context() block: an exiting task should be allowed to create io_context block: ioc_cgroup_changed() needs to be exported block: recursive merge requests block, cfq: fix empty queue crash caused by request merge block, cfq: move icq creation and rq->elv.icq association to block core block, cfq: restructure io_cq creation path for io_context interface cleanup block, cfq: move io_cq exit/release to blk-ioc.c block, cfq: move icq cache management to block core block, cfq: move io_cq lookup to blk-ioc.c block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq block, cfq: reorganize cfq_io_context into generic and cfq specific parts block: remove elevator_queue->ops block: reorder elevator switch sequence ... Fix up conflicts in: - block/blk-cgroup.c Switch from can_attach_task to can_attach - block/cfq-iosched.c conflict with now removed cic index changes (we now use q->id instead)
| * | block, cfq: unlink cfq_io_context's immediatelyTejun Heo2011-12-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cic is association between io_context and request_queue. A cic is linked from both ioc and q and should be destroyed when either one goes away. As ioc and q both have their own locks, locking becomes a bit complex - both orders work for removal from one but not from the other. Currently, cfq tries to circumvent this locking order issue with RCU. ioc->lock nests inside queue_lock but the radix tree and cic's are also protected by RCU allowing either side to walk their lists without grabbing lock. This rather unconventional use of RCU quickly devolves into extremely fragile convolution. e.g. The following is from cfqd going away too soon after ioc and q exits raced. general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: [ 88.503444] Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs RIP: 0010:[<ffffffff81397628>] [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0 ... Call Trace: [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90 [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20 [<ffffffff81389130>] exit_io_context+0x100/0x140 [<ffffffff81098a29>] do_exit+0x579/0x850 [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0 [<ffffffff81098de7>] sys_exit_group+0x17/0x20 [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b The only real hot path here is cic lookup during request initialization and avoiding extra locking requires very confined use of RCU. This patch makes cic removal from both ioc and request_queue perform double-locking and unlink immediately. * From q side, the change is almost trivial as ioc->lock nests inside queue_lock. It just needs to grab each ioc->lock as it walks cic_list and unlink it. * From ioc side, it's a bit more difficult because of inversed lock order. ioc needs its lock to walk its cic_list but can't grab the matching queue_lock and needs to perform unlock-relock dancing. Unlinking is now wholly done from put_io_context() and fast path is optimized by using the queue_lock the caller already holds, which is by far the most common case. If the ioc accessed multiple devices, it tries with trylock. In unlikely cases of fast path failure, it falls back to full double-locking dance from workqueue. Double-locking isn't the prettiest thing in the world but it's *far* simpler and more understandable than RCU trick without adding any meaningful overhead. This still leaves a lot of now unnecessary RCU logics. Future patches will trim them. -v2: Vivek pointed out that cic->q was being dereferenced after cic->release() was called. Updated to use local variable @this_q instead. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: make ioc get/put interface more conventional and fix race on alloctionTejun Heo2011-12-141-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ignoring copy_io() during fork, io_context can be allocated from two places - current_io_context() and set_task_ioprio(). The former is always called from local task while the latter can be called from different task. The synchornization between them are peculiar and dubious. * current_io_context() doesn't grab task_lock() and assumes that if it saw %NULL ->io_context, it would stay that way until allocation and assignment is complete. It has smp_wmb() between alloc/init and assignment. * set_task_ioprio() grabs task_lock() for assignment and does smp_read_barrier_depends() between "ioc = task->io_context" and "if (ioc)". Unfortunately, this doesn't achieve anything - the latter is not a dependent load of the former. ie, if ioc itself were being dereferenced "ioc->xxx", it would mean something (not sure what tho) but as the code currently stands, the dependent read barrier is noop. As only one of the the two test-assignment sequences is task_lock() protected, the task_lock() can't do much about race between the two. Nothing prevents current_io_context() and set_task_ioprio() allocating its own ioc for the same task and overwriting the other's. Also, set_task_ioprio() can race with exiting task and create a new ioc after exit_io_context() is finished. ioc get/put doesn't have any reason to be complex. The only hot path is accessing the existing ioc of %current, which is simple to achieve given that ->io_context is never destroyed as long as the task is alive. All other paths can happily go through task_lock() like all other task sub structures without impacting anything. This patch updates ioc get/put so that it becomes more conventional. * alloc_io_context() is replaced with get_task_io_context(). This is the only interface which can acquire access to ioc of another task. On return, the caller has an explicit reference to the object which should be put using put_io_context() afterwards. * The functionality of current_io_context() remains the same but when creating a new ioc, it shares the code path with get_task_io_context() and always goes through task_lock(). * get_io_context() now means incrementing ref on an ioc which the caller already has access to (be that an explicit refcnt or implicit %current one). * PF_EXITING inhibits creation of new io_context and once exit_io_context() is finished, it's guaranteed that both ioc acquisition functions return %NULL. * All users are updated. Most are trivial but smp_read_barrier_depends() removal from cfq_get_io_context() needs a bit of explanation. I suppose the original intention was to ensure ioc->ioprio is visible when set_task_ioprio() allocates new io_context and installs it; however, this wouldn't have worked because set_task_ioprio() doesn't have wmb between init and install. There are other problems with this which will be fixed in another patch. * While at it, use NUMA_NO_NODE instead of -1 for wildcard node specification. -v2: Vivek spotted contamination from debug patch. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | Merge branch 'perf-core-for-linus' of ↵Linus Torvalds2012-01-153-345/+683
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits) perf tools: Fix compile error on x86_64 Ubuntu perf report: Fix --stdio output alignment when --showcpuutilization used perf annotate: Get rid of field_sep check perf annotate: Fix usage string perf kmem: Fix a memory leak perf kmem: Add missing closedir() calls perf top: Add error message for EMFILE perf test: Change type of '-v' option to INCR perf script: Add missing closedir() calls tracing: Fix compile error when static ftrace is enabled recordmcount: Fix handling of elf64 big-endian objects. perf tools: Add const.h to MANIFEST to make perf-tar-src-pkg work again perf tools: Add support for guest/host-only profiling perf kvm: Do guest-only counting by default perf top: Don't update total_period on process_sample perf hists: Stop using 'self' for struct hist_entry perf hists: Rename total_session to total_period x86: Add counter when debug stack is used with interrupts enabled x86: Allow NMIs to hit breakpoints in i386 x86: Keep current stack in NMI breakpoints ...
| * \ \ Merge branch 'tip/perf/urgent' of ↵Ingo Molnar2012-01-0812-39/+98
| |\ \ \ | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core
| * \ \ \ Merge branch 'tip/perf/core' of ↵Ingo Molnar2012-01-083-345/+683
| |\ \ \ \ | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core
| | * | | | tracing: Factorize filter creationTejun Heo2011-12-211-141/+142
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are four places where new filter for a given filter string is created, which involves several different steps. This patch factors those steps into create_[system_]filter() functions which in turn make use of create_filter_{start|finish}() for common parts. The only functional change is that if replace_filter_string() is requested and fails, creation fails without any side effect instead of being ignored. Note that system filter is now installed after the processing is complete which makes freeing before and then restoring filter string on error unncessary. -v2: Rebased to resolve conflict with 49aa29513e and updated both create_filter() functions to always set *filterp instead of requiring the caller to clear it to %NULL on entry. Link: http://lkml.kernel.org/r/1323988305-1469-2-git-send-email-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | | | tracing: Have stack tracing set filtered functions at bootSteven Rostedt2011-12-211-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add stacktrace_filter= to the kernel command line that lets the user pick specific functions to check the stack on. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | | | ftrace: Allow access to the boot time function enablingSteven Rostedt2011-12-211-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change set_ftrace_early_filter() to ftrace_set_early_filter() and make it a global function. This will allow other subsystems in the kernel to be able to enable function tracing at start up and reuse the ftrace function parsing code. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | | | tracing: Have stack_tracer use a separate list of functionsSteven Rostedt2011-12-211-1/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The stack_tracer is used to look at every function and check if the current stack is bigger than the last recorded max stack size. When a new max is found, then it saves that stack off. Currently the stack tracer is limited by the global_ops of the function tracer. As the stack tracer has nothing to do with the ftrace function tracer, except that it uses it as its internal engine, the stack tracer should have its own list. A new file is added to the tracing debugfs directory called: stack_trace_filter that can be used to select which functions you want to check the stack on. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | | | ftrace: Decouple hash items from showing filtered functionsSteven Rostedt2011-12-211-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The set_ftrace_filter shows "hashed" functions, which are functions that are added with operations to them (like traceon and traceoff). As other subsystems may be able to show what functions they are using for function tracing, the hash items should no longer be shown just because the FILTER flag is set. As they have nothing to do with other subsystems filters. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | | | ftrace: Allow other users of function tracing to use the output listingSteven Rostedt2011-12-211-16/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function tracer is set up to allow any other subsystem (like perf) to use it. Ftrace already has a way to list what functions are enabled by the global_ops. It would be very helpful to let other users of the function tracer to be able to use the same code. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | | | ftrace: Create ftrace_hash_empty() helper routineSteven Rostedt2011-12-211-11/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two types of hashes in the ftrace_ops; one type is the filter_hash and the other is the notrace_hash. Either one may be null, meaning it has no elements. But when elements are added, the hash is allocated. Throughout the code, a check needs to be made to see if a hash exists or the hash has elements, but the check if the hash exists is usually missing causing the possible "NULL pointer dereference bug". Add a helper routine called "ftrace_hash_empty()" that returns true if the hash doesn't exist or its count is zero. As they mean the same thing. Last-bug-reported-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | | | ftrace: Fix ftrace hash record update with notraceSteven Rostedt2011-12-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When disabling the "notrace" records, that means we want to trace them. If the notrace_hash is zero, it means that we want to trace all records. But to disable a zero notrace_hash means nothing. The check for the notrace_hash count was incorrect with: if (hash && !hash->count) return With the correct comment above it that states that we do nothing if the notrace_hash has zero count. But !hash also means that the notrace hash has zero count. I think this was done to protect against dereferencing NULL. But if !hash is true, then we go through the following loop without doing a single thing. Fix it to: if (!hash || !hash->count) return; Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | | | ftrace: Use bsearch to find record ipSteven Rostedt2011-12-211-15/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that each set of pages in the function list are sorted by ip, we can use bsearch to find a record within each set of pages. This speeds up the ftrace_location() function by magnitudes. For archs (like x86) that need to add a breakpoint at every function that will be converted from a nop to a callback and vice versa, the breakpoint callback needs to know if the breakpoint was for ftrace or not. It requires finding the breakpoint ip within the records. Doing a linear search is extremely inefficient. It is a must to be able to do a fast binary search to find these locations. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | | | ftrace: Sort the mcount records on each pageSteven Rostedt2011-12-211-0/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sort records by ip locations of the ftrace mcount calls on each of the set of pages in the function list. This helps in localizing cache usuage when updating the function locations, as well as gives us the ability to quickly find an ip location in the list. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | | | ftrace: Replace record newlist with record page listSteven Rostedt2011-12-211-31/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As new functions come in to be initalized from mcount to nop, they are done by groups of pages. Whether it is the core kernel or a module. There's no need to keep track of these on a per record basis. At startup, and as any module is loaded, the functions to be traced are stored in a group of pages and added to the function list at the end. We just need to keep a pointer to the first page of the list that was added, and use that to know where to start on the list for initializing functions. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | | | ftrace: Allocate the mcount record pages as groupsSteven Rostedt2011-12-211-51/+128
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allocate the mcount record pages as a group of pages as big as can be allocated and waste no more than a single page. Grouping the mcount pages as much as possible helps with cache locality, as we do not need to redirect with descriptors as we cross from page to page. It also allows us to do more with the records later on (sort them with bigger benefits). Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | | | ftrace: Remove usage of "freed" recordsSteven Rostedt2011-12-211-51/+49Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Records that are added to the function trace table are permanently there, except for modules. By separating out the modules to their own pages that can be freed in one shot we can remove the "freed" flag and simplify some of the record management. Another benefit of doing this is that we can also move the records around; sort them. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | | | ftrace: Allow archs to modify code without stop machineSteven Rostedt2011-12-211-38/+215
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The stop machine method to modify all functions in the kernel (some 20,000 of them) is the safest way to do so across all archs. But some archs may not need this big hammer approach to modify code on SMP machines, and can simply just update the code it needs. Adding a weak function arch_ftrace_update_code() that now does the stop machine, will also let any arch override this method. If the arch needs to check the system and then decide if it can avoid stop machine, it can still call ftrace_run_stop_machine() to use the old method. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| | * | | | ftrace: Fix unregister ftrace_ops accountingJiri Olsa2011-12-211-14/+13Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Multiple users of the function tracer can register their functions with the ftrace_ops structure. The accounting within ftrace will update the counter on each function record that is being traced. When the ftrace_ops filtering adds or removes functions, the function records will be updated accordingly if the ftrace_ops is still registered. When a ftrace_ops is removed, the counter of the function records, that the ftrace_ops traces, are decremented. When they reach zero the functions that they represent are modified to stop calling the mcount code. When changes are made, the code is updated via stop_machine() with a command passed to the function to tell it what to do. There is an ENABLE and DISABLE command that tells the called function to enable or disable the functions. But the ENABLE is really a misnomer as it should just update the records, as records that have been enabled and now have a count of zero should be disabled. The DISABLE command is used to disable all functions regardless of their counter values. This is the big off switch and is not the complement of the ENABLE command. To make matters worse, when a ftrace_ops is unregistered and there is another ftrace_ops registered, neither the DISABLE nor the ENABLE command are set when calling into the stop_machine() function and the records will not be updated to match their counter. A command is passed to that function that will update the mcount code to call the registered callback directly if it is the only one left. This means that the ftrace_ops that is still registered will have its callback called by all functions that have been set for it as well as the ftrace_ops that was just unregistered. Here's a way to trigger this bug. Compile the kernel with CONFIG_FUNCTION_PROFILER set and with CONFIG_FUNCTION_GRAPH not set: CONFIG_FUNCTION_PROFILER=y # CONFIG_FUNCTION_GRAPH is not set This will force the function profiler to use the function tracer instead of the function graph tracer. # cd /sys/kernel/debug/tracing # echo schedule > set_ftrace_filter # echo function > current_tracer # cat set_ftrace_filter schedule # cat trace # tracer: nop # # entries-in-buffer/entries-written: 692/68108025 #P:4 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | kworker/0:2-909 [000] .... 531.235574: schedule <-worker_thread <idle>-0 [001] .N.. 531.235575: schedule <-cpu_idle kworker/0:2-909 [000] .... 531.235597: schedule <-worker_thread sshd-2563 [001] .... 531.235647: schedule <-schedule_hrtimeout_range_clock # echo 1 > function_profile_enabled # echo 0 > function_porfile_enabled # cat set_ftrace_filter schedule # cat trace # tracer: function # # entries-in-buffer/entries-written: 159701/118821262 #P:4 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | <idle>-0 [002] ...1 604.870655: local_touch_nmi <-cpu_idle <idle>-0 [002] d..1 604.870655: enter_idle <-cpu_idle <idle>-0 [002] d..1 604.870656: atomic_notifier_call_chain <-enter_idle <idle>-0 [002] d..1 604.870656: __atomic_notifier_call_chain <-atomic_notifier_call_chain The same problem could have happened with the trace_probe_ops, but they are modified with the set_frace_filter file which does the update at closure of the file. The simple solution is to change ENABLE to UPDATE and call it every time an ftrace_ops is unregistered. Link: http://lkml.kernel.org/r/1323105776-26961-3-git-send-email-jolsa@redhat.com Cc: stable@vger.kernel.org # 3.0+ Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* | | | | | Merge branch 'for-linus' of git://selinuxproject.org/~jmorris/linux-securityLinus Torvalds2012-01-154-40/+60
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://selinuxproject.org/~jmorris/linux-security: capabilities: remove __cap_full_set definition security: remove the security_netlink_recv hook as it is equivalent to capable() ptrace: do not audit capability check when outputing /proc/pid/stat capabilities: remove task_ns_* functions capabitlies: ns_capable can use the cap helpers rather than lsm call capabilities: style only - move capable below ns_capable capabilites: introduce new has_ns_capabilities_noaudit capabilities: call has_ns_capability from has_capability capabilities: remove all _real_ interfaces capabilities: introduce security_capable_noaudit capabilities: reverse arguments to security_capable capabilities: remove the task from capable LSM hook entirely selinux: sparse fix: fix several warnings in the security server cod selinux: sparse fix: fix warnings in netlink code selinux: sparse fix: eliminate warnings for selinuxfs selinux: sparse fix: declare selinux_disable() in security.h selinux: sparse fix: move selinux_complete_init selinux: sparse fix: make selinux_secmark_refcount static SELinux: Fix RCU deref check warning in sel_netport_insert() Manually fix up a semantic mis-merge wrt security_netlink_recv(): - the interface was removed in commit fd7784615248 ("security: remove the security_netlink_recv hook as it is equivalent to capable()") - a new user of it appeared in commit a38f7907b926 ("crypto: Add userspace configuration API") causing no automatic merge conflict, but Eric Paris pointed out the issue.
| * | | | | | security: remove the security_netlink_recv hook as it is equivalent to capable()Eric Paris2012-01-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Once upon a time netlink was not sync and we had to get the effective capabilities from the skb that was being received. Today we instead get the capabilities from the current task. This has rendered the entire purpose of the hook moot as it is now functionally equivalent to the capable() call. Signed-off-by: Eric Paris <eparis@redhat.com>
| * | | | | | ptrace: do not audit capability check when outputing /proc/pid/statEric Paris2012-01-061-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reading /proc/pid/stat of another process checks if one has ptrace permissions on that process. If one does have permissions it outputs some data about the process which might have security and attack implications. If the current task does not have ptrace permissions the read still works, but those fields are filled with inocuous (0) values. Since this check and a subsequent denial is not a violation of the security policy we should not audit such denials. This can be quite useful to removing ptrace broadly across a system without flooding the logs when ps is run or something which harmlessly walks proc. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
| * | | | | | capabilities: remove task_ns_* functionsEric Paris2012-01-063-17/+3Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | task_ in the front of a function, in the security subsystem anyway, means to me at least, that we are operating with that task as the subject of the security decision. In this case what it means is that we are using current as the subject but we use the task to get the right namespace. Who in the world would ever realize that's what task_ns_capability means just by the name? This patch eliminates the task_ns functions entirely and uses the has_ns_capability function instead. This means we explicitly open code the ns in question in the caller. I think it makes the caller a LOT more clear what is going on. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
| * | | | | | capabitlies: ns_capable can use the cap helpers rather than lsm callEric Paris2012-01-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just to reduce the number of places to change if we every change the LSM hook, use the capability helpers internally when possible. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
| * | | | | | capabilities: style only - move capable below ns_capableEric Paris2012-01-061-16/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although the current code is fine for consistency this moves the capable code below the function it calls in the c file. It doesn't actually change code. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
| * | | | | | capabilites: introduce new has_ns_capabilities_noauditEric Paris2012-01-061-5/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For consistency in interfaces, introduce a new interface called has_ns_capabilities_noaudit. It checks if the given task has the given capability in the given namespace. Use this new function by has_capabilities_noaudit. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
| * | | | | | capabilities: call has_ns_capability from has_capabilityEric Paris2012-01-061-16/+10Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Declare the more specific has_ns_capability first in the code and then call it from has_capability. The declaration reversal isn't stricty necessary since they are both declared in header files, but it just makes sense to put more specific functions first in the code. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
| * | | | | | capabilities: remove all _real_ interfacesEric Paris2012-01-061-3/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The name security_real_capable and security_real_capable_noaudit just don't make much sense to me. Convert them to use security_capable and security_capable_noaudit. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
| * | | | | | capabilities: reverse arguments to security_capableEric Paris2012-01-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | security_capable takes ns, cred, cap. But the LSM capable() hook takes cred, ns, cap. The capability helper functions also take cred, ns, cap. Rather than flip argument order just to flip it back, leave them alone. Heck, this should be a little faster since argument will be in the right place! Signed-off-by: Eric Paris <eparis@redhat.com>
* | | | | | | Merge tag 'for-linus' of git://github.com/rustyrussell/linuxLinus Torvalds2012-01-147-115/+146
|\ \ \ \ \ \ \ | |_|_|_|_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Autogenerated GPG tag for Rusty D1ADB8F1: 15EE 8D6C AB0E 7F0C F999 BFCB D920 0E6C D1AD B8F1 * tag 'for-linus' of git://github.com/rustyrussell/linux: module_param: check that bool parameters really are bool. intelfbdrv.c: bailearly is an int module_param paride/pcd: fix bool verbose module parameter. module_param: make bool parameters really bool (drivers & misc) module_param: make bool parameters really bool (arch) module_param: make bool parameters really bool (core code) kernel/async: remove redundant declaration. printk: fix unnecessary module_param_name. lirc_parallel: fix module parameter description. module_param: avoid bool abuse, add bint for special cases. module_param: check type correctness for module_param_array modpost: use linker section to generate table. modpost: use a table rather than a giant if/else statement. modules: sysfs - export: taint, coresize, initsize kernel/params: replace DEBUGP with pr_debug module: replace DEBUGP with pr_debug module: struct module_ref should contains long fields module: Fix performance regression on modules with large symbol tables module: Add comments describing how the "strmap" logic works Fix up conflicts in scripts/mod/file2alias.c due to the new linker- generated table approach to adding __mod_*_device_table entries. The ARM sa11x0 mcp bus needed to be converted to that too.
| * | | | | | module_param: make bool parameters really bool (core code)Rusty Russell2012-01-133-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | module_param(bool) used to counter-intuitively take an int. In fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy trick. It's time to remove the int/unsigned int option. For this version it'll simply give a warning, but it'll break next kernel version. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | | | | kernel/async: remove redundant declaration.Rusty Russell2012-01-131-2/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's in linux/init.h, and I'm about to change it to a bool. Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | | | | printk: fix unnecessary module_param_name.Rusty Russell2012-01-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | You don't need module_param_name if the name is the same! Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | | | | module_param: avoid bool abuse, add bint for special cases.Rusty Russell2012-01-131-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For historical reasons, we allow module_param(bool) to take an int (or an unsigned int). That's going away. A few drivers really want an int: they set it to -1 and a parameter will set it to 0 or 1. This sucks: reading them from sysfs will give 'Y' for both -1 and 1, but if we change it to an int, then the users might be broken (if they did "param" instead of "param=1"). Use a new 'bint' parser for them. (ntfs has a different problem: it needs an int for debug_msgs because it's also exposed via sysctl.) Cc: Steve Glendinning <steve.glendinning@smsc.com> Cc: Jean Delvare <khali@linux-fr.org> Cc: Guenter Roeck <guenter.roeck@ericsson.com> Cc: Hoang-Nam Nguyen <hnguyen@de.ibm.com> Cc: Christoph Raisch <raisch@de.ibm.com> Cc: Roland Dreier <roland@kernel.org> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: linux390@de.ibm.com Cc: Anton Altaparmakov <anton@tuxera.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.de> Cc: lm-sensors@lm-sensors.org Cc: linux-rdma@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: linux-ntfs-dev@lists.sourceforge.net Cc: alsa-devel@alsa-project.org Acked-by: Takashi Iwai <tiwai@suse.de> (For the sound part) Acked-by: Guenter Roeck <guenter.roeck@ericsson.com> (For the hwmon driver) Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | | | | modules: sysfs - export: taint, coresize, initsizeKay Sievers2012-01-131-29/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recent tools do not want to use /proc to retrieve module information. A few values are currently missing from sysfs to replace the information available in /proc/modules. This adds /sys/module/*/{coresize,initsize,taint} attributes. TAINT_PROPRIETARY_MODULE (P) and TAINT_OOT_MODULE (O) flags are both always shown now, and do no longer exclude each other, also in /proc/modules. Replace the open-coded sysfs attribute initializers with the __ATTR() macro. Add the new attributes to Documentation/ABI. Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | | | | kernel/params: replace DEBUGP with pr_debugJim Cromie2012-01-131-10/+4Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use more flexible pr_debug. This allows: echo "module params +p" > /dbg/dynamic_debug/control to turn on debug messages when needed. Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | | | | module: replace DEBUGP with pr_debugJim Cromie2012-01-131-26/+20Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use more flexible pr_debug. This allows: echo "module module +p" > /dbg/dynamic_debug/control to turn on debug messages when needed. Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | | | | module: struct module_ref should contains long fieldsEric Dumazet2012-01-132-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | module_ref contains two "unsigned int" fields. Thats now too small, since some machines can open more than 2^32 files. Check commit 518de9b39e8 (fs: allow for more than 2^31 files) for reference. We can add an aligned(2 * sizeof(unsigned long)) attribute to force alloc_percpu() allocating module_ref areas in single cache lines. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Rusty Russell <rusty@rustcorp.com.au> CC: Tejun Heo <tj@kernel.org> CC: Robin Holt <holt@sgi.com> CC: David Miller <davem@davemloft.net> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| * | | | | | module: Fix performance regression on modules with large symbol tablesKevin Cernekee2012-01-131-44/+21Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Looking at /proc/kallsyms, one starts to ponder whether all of the extra strtab-related complexity in module.c is worth the memory savings. Instead of making the add_kallsyms() loop even more complex, I tried the other route of deleting the strmap logic and naively copying each string into core_strtab with no consideration for consolidating duplicates. Performance on an "already exists" insmod of nvidia.ko (runs add_kallsyms() but does not actually initialize the module): Original scheme: 1.230s With naive copying: 0.058s Extra space used: 35k (of a 408k module). Signed-off-by: Kevin Cernekee <cernekee@gmail.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <73defb5e4bca04a6431392cc341112b1@localhost>
| * | | | | | module: Add comments describing how the "strmap" logic worksKevin Cernekee2012-01-131-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Kevin Cernekee <cernekee@gmail.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* | | | | | | c/r: prctl: add PR_SET_MM codes to set up mm_struct entriesCyrill Gorcunov2012-01-131-0/+121
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we restore a task we need to set up text, data and data heap sizes from userspace to the values a task had at checkpoint time. This patch adds auxilary prctl codes for that. While most of them have a statistical nature (their values are involved into calculation of /proc/<pid>/statm output) the start_brk and brk values are used to compute an allowed size of program data segment expansion. Which means an arbitrary changes of this values might be dangerous operation. So to restrict access the following requirements applied to prctl calls: - The process has to have CAP_SYS_ADMIN capability granted. - For all opcodes except start_brk/brk members an appropriate VMA area must exist and should fit certain VMA flags, such as: - code segment must be executable but not writable; - data segment must not be executable. start_brk/brk values must not intersect with data segment and must not exceed RLIMIT_DATA resource limit. Still the main guard is CAP_SYS_ADMIN capability check. Note the kernel should be compiled with CONFIG_CHECKPOINT_RESTORE support otherwise these prctl calls will return -EINVAL. [akpm@linux-foundation.org: cache current->mm in a local, saving 200 bytes text] Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | | panic: don't print redundant backtraces on oopsAndi Kleen2012-01-131-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an oops causes a panic and panic prints another backtrace it's pretty common to have the original oops data be scrolled away on a 80x50 screen. The second backtrace is quite redundant and not needed anyways. So don't print the panic backtrace when oops_in_progress is true. [akpm@linux-foundation.org: add comment] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | | sysctl: add the kernel.ns_last_pid controlPavel Emelyanov2012-01-132-1/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sysctl works on the current task's pid namespace, getting and setting its last_pid field. Writing is allowed for CAP_SYS_ADMIN-capable tasks thus making it possible to create a task with desired pid value. This ability is required badly for the checkpoint/restore in userspace. This approach suits all the parties for now. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | | kdump: fix crash_kexec()/smp_send_stop() race in panic()Michael Holzheu2012-01-131-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When two CPUs call panic at the same time there is a possible race condition that can stop kdump. The first CPU calls crash_kexec() and the second CPU calls smp_send_stop() in panic() before crash_kexec() finished on the first CPU. So the second CPU stops the first CPU and therefore kdump fails: 1st CPU: panic()->crash_kexec()->mutex_trylock(&kexec_mutex)-> do kdump 2nd CPU: panic()->crash_kexec()->kexec_mutex already held by 1st CPU ->smp_send_stop()-> stop 1st CPU (stop kdump) This patch fixes the problem by introducing a spinlock in panic that allows only one CPU to process crash_kexec() and the subsequent panic code. All other CPUs call the weak function panic_smp_self_stop() that stops the CPU itself. This function can be overloaded by architecture code. For example "tile" can use their lower-power "nap" instruction for that. Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Acked-by: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>