summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* sched: fix shares boost logicPeter Zijlstra2008-06-271-0/+3
| | | | | | | | | | | In case the domain is empty, pretend there is a single task on each cpu, so that together with the boost logic we end up giving 1/n shares to each cpu. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: disable source/target_load biasPeter Zijlstra2008-06-272-2/+3
| | | | | | | | | | The bias given by source/target_load functions can be very large, disable it by default to get faster convergence. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: optimize effective_load()Peter Zijlstra2008-06-271-4/+4
| | | | | | | | | | | | | | | | | | | | | | s_i = S * rw_i / \Sum_j rw_j -> \Sum_j rw_j = S * rw_i / s_i -> s'_i = S * (rw_i + w) / (\Sum_j rw_j + w) delta s = s' - s = S * (rw + w) / ((S * rw / s) + w) = s * (S * (rw + w) / (S * rw + s * w) - 1) a = S*(rw+w), b = S*rw + s*w delta s = s * (a-b) / b IOW, trade one divide for two multiplies Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: remove prio preference from balance decisionsPeter Zijlstra2008-06-271-9/+3Star
| | | | | | | | | | Priority looses much of its meaning in a hierarchical context. So don't use it in balance decisions. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix task_h_load()Peter Zijlstra2008-06-271-9/+40
| | | | | | | | | | | | | | | | | | | | | | | | | Currently task_h_load() computes the load of a task and uses that to either subtract it from the total, or add to it. However, removing or adding a task need not have any effect on the total load at all. Imagine adding a task to a group that is local to one cpu - in that case the total load of that cpu is unaffected. So properly compute addition/removal: s_i = S * rw_i / \Sum_j rw_j s'_i = S * (rw_i + wl) / (\Sum_j rw_j + wg) then s'_i - s_i gives the change in load. Where s_i is the shares for cpu i, S the group weight, rw_i the runqueue weight for that cpu, wl the weight we add (subtract) and wg the weight contribution to the runqueue. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix load scaling in group balancingPeter Zijlstra2008-06-271-4/+6
| | | | | | | | | | | | | | doing the load balance will change cfs_rq->load.weight (that's the whole point) but since that's part of the scale factor, we'll scale back with a different amount. Weight getting smaller would result in an inflated moved_load which causes it to stop balancing too soon. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: hierarchical load vs find_busiest_groupPeter Zijlstra2008-06-271-3/+23
| | | | | | | | | | | find_busiest_group() has some assumptions about task weight being in the NICE_0_LOAD range. Hierarchical task groups break this assumption - fix this by replacing it with the average task weight, which will adapt the situation. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: hierarchical load vs affine wakeupsPeter Zijlstra2008-06-271-2/+21
| | | | | | | | | | With hierarchical grouping we can't just compare task weight to rq weight - we need to scale the weight appropriately. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: persistent average load per taskPeter Zijlstra2008-06-271-13/+12Star
| | | | | | | | | | | Remove the fall-back to SCHED_LOAD_SCALE by remembering the previous value of cpu_avg_load_per_task() - this is useful because of the hierarchical group model in which task weight can be much smaller. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix sched_balance_self() smp group balancingPeter Zijlstra2008-06-271-0/+3
| | | | | | | | | Finding the least idle cpu is more accurate when done with updated shares. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix newidle smp group balancingPeter Zijlstra2008-06-271-0/+13
| | | | | | | | | | Re-compute the shares on newidle - so we can make a decision based on recent data. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: simplify the group load balancerPeter Zijlstra2008-06-272-229/+72Star
| | | | | | | | | | | | | | | | | | | | | While thinking about the previous patch - I realized that using per domain aggregate load values in load_balance_fair() is wrong. We should use the load value for that CPU. By not needing per domain hierarchical load values we don't need to store per domain aggregate shares, which greatly simplifies all the math. It basically falls apart in two separate computations: - per domain update of the shares - per CPU update of the hierarchical load Also get rid of the move_group_shares() stuff - just re-compute the shares again after a successful load balance. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: no need to aggregate task_weightPeter Zijlstra2008-06-272-16/+2Star
| | | | | | | | | | We only need to know the task_weight of the busiest rq - nothing to do if there are no tasks there. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: dont micro manage share lossesPeter Zijlstra2008-06-271-23/+3Star
| | | | | | | | | | | We used to try and contain the loss of 'shares' by playing arithmetic games. Replace that by noticing that at the top sched_domain we'll always have the full weight in shares to distribute. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: kill task_group balancingSrivatsa Vaddagiri2008-06-271-13/+2Star
| | | | | | | | | | | | | | | The idea was to balance groups until we've reached the global goal, however Vatsa rightly pointed out that we might never reach that goal this way - hence take out this logic. [ the initial rationale for this 'feature' was to promote max concurrency within a group - it does not however affect fairness ] Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: update aggregate when holding the RQsPeter Zijlstra2008-06-271-0/+20
| | | | | | | | | | | | | | | | It was observed that in __update_group_shares_cpu() rq_weight > aggregate()->rq_weight This is caused by forks/wakeups in between the initial aggregate pass and locking of the RQs for load balance. To avoid this situation partially re-do the aggregation once we have the RQs locked (which avoids new tasks from appearing). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix sched_domain aggregationPeter Zijlstra2008-06-273-66/+60Star
| | | | | | | | | | | Keeping the aggregate on the first cpu of the sched domain has two problems: - it could collide between different sched domains on different cpus - it could slow things down because of the remote accesses Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: add full schedstats to /proc/sched_debugPeter Zijlstra2008-06-271-2/+17
| | | | | | | | | show all the schedstats in /debug/sched_debug as well. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix wakeup granularity and buddy granularityPeter Zijlstra2008-06-272-8/+8
| | | | | | | | | | | | | | Uncouple buddy selection from wakeup granularity. The initial idea was that buddies could run ahead as far as a normal task can - do this by measuring a pair 'slice' just as we do for a normal task. This means we can drop the wakeup_granularity back to 5ms. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: sched_clock_cpu() based cpu_clock()Peter Zijlstra2008-06-272-76/+12Star
| | | | | | | | | | with sched_clock_cpu() being reasonably in sync between cpus (max 1 jiffy difference) use this to provide cpu_clock(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: revert revert of: fair-group: SMP-nice for group schedulingPeter Zijlstra2008-06-275-75/+489
| | | | | | | | | | | | Try again.. Initial commit: 18d95a2832c1392a2d63227a7a6d433cb9f2037e Revert: 6363ca57c76b7b83639ca8c83fc285fa26a7880e Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix calc_delta_asym, #2Peter Zijlstra2008-06-271-3/+6
| | | | | | | | | | | | | | | | | Ok, so why are we in this mess, it was: 1/w but now we mixed that rw in the mix like: rw/w rw being \Sum w suggests: fiddling w, we should also fiddle rw, humm? Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix calc_delta_asym()Peter Zijlstra2008-06-272-1/+28
| | | | | | | | | | | | | | | | | | | calc_delta_asym() is supposed to do the same as calc_delta_fair() except linearly shrink the result for negative nice processes - this causes them to have a smaller preemption threshold so that they are more easily preempted. The problem is that for task groups se->load.weight is the per cpu share of the actual task group weight; take that into account. Also provide a debug switch to disable the asymmetry (which I still don't like - but it does greatly benefit some workloads) This would explain the interactivity issues reported against group scheduling. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: revert the revert of: weight calculationsPeter Zijlstra2008-06-273-39/+76
| | | | | | | | | | | | Try again.. initial commit: 8f1bc385cfbab474db6c27b5af1e439614f3025c revert: f9305d4a0968201b2818dbed0dc8cb0d4ee7aeb3 Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: clean up some unused variablesPeter Zijlstra2008-06-271-2/+0Star
| | | | | | | | | | | | | In file included from /mnt/build/linux-2.6/kernel/sched.c:1496: /mnt/build/linux-2.6/kernel/sched_rt.c: In function '__enable_runtime': /mnt/build/linux-2.6/kernel/sched_rt.c:339: warning: unused variable 'rd' /mnt/build/linux-2.6/kernel/sched_rt.c: In function 'requeue_rt_entity': /mnt/build/linux-2.6/kernel/sched_rt.c:692: warning: unused variable 'queue' Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Merge branch 'linus' into sched/develIngo Molnar2008-06-2548-498/+644
|\ | | | | | | | | | | | | | | Conflicts: kernel/sched_rt.c Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * Linux 2.6.26-rc8Linus Torvalds2008-06-251-1/+1
| |
| * Merge branch 'release' of ↵Linus Torvalds2008-06-253-5/+2Star
| |\ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6: [IA64] Eliminate NULL test after alloc_bootmem in iosapic_alloc_rte() [IA64] Handle count==0 in sn2_ptc_proc_write() [IA64] Fix boot failure on ia64/sn2
| | * [IA64] Eliminate NULL test after alloc_bootmem in iosapic_alloc_rte()Julia Lawall2008-06-241-2/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | As noted by Akinobu Mita alloc_bootmem and related functions never return NULL and always return a zeroed region of memory. Thus a NULL test or memset after calls to these functions is unnecessary. Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Tony Luck <tony.luck@intel.com>
| | * [IA64] Handle count==0 in sn2_ptc_proc_write()Cliff Wickman2008-06-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The fix applied in e0c6d97c65e0784aade7e97b9411f245a6c543e7 "security hole in sn2_ptc_proc_write" didn't take into account the case where count==0 (which results in a buffer underrun when adding the trailing '\0'). Thanks to Andi Kleen for pointing this out. Signed-off-by: Cliff Wickman <cpw@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
| | * [IA64] Fix boot failure on ia64/sn2Jes Sorensen2008-06-241-2/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Call check_sal_cache_flush() after platform_setup() as check_sal_cache_flush() now relies on being able to call platform vector code. Problem was introduced by: 3463a93def55c309f3c0d0a8aaf216be3be42d64 "Update check_sal_cache_flush to use platform_send_ipi()" Signed-off-by: Jes Sorensen <jes@sgi.com> Tested-by: Alex Chiang: <achiang@hp.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixesLinus Torvalds2008-06-252-15/+10Star
| |\ \ | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: [GFS2] fix gfs2 block allocation (cleaned up) [GFS2] BUG: unable to handle kernel paging request at ffff81002690e000
| | * | [GFS2] fix gfs2 block allocation (cleaned up)Benjamin Marzinski2008-06-241-14/+9Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes bz 450641. This patch changes the computation for zero_metapath_length(), which it renames to metapath_branch_start(). When you are extending the metadata tree, The indirect blocks that point to the new data block must either diverge from the existing tree either at the inode, or at the first indirect block. They can diverge at the first indirect block because the inode has room for 483 pointers while the indirect blocks have room for 509 pointers, so when the tree is grown, there is some free space in the first indirect block. What metapath_branch_start() now computes is the height where the first indirect block for the new data block is located. It can either be 1 (if the indirect block diverges from the inode) or 2 (if it diverges from the first indirect block). Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
| | * | [GFS2] BUG: unable to handle kernel paging request at ffff81002690e000Bob Peterson2008-06-241-1/+1
| | |/ | | | | | | | | | | | | | | | | | | | | | This patch fixes bugzilla bug bz448866: gfs2: BUG: unable to handle kernel paging request at ffff81002690e000. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
| * | Merge branch 'kvm-updates-2.6.26' of ↵Linus Torvalds2008-06-2518-266/+358
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm * 'kvm-updates-2.6.26' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: KVM: Remove now unused structs from kvm_para.h x86: KVM guest: Use the paravirt clocksource structs and functions KVM: Make kvm host use the paravirt clocksource structs x86: Make xen use the paravirt clocksource structs and functions x86: Add structs and functions for paravirt clocksource KVM: VMX: Fix host msr corruption with preemption enabled KVM: ioapic: fix lost interrupt when changing a device's irq KVM: MMU: Fix oops on guest userspace access to guest pagetable KVM: MMU: large page update_pte issue with non-PAE 32-bit guests (resend) KVM: MMU: Fix rmap_write_protect() hugepage iteration bug KVM: close timer injection race window in __vcpu_run KVM: Fix race between timer migration and vcpu migration
| | * | KVM: Remove now unused structs from kvm_para.hGerd Hoffmann2008-06-241-18/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kvm_* structs are obsoleted by the pvclock_* ones. Now all users have been switched over and the old structs can be dropped. Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
| | * | x86: KVM guest: Use the paravirt clocksource structs and functionsGerd Hoffmann2008-06-242-56/+34Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch updates the kvm host code to use the pvclock structs and functions, thereby making it compatible with Xen. The patch also fixes an initialization bug: on SMP systems the per-cpu has two different locations early at boot and after CPU bringup. kvmclock must take that in account when registering the physical address within the host. Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
| | * | KVM: Make kvm host use the paravirt clocksource structsGerd Hoffmann2008-06-242-14/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch updates the kvm host code to use the pvclock structs. It also makes the paravirt clock compatible with Xen. Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
| | * | x86: Make xen use the paravirt clocksource structs and functionsGerd Hoffmann2008-06-243-124/+16Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch updates the xen guest to use the pvclock structs and helper functions. Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
| | * | x86: Add structs and functions for paravirt clocksourceGerd Hoffmann2008-06-245-0/+201
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds structs for the paravirt clocksource ABI used by both xen and kvm (pvclock-abi.h). It also adds some helper functions to read system time and wall clock time from a paravirtual clocksource (pvclock.[ch]). They are based on the xen code. They are enabled using CONFIG_PARAVIRT_CLOCK. Subsequent patches of this series will put the code in use. Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
| | * | KVM: VMX: Fix host msr corruption with preemption enabledAvi Kivity2008-06-241-8/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Switching msrs can occur either synchronously as a result of calls to the msr management functions (usually in response to the guest touching virtualized msrs), or asynchronously when preempting a kvm thread that has guest state loaded. If we're unlucky enough to have the two at the same time, host msrs are corrupted and the machine goes kaput on the next syscall. Most easily triggered by Windows Server 2008, as it does a lot of msr switching during bootup. Signed-off-by: Avi Kivity <avi@qumranet.com>
| | * | KVM: ioapic: fix lost interrupt when changing a device's irqAvi Kivity2008-06-241-20/+11Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ioapic acknowledge path translates interrupt vectors to irqs. It currently uses a first match algorithm, stopping when it finds the first redirection table entry containing the vector. That fails however if the guest changes the irq to a different line, leaving the old redirection table entry in place (though masked). Result is interrupts not making it to the guest. Fix by always scanning the entire redirection table. Signed-off-by: Avi Kivity <avi@qumranet.com>
| | * | KVM: MMU: Fix oops on guest userspace access to guest pagetableAvi Kivity2008-06-241-6/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM has a heuristic to unshadow guest pagetables when userspace accesses them, on the assumption that most guests do not allow userspace to access pagetables directly. Unfortunately, in addition to unshadowing the pagetables, it also oopses. This never triggers on ordinary guests since sane OSes will clear the pagetables before assigning them to userspace, which will trigger the flood heuristic, unshadowing the pagetables before the first userspace access. One particular guest, though (Xenner) will run the kernel in userspace, triggering the oops. Since the heuristic is incorrect in this case, we can simply remove it. Signed-off-by: Avi Kivity <avi@qumranet.com>
| | * | KVM: MMU: large page update_pte issue with non-PAE 32-bit guests (resend)Marcelo Tosatti2008-06-241-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_mmu_pte_write() does not handle 32-bit non-PAE large page backed guests properly. It will instantiate two 2MB sptes pointing to the same physical 2MB page when a guest large pte update is trapped. Instead of duplicating code to handle this, disallow directory level updates to happen through kvm_mmu_pte_write(), so the two 2MB sptes emulating one guest 4MB pte can be correctly created by the page fault handling path. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
| | * | KVM: MMU: Fix rmap_write_protect() hugepage iteration bugMarcelo Tosatti2008-06-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rmap_next() does not work correctly after rmap_remove(), as it expects the rmap chains not to change during iteration. Fix (for now) by restarting iteration from the beginning. Signed-off-by: Avi Kivity <avi@qumranet.com>
| | * | KVM: close timer injection race window in __vcpu_runMarcelo Tosatti2008-06-244-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a timer fires after kvm_inject_pending_timer_irqs() but before local_irq_disable() the code will enter guest mode and only inject such timer interrupt the next time an unrelated event causes an exit. It would be simpler if the timer->pending irq conversion could be done with IRQ's disabled, so that the above problem cannot happen. For now introduce a new vcpu requests bit to cancel guest entry. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
| | * | KVM: Fix race between timer migration and vcpu migrationMarcelo Tosatti2008-06-241-12/+3Star
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A guest vcpu instance can be scheduled to a different physical CPU between the test for KVM_REQ_MIGRATE_TIMER and local_irq_disable(). If that happens, the timer will only be migrated to the current pCPU on the next exit, meaning that guest LAPIC timer event can be delayed until a host interrupt is triggered. Fix it by cancelling guest entry if any vcpu request is pending. This has the side effect of nicely consolidating vcpu->requests checks. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/wim/linux-2.6-watchdogLinus Torvalds2008-06-241-1/+0Star
| |\ \ | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/wim/linux-2.6-watchdog: Revert "[WATCHDOG] hpwdt: Add CFLAGS to get driver working"
| | * | Revert "[WATCHDOG] hpwdt: Add CFLAGS to get driver working"Wim Van Sebroeck2008-06-241-1/+0Star
| | |/ | | | | | | | | | | | | | | | | | | | | | After Linus fixed the inline assembly, the CFLAGS option is not needed anymore. Signed-off-by: Thomas Mingarelli <Thomas.Mingarelli@hp.com> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
| * | Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds2008-06-246-77/+27Star
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: xen: remove support for non-PAE 32-bit