summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* mm: workingset: fix premature shadow node shrinking with cgroupsJohannes Weiner2017-04-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Commit 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware") enabled cgroup-awareness in the shadow node shrinker, but forgot to also enable cgroup-awareness in the list_lru the shadow nodes sit on. Consequently, all shadow nodes are sitting on a global (per-NUMA node) list, while the shrinker applies the limits according to the amount of cache in the cgroup its shrinking. The result is excessive pressure on the shadow nodes from cgroups that have very little cache. Enable memcg-mode on the shadow node LRUs, such that per-cgroup limits are applied to per-cgroup lists. Fixes: 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware") Link: http://lkml.kernel.org/r/20170322005320.8165-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov@tarantool.org> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [4.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: rmap: fix huge file mmap accounting in the memcg statsJohannes Weiner2017-04-012-2/+8
| | | | | | | | | | | | | | | Huge pages are accounted as single units in the memcg's "file_mapped" counter. Account the correct number of base pages, like we do in the corresponding node counter. Link: http://lkml.kernel.org/r/20170322005111.3156-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: move mm_percpu_wq initialization earlierMichal Hocko2017-04-013-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Yang Li has reported that drain_all_pages triggers a WARN_ON which means that this function is called earlier than the mm_percpu_wq is initialized on arm64 with CMA configured: WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c Modules linked in: CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127 Hardware name: Freescale Layerscape 2088A RDB Board (DT) task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000 PC is at drain_all_pages+0x244/0x25c LR is at start_isolate_page_range+0x14c/0x1f0 [...] drain_all_pages+0x244/0x25c start_isolate_page_range+0x14c/0x1f0 alloc_contig_range+0xec/0x354 cma_alloc+0x100/0x1fc dma_alloc_from_contiguous+0x3c/0x44 atomic_pool_init+0x7c/0x208 arm64_dma_init+0x44/0x4c do_one_initcall+0x38/0x128 kernel_init_freeable+0x1a0/0x240 kernel_init+0x10/0xfc ret_from_fork+0x10/0x20 Fix this by moving the whole setup_vmstat which is an initcall right now to init_mm_internals which will be called right after the WQ subsystem is initialized. Link: http://lkml.kernel.org/r/20170315164021.28532-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Yang Li <pku.leo@gmail.com> Tested-by: Yang Li <pku.leo@gmail.com> Tested-by: Xiaolong Ye <xiaolong.ye@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: migrate: fix remove_migration_pte() for ksm pagesNaoya Horiguchi2017-04-011-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I found that calling page migration for ksm pages causes the following bug: page:ffffea0004d51180 count:2 mapcount:2 mapping:ffff88013c785141 index:0x913 flags: 0x57ffffc0040068(uptodate|lru|active|swapbacked) raw: 0057ffffc0040068 ffff88013c785141 0000000000000913 0000000200000001 raw: ffffea0004d5f9e0 ffffea0004d53f60 0000000000000000 ffff88007d81b800 page dumped because: VM_BUG_ON_PAGE(!PageLocked(page)) page->mem_cgroup:ffff88007d81b800 ------------[ cut here ]------------ kernel BUG at /src/linux-dev/mm/rmap.c:1086! invalid opcode: 0000 [#1] SMP Modules linked in: ppdev parport_pc virtio_balloon i2c_piix4 pcspkr parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi ata_piix 8139too libata virtio_blk 8139cp crc32c_intel mii virtio_pci virtio_ring serio_raw virtio floppy dm_mirror dm_region_hash dm_log dm_mod CPU: 0 PID: 3162 Comm: bash Not tainted 4.11.0-rc2-mm1+ #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:do_page_add_anon_rmap+0x1ba/0x260 RSP: 0018:ffffc90002473b30 EFLAGS: 00010282 RAX: 0000000000000021 RBX: ffffea0004d51180 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff88007dc0dfe0 RBP: ffffc90002473b58 R08: 00000000fffffffe R09: 00000000000001c1 R10: 0000000000000005 R11: 00000000000001c0 R12: ffff880139ab3d80 R13: 0000000000000000 R14: 0000700000000200 R15: 0000160000000000 FS: 00007f5195f50740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd450287000 CR3: 000000007a08e000 CR4: 00000000001406f0 Call Trace: page_add_anon_rmap+0x18/0x20 remove_migration_pte+0x220/0x2c0 rmap_walk_ksm+0x143/0x220 rmap_walk+0x55/0x60 remove_migration_ptes+0x53/0x80 migrate_pages+0x8ed/0xb60 soft_offline_page+0x309/0x8d0 store_soft_offline_page+0xaf/0xf0 dev_attr_store+0x18/0x30 sysfs_kf_write+0x3a/0x50 kernfs_fop_write+0xff/0x180 __vfs_write+0x37/0x160 vfs_write+0xb2/0x1b0 SyS_write+0x55/0xc0 do_syscall_64+0x67/0x180 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f51956339e0 RSP: 002b:00007ffcfa0dffc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f51956339e0 RDX: 000000000000000c RSI: 00007f5195f53000 RDI: 0000000000000001 RBP: 00007f5195f53000 R08: 000000000000000a R09: 00007f5195f50740 R10: 000000000000000b R11: 0000000000000246 R12: 00007f5195907400 R13: 000000000000000c R14: 0000000000000001 R15: 0000000000000000 Code: fe ff ff 48 81 c2 00 02 00 00 48 89 55 d8 e8 2e c3 fd ff 48 8b 55 d8 e9 42 ff ff ff 48 c7 c6 e0 52 a1 81 48 89 df e8 46 ad fe ff <0f> 0b 48 83 e8 01 e9 7f fe ff ff 48 83 e8 01 e9 96 fe ff ff 48 RIP: do_page_add_anon_rmap+0x1ba/0x260 RSP: ffffc90002473b30 ---[ end trace a679d00f4af2df48 ]--- Kernel panic - not syncing: Fatal exception Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception The problem is in the following lines: new = page - pvmw.page->index + linear_page_index(vma, pvmw.address); The 'new' is calculated with 'page' which is given by the caller as a destination page and some offset adjustment for thp. But this doesn't properly work for ksm pages because pvmw.page->index doesn't change for each address but linear_page_index() changes, which means that 'new' points to different pages for each addresses backed by the ksm page. As a result, we try to set totally unrelated pages as destination pages, and that causes kernel crash. This patch fixes the miscalculation and makes ksm page migration work fine. Fixes: 3fe87967c536 ("mm: convert remove_migration_pte() to use page_vma_mapped_walk()") Link: http://lkml.kernel.org/r/1489717683-29905-1-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'pci-v4.11-fixes-3' of ↵Linus Torvalds2017-03-314-24/+78
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI fixes from Bjorn Helgaas: - fix iProc memory corruption - fix ThunderX usage of unregistered PNP/ACPI ID - fix ThunderX resource reservation on early firmware * tag 'pci-v4.11-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: PCI: thunder-pem: Add legacy firmware support for Cavium ThunderX host controller PCI: thunder-pem: Use Cavium assigned hardware ID for ThunderX host controller PCI: iproc: Save host bridge window resource in struct iproc_pcie
| * PCI: thunder-pem: Add legacy firmware support for Cavium ThunderX host ↵Tomasz Nowicki2017-03-231-2/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | controller During early days of PCI quirks support, ThunderX firmware did not provide PNP0c02 node with PCI configuration space and PEM-specific register ranges. This means that for legacy FW we are not reserving these resources and cannot gather PEM-specific resources for further PEM initialization. To support already deployed legacy FW, calculate PEM-specific ranges and provide resources reservation as fallback scenario into PEM driver when we could not gather PEM reg base from ACPI tables. Tested-by: Robert Richter <rrichter@cavium.com> Signed-off-by: Tomasz Nowicki <tn@semihalf.com> Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@caviumnetworks.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Robert Richter <rrichter@cavium.com> CC: stable@vger.kernel.org # v4.10+
| * PCI: thunder-pem: Use Cavium assigned hardware ID for ThunderX host controllerTomasz Nowicki2017-03-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | "CAV" is the only PNP/ACPI hardware ID vendor prefix assigned to Cavium so fix this as it should be from day one. Fixes: 44f22bd91e88 ("PCI: Add MCFG quirks for Cavium ThunderX pass2.x host controller") Tested-by: Robert Richter <rrichter@cavium.com> Signed-off-by: Tomasz Nowicki <tn@semihalf.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Robert Richter <rrichter@cavium.com> CC: stable@vger.kernel.org # v4.10+
| * PCI: iproc: Save host bridge window resource in struct iproc_pcieBjorn Helgaas2017-03-093-21/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The host bridge memory window resource is inserted into the iomem_resource tree and cannot be deallocated until the host bridge itself is removed. Previously, the window was on the stack, which meant the iomem_resource entry pointed into the stack and was corrupted as soon as the probe function returned, which caused memory corruption and errors like this: pcie_iproc_bcma bcma0:8: resource collision: [mem 0x40000000-0x47ffffff] conflicts with PCIe MEM space [mem 0x40000000-0x47ffffff] Move the memory window resource from the stack into struct iproc_pcie so its lifetime matches that of the host bridge. Fixes: c3245a566400 ("PCI: iproc: Request host bridge window resources") Reported-and-tested-by: Rafał Miłecki <zajec5@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> CC: stable@vger.kernel.org # v4.8+
* | Merge branch 'for-rc' of ↵Linus Torvalds2017-03-302-19/+34
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux Pull thermal management fixes from Zhang Rui: - Fix a potential deadlock in cpu_cooling driver, which was introduced in 4.11-rc1. (Matthew Wilcox) - Fix the cpu_cooling and devfreq_cooling code to handle possible error return value from OPP calls, together with three minor fixes in the same patch series. (Viresh Kumar) * 'for-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: thermal: cpu_cooling: Check OPP for errors thermal: cpu_cooling: Replace dev_warn with dev_err thermal: devfreq: Check OPP for errors thermal: devfreq_cooling: Replace dev_warn with dev_err thermal: devfreq: Simplify expression thermal: Fix potential deadlock in cpu_cooling
| * | thermal: cpu_cooling: Check OPP for errorsViresh Kumar2017-03-131-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is possible for dev_pm_opp_find_freq_exact() to return errors. It was all fine earlier as dev_pm_opp_get_voltage() had a check within it to check for invalid OPPs, but dev_pm_opp_put() doesn't have any similar checks and the callers need to make sure OPP is valid before calling them. Also update the later dev_warn_ratelimited() to not print the error message as the OPP is guaranteed to be valid now. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
| * | thermal: cpu_cooling: Replace dev_warn with dev_errViresh Kumar2017-03-131-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | There isn't much the user can do on seeing these warnings, as the hardware is actually okay. dev_err suits much better here. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
| * | thermal: devfreq: Check OPP for errorsViresh Kumar2017-03-131-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is possible for dev_pm_opp_find_freq_exact() to return errors. It was all fine earlier as dev_pm_opp_get_voltage() had a check within it to check for invalid OPPs, but dev_pm_opp_put() doesn't have any similar checks and the callers need to make sure OPP is valid before calling them. Also update the later dev_warn_ratelimited() to not print the error message as the OPP is guaranteed to be valid now. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
| * | thermal: devfreq_cooling: Replace dev_warn with dev_errViresh Kumar2017-03-131-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | There isn't much the user can do on seeing this warning, as the hardware is actually okay. dev_err suits much better here. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
| * | thermal: devfreq: Simplify expressionViresh Kumar2017-03-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | There is no need to check for IS_ERR() as we are looking for a very particular error value here. Drop the first check. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
| * | thermal: Fix potential deadlock in cpu_coolingMatthew Wilcox2017-03-131-9/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cooling_list_lock is covering not just cpufreq_dev_count, but also the calls to cpufreq_register_notifier() and cpufreq_unregister_notifier(). Since cooling_list_lock is also used within cpufreq_thermal_notifier(), lockdep reports a potential deadlock. Fix it by testing the condition under cooling_list_lock and dropping the lock before calling cpufreq_register_notifier(). And variable cpufreq_dev_count is removed at the same time, because it's no longer needed after the fix. Fixes: ae606089621e ("thermal: convert cpu_cooling to use an IDA") Reported-and-Tested-by: Russell King <linux@armlinux.org.uk> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com Signed-off-by: Zhang Rui <rui.zhang@intel.com>
* | | Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2017-03-292-36/+107
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "Five fixes for this series: - a fix from me to ensure that blk-mq drivers that terminate IO in their ->queue_rq() handler by returning QUEUE_ERROR don't stall with a scheduler enabled. - four nbd fixes from Josef and Ratna, fixing various problems that are critical enough to go in for this cycle. They have been well tested" * 'for-linus' of git://git.kernel.dk/linux-block: nbd: replace kill_bdev() with __invalidate_device() nbd: set queue timeout properly nbd: set rq->errors to actual error code nbd: handle ERESTARTSYS properly blk-mq: include errors in did_work calculation
| * | | nbd: replace kill_bdev() with __invalidate_device()Ratna Manoj Bolla2017-03-241-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a filesystem is mounted on a nbd device and on a disconnect, because of kill_bdev(), and resetting bdev size to zero, buffer_head mappings are getting destroyed under mounted filesystem. After a bdev size reset(i.e bdev->bd_inode->i_size = 0) on a disconnect, followed by a sys_umount(), generic_shutdown_super()->... ->__sync_blockdev()->... -blkdev_writepages()->... ->do_invalidatepage()->... -discard_buffer() is discarding superblock buffer_head assumed to be in mapped state by ext4_commit_super(). [mlin: ported to 4.11-rc2] Signed-off-by: Ratna Manoj Bolla <manoj.br@gmail.com Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | nbd: set queue timeout properlyJosef Bacik2017-03-241-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can't just set the timeout on the tagset, we have to set it on the queue as it would have been setup already at this point. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | nbd: set rq->errors to actual error codeJosef Bacik2017-03-241-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've been relying on the block layer to assume rq->errors being set translates into -EIO. I noticed in testing that sometimes this isn't true, and really there's not much of a reason to have a counter instead of just using -EIO. So set it properly so we don't leak random numbers to unsuspecting victims. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | nbd: handle ERESTARTSYS properlyJosef Bacik2017-03-241-26/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can submit IO in a processes context, which means there can be pending signals. This isn't a fatal error for NBD, but it does require some finesse. If the signal happens before we transmit anything then we are ok, just requeue the request and carry on. However if we've done a partial transmit we can't allow anything else to be transmitted on this socket until we transmit the remaining part of the request. Deal with this by keeping track of how much we've sent for the current request, and if we get an ERESTARTSYS during any part of our transmission save the state of that request and requeue the IO. If anybody tries to submit a request that isn't our pending request then requeue that request until we are able to service the one that is pending. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | blk-mq: include errors in did_work calculationJens Axboe2017-03-241-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we return true in blk_mq_dispatch_rq_list() if we queued IO successfully, but we really want to return whether or not the we made progress. Progress includes if we got an error return. If we don't, this can lead to a hang in blk_mq_sched_dispatch_requests() when a driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of manually ending the IO in error and return BLK_MQ_QUEUE_OK. Tested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | Merge branch 'apw' (xfrm_user fixes)Linus Torvalds2017-03-291-1/+8
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge xfrm_user validation fixes from Andy Whitcroft: "Two patches we are applying to Ubuntu for XFRM_MSG_NEWAE validation issue reported by ZDI. The first of these is the primary fix, and the second is for a more theoretical issue that Kees pointed out when reviewing the first" * emailed patches from Andy Whitcroft <apw@canonical.com>: xfrm_user: validate XFRM_MSG_NEWAE incoming ESN size harder xfrm_user: validate XFRM_MSG_NEWAE XFRMA_REPLAY_ESN_VAL replay_window
| * | | | xfrm_user: validate XFRM_MSG_NEWAE incoming ESN size harderAndy Whitcroft2017-03-291-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kees Cook has pointed out that xfrm_replay_state_esn_len() is subject to wrapping issues. To ensure we are correctly ensuring that the two ESN structures are the same size compare both the overall size as reported by xfrm_replay_state_esn_len() and the internal length are the same. CVE-2017-7184 Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | xfrm_user: validate XFRM_MSG_NEWAE XFRMA_REPLAY_ESN_VAL replay_windowAndy Whitcroft2017-03-291-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a new xfrm state is created during an XFRM_MSG_NEWSA call we validate the user supplied replay_esn to ensure that the size is valid and to ensure that the replay_window size is within the allocated buffer. However later it is possible to update this replay_esn via a XFRM_MSG_NEWAE call. There we again validate the size of the supplied buffer matches the existing state and if so inject the contents. We do not at this point check that the replay_window is within the allocated memory. This leads to out-of-bounds reads and writes triggered by netlink packets. This leads to memory corruption and the potential for priviledge escalation. We already attempt to validate the incoming replay information in xfrm_new_ae() via xfrm_replay_verify_len(). This confirms that the user is not trying to change the size of the replay state buffer which includes the replay_esn. It however does not check the replay_window remains within that buffer. Add validation of the contained replay_window. CVE-2017-7184 Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | Merge branch 'regset' (PTRACE_SETREGSET data leakage)Linus Torvalds2017-03-295-50/+23Star
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge PTRACE_SETREGSET leakage fixes from Dave Martin: "This series is the collection of fixes I proposed on this topic, that have not yet appeared upstream or in the stable branches, The issue can leak kernel stack, but doesn't appear to allow userspace to attack the kernel directly. The affected architectures are c6x, h8300, metag, mips and sparc. [ Mark Salter points out that c6x has no MMU or other mechanism to prevent userspace access to kernel code or data on c6x, but it doesn't hurt to clean that case up too. ] The bugs arise from use of user_regset_copyin(). Users of user_regset_copyin() can work in one of two ways: 1) Copy directly to thread_struct or equivalent. (This seems to be the design assumption of the regset API, and is the most common approach.) 2) Copy to a local variable and then transfer to thread_struct. (A significant minority of cases.) Buggy code typically involves approach 2" * emailed patches from Dave Martin <Dave.Martin@arm.com>: sparc/ptrace: Preserve previous registers for short regset write mips/ptrace: Preserve previous registers for short regset write metag/ptrace: Reject partial NT_METAG_RPIPE writes metag/ptrace: Provide default TXSTATUS for short NT_PRSTATUS metag/ptrace: Preserve previous registers for short regset write h8300/ptrace: Fix incorrect register transfer count c6x/ptrace: Remove useless PTRACE_SETREGSET implementation
| * | | | | sparc/ptrace: Preserve previous registers for short regset writeDave Martin2017-03-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET to fill all the registers, the thread's old registers are preserved. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | mips/ptrace: Preserve previous registers for short regset writeDave Martin2017-03-291-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET to fill all the registers, the thread's old registers are preserved. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | metag/ptrace: Reject partial NT_METAG_RPIPE writesDave Martin2017-03-291-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's not clear what behaviour is sensible when doing partial write of NT_METAG_RPIPE, so just don't bother. This patch assumes that userspace will never rely on a partial SETREGSET in this case, since it's not clear what should happen anyway. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Acked-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | metag/ptrace: Provide default TXSTATUS for short NT_PRSTATUSDave Martin2017-03-291-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET to fill TXSTATUS, a well-defined default value is used, based on the task's current value. Suggested-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | metag/ptrace: Preserve previous registers for short regset writeDave Martin2017-03-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET to fill all the registers, the thread's old registers are preserved. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Acked-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | h8300/ptrace: Fix incorrect register transfer countDave Martin2017-03-291-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | regs_set() and regs_get() are vulnerable to an off-by-1 buffer overrun if CONFIG_CPU_H8S is set, since this adds an extra entry to register_offset[] but not to user_regs_struct. So, iterate over user_regs_struct based on its actual size, not based on the length of register_offset[]. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | c6x/ptrace: Remove useless PTRACE_SETREGSET implementationDave Martin2017-03-291-41/+0Star
| |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | gpr_set won't work correctly and can never have been tested, and the correct behaviour is not clear due to the endianness-dependent task layout. So, just remove it. The core code will now return -EOPNOTSUPPORT when trying to set NT_PRSTATUS on this architecture until/unless a correct implementation is supplied. Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds2017-03-282-10/+18
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull virtio fixes from Michael Tsirkin: "Fixes to multiple issues in virtio. Most notably a regression fix for crashes reported by Fedora users. Hibernate is still reportedly broken, working on it" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio_balloon: prevent uninitialized variable use virtio-balloon: use actual number of stats for stats queue buffers virtio_balloon: init 1st buffer in stats vq virtio_pci: fix out of bound access for msix_names
| * | | | | virtio_balloon: prevent uninitialized variable useArnd Bergmann2017-03-281-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The latest gcc-7.0.1 snapshot reports a new warning: virtio/virtio_balloon.c: In function 'update_balloon_stats': virtio/virtio_balloon.c:258:26: error: 'events[2]' is used uninitialized in this function [-Werror=uninitialized] virtio/virtio_balloon.c:260:26: error: 'events[3]' is used uninitialized in this function [-Werror=uninitialized] virtio/virtio_balloon.c:261:56: error: 'events[18]' is used uninitialized in this function [-Werror=uninitialized] virtio/virtio_balloon.c:262:56: error: 'events[17]' is used uninitialized in this function [-Werror=uninitialized] This seems absolutely right, so we should add an extra check to prevent copying uninitialized stack data into the statistics. >From all I can tell, this has been broken since the statistics code was originally added in 2.6.34. Fixes: 9564e138b1f6 ("virtio: Add memory statistics reporting to the balloon driver (V4)") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
| * | | | | virtio-balloon: use actual number of stats for stats queue buffersLadi Prosek2017-03-281-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The virtio balloon driver contained a not-so-obvious invariant that update_balloon_stats has to update exactly VIRTIO_BALLOON_S_NR counters in order to send valid stats to the host. This commit fixes it by having update_balloon_stats return the actual number of counters, and its callers use it when pushing buffers to the stats virtqueue. Note that it is still out of spec to change the number of counters at run-time. "Driver MUST supply the same subset of statistics in all buffers submitted to the statsq." Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
| * | | | | virtio_balloon: init 1st buffer in stats vqLadi Prosek2017-03-281-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When init_vqs runs, virtio_balloon.stats is either uninitialized or contains stale values. The host updates its state with garbage data because it has no way of knowing that this is just a marker buffer used for signaling. This patch updates the stats before pushing the initial buffer. Alternative fixes: * Push an empty buffer in init_vqs. Not easily done with the current virtio implementation and violates the spec "Driver MUST supply the same subset of statistics in all buffers submitted to the statsq". * Push a buffer with invalid tags in init_vqs. Violates the same spec clause, plus "invalid tag" is not really defined. Note: the spec says: When using the legacy interface, the device SHOULD ignore all values in the first buffer in the statsq supplied by the driver after device initialization. Note: Historically, drivers supplied an uninitialized buffer in the first buffer. Unfortunately QEMU does not seem to implement the recommendation even for the legacy interface. Cc: stable@vger.kernel.org Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
| * | | | | virtio_pci: fix out of bound access for msix_namesJason Wang2017-03-281-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fedora has received multiple reports of crashes when running 4.11 as a guest https://bugzilla.redhat.com/show_bug.cgi?id=1430297 https://bugzilla.redhat.com/show_bug.cgi?id=1434462 https://bugzilla.kernel.org/show_bug.cgi?id=194911 https://bugzilla.redhat.com/show_bug.cgi?id=1433899 The crashes are not always consistent but they are generally some flavor of oops or GPF in virtio related code. Multiple people have done bisections (Thank you Thorsten Leemhuis and Richard W.M. Jones) and found this commit to be at fault 07ec51480b5eb1233f8c1b0f5d7a7c8d1247c507 is the first bad commit commit 07ec51480b5eb1233f8c1b0f5d7a7c8d1247c507 Author: Christoph Hellwig <hch@lst.de> Date: Sun Feb 5 18:15:19 2017 +0100 virtio_pci: use shared interrupts for virtqueues The issue seems to be an out of bounds access to the msix_names array corrupting kernel memory. Fixes: 07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues") Reported-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Richard W.M. Jones <rjones@redhat.com> Tested-by: Thorsten Leemhuis <linux@leemhuis.info>
* | | | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2017-03-2812-35/+153
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM fixes from Paolo Bonzini: "All x86-specific, apart from some arch-independent syzkaller fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: cleanup the page tracking SRCU instance KVM: nVMX: fix nested EPT detection KVM: pci-assign: do not map smm memory slot pages in vt-d page tables KVM: kvm_io_bus_unregister_dev() should never fail KVM: VMX: Fix enable VPID conditions KVM: nVMX: Fix nested VPID vmx exec control KVM: x86: correct async page present tracepoint kvm: vmx: Flush TLB when the APIC-access address changes KVM: x86: use pic/ioapic destructor when destroy vm KVM: x86: check existance before destroy KVM: x86: clear bus pointer when destroyed KVM: Documentation: document MCE ioctls KVM: nVMX: don't reset kvm mmu twice PTP: fix ptr_ret.cocci warnings kvm: fix usage of uninit spinlock in avic_vm_destroy() KVM: VMX: downgrade warning on unexpected exit code
| * | | | | | KVM: x86: cleanup the page tracking SRCU instancePaolo Bonzini2017-03-283-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SRCU uses a delayed work item. Skip cleaning it up, and the result is use-after-free in the work item callbacks. Reported-by: Dmitry Vyukov <dvyukov@google.com> Suggested-by: Dmitry Vyukov <dvyukov@google.com> Cc: stable@vger.kernel.org Fixes: 0eb05bf290cfe8610d9680b49abef37febd1c38a Reviewed-by: Xiao Guangrong <xiaoguangrong.eric@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | KVM: nVMX: fix nested EPT detectionLadi Prosek2017-03-281-4/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The nested_ept_enabled flag introduced in commit 7ca29de2136 was not computed correctly. We are interested only in L1's EPT state, not the the combined L0+L1 value. In particular, if L0 uses EPT but L1 does not, nested_ept_enabled must be false to make sure that PDPSTRs are loaded based on CR3 as usual, because the special case described in 26.3.2.4 Loading Page-Directory- Pointer-Table Entries does not apply. Fixes: 7ca29de21362 ("KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT") Cc: qemu-stable@nongnu.org Reported-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | KVM: pci-assign: do not map smm memory slot pages in vt-d page tablesHerongguang (Stephen)2017-03-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | or VM memory are not put thus leaked in kvm_iommu_unmap_memslots() when destroy VM. This is consistent with current vfio implementation. Signed-off-by: herongguang <herongguang.he@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | KVM: kvm_io_bus_unregister_dev() should never failDavid Hildenbrand2017-03-233-20/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No caller currently checks the return value of kvm_io_bus_unregister_dev(). This is evil, as all callers silently go on freeing their device. A stale reference will remain in the io_bus, getting at least used again, when the iobus gets teared down on kvm_destroy_vm() - leading to use after free errors. There is nothing the callers could do, except retrying over and over again. So let's simply remove the bus altogether, print an error and make sure no one can access this broken bus again (returning -ENOMEM on any attempt to access it). Fixes: e93f8a0f821e ("KVM: convert io_bus to SRCU") Cc: stable@vger.kernel.org # 3.4+ Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | KVM: VMX: Fix enable VPID conditionsWanpeng Li2017-03-231-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This can be reproduced by running L2 on L1, and disable VPID on L0 if w/o commit "KVM: nVMX: Fix nested VPID vmx exec control", the L2 crash as below: KVM: entry failed, hardware error 0x7 EAX=00000000 EBX=00000000 ECX=00000000 EDX=000306c3 ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000 EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES =0000 00000000 0000ffff 00009300 CS =f000 ffff0000 0000ffff 00009b00 SS =0000 00000000 0000ffff 00009300 DS =0000 00000000 0000ffff 00009300 FS =0000 00000000 0000ffff 00009300 GS =0000 00000000 0000ffff 00009300 LDT=0000 00000000 0000ffff 00008200 TR =0000 00000000 0000ffff 00008b00 GDT= 00000000 0000ffff IDT= 00000000 0000ffff CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000 DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 DR6=00000000ffff0ff0 DR7=0000000000000400 EFER=0000000000000000 Reference SDM 30.3 INVVPID: Protected Mode Exceptions - #UD - If not in VMX operation. - If the logical processor does not support VPIDs (IA32_VMX_PROCBASED_CTLS2[37]=0). - If the logical processor supports VPIDs (IA32_VMX_PROCBASED_CTLS2[37]=1) but does not support the INVVPID instruction (IA32_VMX_EPT_VPID_CAP[32]=0). So we should check both VPID enable bit in vmx exec control and INVVPID support bit in vmx capability MSRs to enable VPID. This patch adds the guarantee to not enable VPID if either INVVPID or single-context/all-context invalidation is not exposed in vmx capability MSRs. Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | KVM: nVMX: Fix nested VPID vmx exec controlWanpeng Li2017-03-231-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This can be reproduced by running kvm-unit-tests/vmx.flat on L0 w/ vpid disabled. Test suite: VPID Unhandled exception 6 #UD at ip 00000000004051a6 error_code=0000 rflags=00010047 cs=00000008 rax=0000000000000000 rcx=0000000000000001 rdx=0000000000000047 rbx=0000000000402f79 rbp=0000000000456240 rsi=0000000000000001 rdi=0000000000000000 r8=000000000000000a r9=00000000000003f8 r10=0000000080010011 r11=0000000000000000 r12=0000000000000003 r13=0000000000000708 r14=0000000000000000 r15=0000000000000000 cr0=0000000080010031 cr2=0000000000000000 cr3=0000000007fff000 cr4=0000000000002020 cr8=0000000000000000 STACK: @4051a6 40523e 400f7f 402059 40028f We should hide and forbid VPID in L1 if it is disabled on L0. However, nested VPID enable bit is set unconditionally during setup nested vmx exec controls though VPID is not exposed through nested VMX capablity. This patch fixes it by don't set nested VPID enable bit if it is disabled on L0. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: stable@vger.kernel.org Fixes: 5c614b3583e (KVM: nVMX: nested VPID emulation) Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | KVM: x86: correct async page present tracepointWanpeng Li2017-03-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After async pf setup successfully, there is a broadcast wakeup w/ special token 0xffffffff which tells vCPU that it should wake up all processes waiting for APFs though there is no real process waiting at the moment. The async page present tracepoint print prematurely and fails to catch the special token setup. This patch fixes it by moving the async page present tracepoint after the special token setup. Before patch: qemu-system-x86-8499 [006] ...1 5973.473292: kvm_async_pf_ready: token 0x0 gva 0x0 After patch: qemu-system-x86-8499 [006] ...1 5973.473292: kvm_async_pf_ready: token 0xffffffff gva 0x0 Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | kvm: vmx: Flush TLB when the APIC-access address changesJim Mattson2017-03-231-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Quoting from the Intel SDM, volume 3, section 28.3.3.4: Guidelines for Use of the INVEPT Instruction: If EPT was in use on a logical processor at one time with EPTP X, it is recommended that software use the INVEPT instruction with the "single-context" INVEPT type and with EPTP X in the INVEPT descriptor before a VM entry on the same logical processor that enables EPT with EPTP X and either (a) the "virtualize APIC accesses" VM-execution control was changed from 0 to 1; or (b) the value of the APIC-access address was changed. In the nested case, the burden falls on L1, unless L0 enables EPT in vmcs02 when L1 doesn't enable EPT in vmcs12. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * | | | | | KVM: x86: use pic/ioapic destructor when destroy vmPeter Xu2017-03-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have specific destructors for pic/ioapic, we'd better use them when destroying the VM as well. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * | | | | | KVM: x86: check existance before destroyPeter Xu2017-03-232-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mostly used for split irqchip mode. In that case, these two things are not inited at all, so no need to release. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * | | | | | KVM: x86: clear bus pointer when destroyedPeter Xu2017-03-211-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When releasing the bus, let's clear the bus pointers to mark it out. If any further device unregister happens on this bus, we know that we're done if we found the bus being released already. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * | | | | | KVM: Documentation: document MCE ioctlsLuiz Capitulino2017-03-201-0/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>