summaryrefslogtreecommitdiffstats
path: root/mm/vmalloc.c
Commit message (Collapse)AuthorAgeFilesLines
* mm/vmalloc.c: fix percpu free VM area search criteriaKuppuswamy Sathyanarayanan2019-08-141-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recent changes to the vmalloc code by commit 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation") can cause spurious percpu allocation failures. These, in turn, can result in panic()s in the slub code. One such possible panic was reported by Dave Hansen in following link https://lkml.org/lkml/2019/6/19/939. Another related panic observed is, RIP: 0033:0x7f46f7441b9b Call Trace: dump_stack+0x61/0x80 pcpu_alloc.cold.30+0x22/0x4f mem_cgroup_css_alloc+0x110/0x650 cgroup_apply_control_enable+0x133/0x330 cgroup_mkdir+0x41b/0x500 kernfs_iop_mkdir+0x5a/0x90 vfs_mkdir+0x102/0x1b0 do_mkdirat+0x7d/0xf0 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly uses two lists (vmap_area_list & free_vmap_area_list) to track the used and free VM areas in VMALLOC space. And pcpu_get_vm_areas(offsets[], sizes[], nr_vms, align) function is used for allocating congruent VM areas for percpu memory allocator. In order to not conflict with VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the VMALLOC space. So the search for free vm_area for the given requirement starts near VMALLOC_END and moves upwards towards VMALLOC_START. Prior to commit 68ad4a330433, the search for free vm_area in pcpu_get_vm_areas() involves following two main steps. Step 1: Find a aligned "base" adress near VMALLOC_END. va = free vm area near VMALLOC_END Step 2: Loop through number of requested vm_areas and check, Step 2.1: if (base < VMALLOC_START) 1. fail with error Step 2.2: // end is offsets[area] + sizes[area] if (base + end > va->vm_end) 1. Move the base downwards and repeat Step 2 Step 2.3: if (base + start < va->vm_start) 1. Move to previous free vm_area node, find aligned base address and repeat Step 2 But Commit 68ad4a330433 removed Step 2.2 and modified Step 2.3 as below: Step 2.3: if (base + start < va->vm_start || base + end > va->vm_end) 1. Move to previous free vm_area node, find aligned base address and repeat Step 2 Above change is the root cause of spurious percpu memory allocation failures. For example, consider a case where a relatively large vm_area (~ 30 TB) was ignored in free vm_area search because it did not pass the base + end < vm->vm_end boundary check. Ignoring such large free vm_area's would lead to not finding free vm_area within boundary of VMALLOC_start to VMALLOC_END which in turn leads to allocation failures. So modify the search algorithm to include Step 2.2. Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation") Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: sathyanarayanan kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()Joerg Roedel2019-07-221-0/+9
| | | | | | | | | | | | | | | | | | | | | | | On x86-32 with PTI enabled, parts of the kernel page-tables are not shared between processes. This can cause mappings in the vmalloc/ioremap area to persist in some page-tables after the region is unmapped and released. When the region is re-used the processes with the old mappings do not fault in the new mappings but still access the old ones. This causes undefined behavior, in reality often data corruption, kernel oopses and panics and even spontaneous reboots. Fix this problem by activly syncing unmaps in the vmalloc/ioremap area to all page-tables in the system before the regions can be re-used. References: https://bugzilla.suse.com/show_bug.cgi?id=1118689 Fixes: 5d72b4fba40ef ('x86, mm: support huge I/O mapping capability I/F') Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20190719184652.11391-4-joro@8bytes.org
* mm: vmalloc: show number of vmalloc pages in /proc/meminfoRoman Gushchin2019-07-121-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | Vmalloc() is getting more and more used these days (kernel stacks, bpf and percpu allocator are new top users), and the total % of memory consumed by vmalloc() can be pretty significant and changes dynamically. /proc/meminfo is the best place to display this information: its top goal is to show top consumers of the memory. Since the VmallocUsed field in /proc/meminfo is not in use for quite a long time (it has been defined to 0 by a5ad88ce8c7f ("mm: get rid of 'vmalloc_info' from /proc/meminfo")), let's reuse it for showing the actual physical memory consumption of vmalloc(). Link: http://lkml.kernel.org/r/20190417194002.12369-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: spelling> s/informaion/information/Geert Uytterhoeven2019-07-121-2/+2
| | | | | | | | | Link: http://lkml.kernel.org/r/20190607113509.15032-1-geert+renesas@glider.be Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va()Uladzislau Rezki (Sony)2019-07-121-15/+10Star
| | | | | | | | | | | | | | | | | Trigger a warning if an object that is about to be freed is detached. We used to have a BUG_ON(), but even though it is considered as faulty behaviour that is not a good reason to break a system. Link: http://lkml.kernel.org/r/20190606120411.8298-5-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: get rid of one single unlink_va() when mergeUladzislau Rezki (Sony)2019-07-121-6/+2Star
| | | | | | | | | | | | | | | | | | | | It does not make sense to try to "unlink" the node that is definitely not linked with a list nor tree. On the first merge step VA just points to the previously disconnected busy area. On the second step, check if the node has been merged and do "unlink" if so, because now it points to an object that must be linked. Link: http://lkml.kernel.org/r/20190606120411.8298-4-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Hillf Danton <hdanton@sina.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: preload a CPU with one object for split purposeUladzislau Rezki (Sony)2019-07-121-4/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Refactor the NE_FIT_TYPE split case when it comes to an allocation of one extra object. We need it in order to build a remaining space. The preload is done per CPU in non-atomic context with GFP_KERNEL flags. More permissive parameters can be beneficial for systems which are suffer from high memory pressure or low memory condition. For example on my KVM system(4xCPUs, no swap, 256MB RAM) i can simulate the failure of page allocation with GFP_NOWAIT flags. Using "stress-ng" tool and starting N workers spinning on fork() and exit(), i can trigger below trace: <snip> [ 179.815161] stress-ng-fork: page allocation failure: order:0, mode:0x40800(GFP_NOWAIT|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0 [ 179.815168] CPU: 0 PID: 12612 Comm: stress-ng-fork Not tainted 5.2.0-rc3+ #1003 [ 179.815170] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [ 179.815171] Call Trace: [ 179.815178] dump_stack+0x5c/0x7b [ 179.815182] warn_alloc+0x108/0x190 [ 179.815187] __alloc_pages_slowpath+0xdc7/0xdf0 [ 179.815191] __alloc_pages_nodemask+0x2de/0x330 [ 179.815194] cache_grow_begin+0x77/0x420 [ 179.815197] fallback_alloc+0x161/0x200 [ 179.815200] kmem_cache_alloc+0x1c9/0x570 [ 179.815202] alloc_vmap_area+0x32c/0x990 [ 179.815206] __get_vm_area_node+0xb0/0x170 [ 179.815208] __vmalloc_node_range+0x6d/0x230 [ 179.815211] ? _do_fork+0xce/0x3d0 [ 179.815213] copy_process.part.46+0x850/0x1b90 [ 179.815215] ? _do_fork+0xce/0x3d0 [ 179.815219] _do_fork+0xce/0x3d0 [ 179.815226] ? __do_page_fault+0x2bf/0x4e0 [ 179.815229] do_syscall_64+0x55/0x130 [ 179.815231] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 179.815234] RIP: 0033:0x7fedec4c738b ... [ 179.815237] RSP: 002b:00007ffda469d730 EFLAGS: 00000246 ORIG_RAX: 0000000000000038 [ 179.815239] RAX: ffffffffffffffda RBX: 00007ffda469d730 RCX: 00007fedec4c738b [ 179.815240] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011 [ 179.815241] RBP: 00007ffda469d780 R08: 00007fededd6e300 R09: 00007ffda47f50a0 [ 179.815242] R10: 00007fededd6e5d0 R11: 0000000000000246 R12: 0000000000000000 [ 179.815243] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000000000000 [ 179.815245] Mem-Info: [ 179.815249] active_anon:12686 inactive_anon:14760 isolated_anon:0 active_file:502 inactive_file:61 isolated_file:70 unevictable:2 dirty:0 writeback:0 unstable:0 slab_reclaimable:2380 slab_unreclaimable:7520 mapped:15069 shmem:14813 pagetables:10833 bounce:0 free:1922 free_pcp:229 free_cma:0 <snip> Link: http://lkml.kernel.org/r/20190606120411.8298-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Roman Gushchin <guro@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: remove "node" argumentUladzislau Rezki (Sony)2019-07-121-2/+2
| | | | | | | | | | | | | | | | | | | | Patch series "Some cleanups for the KVA/vmalloc", v5. This patch (of 4): Remove unused argument from the __alloc_vmap_area() function. Link: http://lkml.kernel.org/r/20190606120411.8298-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/pgtable: drop pgtable_t variable from pte_fn_t functionsAnshuman Khandual2019-07-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Drop the pgtable_t variable from all implementation for pte_fn_t as none of them use it. apply_to_pte_range() should stop computing it as well. Should help us save some cycles. Link: http://lkml.kernel.org/r/1556803126-26596-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Matthew Wilcox <willy@infradead.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: <jglisse@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'arm64-upstream' of ↵Linus Torvalds2019-07-081-11/+0Star
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: - arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP} - Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to manage the permissions of executable vmalloc regions more strictly - Slight performance improvement by keeping softirqs enabled while touching the FPSIMD/SVE state (kernel_neon_begin/end) - Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new XAFLAG and AXFLAG instructions for floating point comparison flags manipulation) and FRINT (rounding floating point numbers to integers) - Re-instate ARM64_PSEUDO_NMI support which was previously marked as BROKEN due to some bugs (now fixed) - Improve parking of stopped CPUs and implement an arm64-specific panic_smp_self_stop() to avoid warning on not being able to stop secondary CPUs during panic - perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI platforms - perf: DDR performance monitor support for iMX8QXP - cache_line_size() can now be set from DT or ACPI/PPTT if provided to cope with a system cache info not exposed via the CPUID registers - Avoid warning on hardware cache line size greater than ARCH_DMA_MINALIGN if the system is fully coherent - arm64 do_page_fault() and hugetlb cleanups - Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep) - Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the 'arm_boot_flags' introduced in 5.1) - CONFIG_RANDOMIZE_BASE now enabled in defconfig - Allow the selection of ARM64_MODULE_PLTS, currently only done via RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill over into the vmalloc area - Make ZONE_DMA32 configurable * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (54 commits) perf: arm_spe: Enable ACPI/Platform automatic module loading arm_pmu: acpi: spe: Add initial MADT/SPE probing ACPI/PPTT: Add function to return ACPI 6.3 Identical tokens ACPI/PPTT: Modify node flag detection to find last IDENTICAL x86/entry: Simplify _TIF_SYSCALL_EMU handling arm64: rename dump_instr as dump_kernel_instr arm64/mm: Drop [PTE|PMD]_TYPE_FAULT arm64: Implement panic_smp_self_stop() arm64: Improve parking of stopped CPUs arm64: Expose FRINT capabilities to userspace arm64: Expose ARMv8.5 CondM capability to userspace arm64: defconfig: enable CONFIG_RANDOMIZE_BASE arm64: ARM64_MODULES_PLTS must depend on MODULES arm64: bpf: do not allocate executable memory arm64/kprobes: set VM_FLUSH_RESET_PERMS on kprobe instruction pages arm64/mm: wire up CONFIG_ARCH_HAS_SET_DIRECT_MAP arm64: module: create module allocations without exec permissions arm64: Allow user selection of ARM64_MODULE_PLTS acpi/arm64: ignore 5.1 FADTs that are reported as 5.0 arm64: Allow selecting Pseudo-NMI again ...
| * arm64/mm: wire up CONFIG_ARCH_HAS_SET_DIRECT_MAPArd Biesheuvel2019-06-241-11/+0Star
| | | | | | | | | | | | | | | | | | Wire up the special helper functions to manipulate aliases of vmalloc regions in the linear map. Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
* | mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warningArnd Bergmann2019-06-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | gcc gets confused in pcpu_get_vm_areas() because there are too many branches that affect whether 'lva' was initialized before it gets used: mm/vmalloc.c: In function 'pcpu_get_vm_areas': mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized] insert_vmap_area_augment(lva, &va->rb_node, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ &free_vmap_area_root, &free_vmap_area_list); ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ mm/vmalloc.c:916:20: note: 'lva' was declared here struct vmap_area *lva; ^~~ Add an intialization to NULL, and check whether this has changed before the first use. [akpm@linux-foundation.org: tweak comments] Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm/vmalloc: Avoid rare case of flushing TLB with weird argumentsRick Edgecombe2019-06-031-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a rare case, flush_tlb_kernel_range() could be called with a start higher than the end. In vm_remove_mappings(), in case page_address() returns 0 for all pages (for example they were all in highmem), _vm_unmap_aliases() will be called with start = ULONG_MAX, end = 0 and flush = 1. If at the same time, the vmalloc purge operation is triggered by something else while the current operation is between remove_vm_area() and _vm_unmap_aliases(), then the vm mapping just removed will be already purged. In this case the call of vm_unmap_aliases() may not find any other mappings to flush and so end up flushing start = ULONG_MAX, end = 0. So only set flush = true if we find something in the direct mapping that we need to flush, and this way this can't happen. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Meelis Roos <mroos@linux.ee> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions") Link: https://lkml.kernel.org/r/20190527211058.2729-3-rick.p.edgecombe@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | mm/vmalloc: Fix calculation of direct map addr rangeRick Edgecombe2019-06-031-5/+5
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The calculation of the direct map address range to flush was wrong. This could cause the RO direct map alias to not get flushed. Today this shouldn't be a problem because this flush is only needed on x86 right now and the spurious fault handler will fix cached RO->RW translations. In the future though, it could cause the permissions to remain RO in the TLB for the direct map alias, and then the page would return from the page allocator to some other component as RO and cause a crash. So fix fix the address range calculation so the flush will include the direct map range. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Meelis Roos <mroos@linux.ee> Cc: Nadav Amit <namit@vmware.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions") Link: https://lkml.kernel.org/r/20190527211058.2729-2-rick.p.edgecombe@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* mm/vmalloc.c: fix typo in commentAndrew Morton2019-06-021-1/+1
| | | | | | Reported-by: Nicholas Joll <najoll@posteo.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* treewide: Add SPDX license identifier for missed filesThomas Gleixner2019-05-211-0/+1
| | | | | | | | | | | | | | | | | Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macroUladzislau Rezki (Sony)2019-05-191-0/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | This macro adds some debug code to check that vmap allocations are happened in ascending order. By default this option is set to 0 and not active. It requires recompilation of the kernel to activate it. Set to 1, compile the kernel. [urezki@gmail.com: v4] Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joel Fernandes <joelaf@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macroUladzislau Rezki (Sony)2019-05-191-0/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | This macro adds some debug code to check that the augment tree is maintained correctly, meaning that every node contains valid subtree_max_size value. By default this option is set to 0 and not active. It requires recompilation of the kernel to activate it. Set to 1, compile the kernel. [urezki@gmail.com: v4] Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joel Fernandes <joelaf@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: keep track of free blocks for vmap allocationUladzislau Rezki (Sony)2019-05-191-246/+758
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "improve vmap allocation", v3. Objective --------- Please have a look for the description at: https://lkml.org/lkml/2018/10/19/786 but let me also summarize it a bit here as well. The current implementation has O(N) complexity. Requests with different permissive parameters can lead to long allocation time. When i say "long" i mean milliseconds. Description ----------- This approach organizes the KVA memory layout into free areas of the 1-ULONG_MAX range, i.e. an allocation is done over free areas lookups, instead of finding a hole between two busy blocks. It allows to have lower number of objects which represent the free space, therefore to have less fragmented memory allocator. Because free blocks are always as large as possible. It uses the augment tree where all free areas are sorted in ascending order of va->va_start address in pair with linked list that provides O(1) access to prev/next elements. Since the tree is augment, we also maintain the "subtree_max_size" of VA that reflects a maximum available free block in its left or right sub-tree. Knowing that, we can easily traversal toward the lowest (left most path) free area. Allocation: ~O(log(N)) complexity. It is sequential allocation method therefore tends to maximize locality. The search is done until a first suitable block is large enough to encompass the requested parameters. Bigger areas are split. I copy paste here the description of how the area is split, since i described it in https://lkml.org/lkml/2018/10/19/786 <snip> A free block can be split by three different ways. Their names are FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they correspond to how requested size and alignment fit to a free block. FL_FIT_TYPE - in this case a free block is just removed from the free list/tree because it fully fits. Comparing with current design there is an extra work with rb-tree updating. LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit. In this case what we do is just cutting a free block. It is as fast as a current design. Most of the vmalloc allocations just end up with this case, because the edge is always aligned to 1. NE_FIT_TYPE - Is much less common case. Basically it happens when requested size and alignment does not fit left nor right edges, i.e. it is between them. In this case during splitting we have to build a remaining left free area and place it back to the free list/tree. Comparing with current design there are two extra steps. First one is we have to allocate a new vmap_area structure. Second one we have to insert that remaining free block to the address sorted list/tree. In order to optimize a first case there is a cache with free_vmap objects. Instead of allocating from slab we just take an object from the cache and reuse it. Second one is pretty optimized. Since we know a start point in the tree we do not do a search from the top. Instead a traversal begins from a rb-tree node we split. <snip> De-allocation. ~O(log(N)) complexity. An area is not inserted straight away to the tree/list, instead we identify the spot first, checking if it can be merged around neighbors. The list provides O(1) access to prev/next, so it is pretty fast to check it. Summarizing. If merged then large coalesced areas are created, if not the area is just linked making more fragments. There is one more thing that i should mention here. After modification of VA node, its subtree_max_size is updated if it was/is the biggest area in its left or right sub-tree. Apart of that it can also be populated back to upper levels to fix the tree. For more details please have a look at the __augment_tree_propagate_from() function and the description. Tests and stressing ------------------- I use the "test_vmalloc.sh" test driver available under "tools/testing/selftests/vm/" since 5.1-rc1 kernel. Just trigger "sudo ./test_vmalloc.sh" to find out how to deal with it. Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA. Regarding last one, i do not have any physical access to NUMA system, therefore i emulated it. The time of stressing is days. If you run the test driver in "stress mode", you also need the patch that is in Andrew's tree but not in Linux 5.1-rc1. So, please apply it: http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c After massive testing, i have not identified any problems like memory leaks, crashes or kernel panics. I find it stable, but more testing would be good. Performance analysis -------------------- I have used two systems to test. One is i5-3320M CPU @ 2.60GHz and another is HiKey960(arm64) board. i5-3320M runs on 4.20 kernel, whereas Hikey960 uses 4.15 kernel. I have both system which could run on 5.1-rc1 as well, but the results have not been ready by time i an writing this. Currently it consist of 8 tests. There are three of them which correspond to different types of splitting(to compare with default). We have 3 ones(see above). Another 5 do allocations in different conditions. a) sudo ./test_vmalloc.sh performance When the test driver is run in "performance" mode, it runs all available tests pinned to first online CPU with sequential execution test order. We do it in order to get stable and repeatable results. Take a look at time difference in "long_busy_list_alloc_test". It is not surprising because the worst case is O(N). # i5-3320M How many cycles all tests took: CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles # See detailed table with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt # Hikey960 8x CPUs How many cycles all tests took: CPU0=3478683207 cycles vs CPU0=463767978 cycles # See detailed table with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt b) time sudo ./test_vmalloc.sh test_repeat_count=1 With this configuration, all tests are run on all available online CPUs. Before running each CPU shuffles its tests execution order. It gives random allocation behaviour. So it is rough comparison, but it puts in the picture for sure. # i5-3320M <default> vs <patched> real 101m22.813s real 0m56.805s user 0m0.011s user 0m0.015s sys 0m5.076s sys 0m0.023s # See detailed table with results here: ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt # Hikey960 8x CPUs <default> vs <patched> real unknown real 4m25.214s user unknown user 0m0.011s sys unknown sys 0m0.670s I did not manage to complete this test on "default Hikey960" kernel version. After 24 hours it was still running, therefore i had to cancel it. That is why real/user/sys are "unknown". This patch (of 3): Currently an allocation of the new vmap area is done over busy list iteration(complexity O(n)) until a suitable hole is found between two busy areas. Therefore each new allocation causes the list being grown. Due to over fragmented list and different permissive parameters an allocation can take a long time. For example on embedded devices it is milliseconds. This patch organizes the KVA memory layout into free areas of the 1-ULONG_MAX range. It uses an augment red-black tree that keeps blocks sorted by their offsets in pair with linked list keeping the free space in order of increasing addresses. Nodes are augmented with the size of the maximum available free block in its left or right sub-tree. Thus, that allows to take a decision and traversal toward the block that will fit and will have the lowest start address, i.e. it is sequential allocation. Allocation: to allocate a new block a search is done over the tree until a suitable lowest(left most) block is large enough to encompass: the requested size, alignment and vstart point. If the block is bigger than requested size - it is split. De-allocation: when a busy vmap area is freed it can either be merged or inserted to the tree. Red-black tree allows efficiently find a spot whereas a linked list provides a constant-time access to previous and next blocks to check if merging can be done. In case of merging of de-allocated memory chunk a large coalesced area is created. Complexity: ~O(log(N)) [urezki@gmail.com: v3] Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com [urezki@gmail.com: v4] Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: convert vmap_lazy_nr to atomic_long_tUladzislau Rezki (Sony)2019-05-151-10/+10
| | | | | | | | | | | | | | | | | | | vmap_lazy_nr variable has atomic_t type that is 4 bytes integer value on both 32 and 64 bit systems. lazy_max_pages() deals with "unsigned long" that is 8 bytes on 64 bit system, thus vmap_lazy_nr should be 8 bytes on 64 bit as well. Link: http://lkml.kernel.org/r/20190131162452.25879-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Joel Fernandes <joelaf@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: add priority threshold to __purge_vmap_area_lazy()Uladzislau Rezki (Sony)2019-05-151-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 763b218ddfaf ("mm: add preempt points into __purge_vmap_area_lazy()") introduced some preempt points, one of those is making an allocation more prioritized over lazy free of vmap areas. Prioritizing an allocation over freeing does not work well all the time, i.e. it should be rather a compromise. 1) Number of lazy pages directly influences the busy list length thus on operations like: allocation, lookup, unmap, remove, etc. 2) Under heavy stress of vmalloc subsystem I run into a situation when memory usage gets increased hitting out_of_memory -> panic state due to completely blocking of logic that frees vmap areas in the __purge_vmap_area_lazy() function. Establish a threshold passing which the freeing is prioritized back over allocation creating a balance between each other. Using vmalloc test driver in "stress mode", i.e. When all available test cases are run simultaneously on all online CPUs applying a pressure on the vmalloc subsystem, my HiKey 960 board runs out of memory due to the fact that __purge_vmap_area_lazy() logic simply is not able to free pages in time. How I run it: 1) You should build your kernel with CONFIG_TEST_VMALLOC=m 2) ./tools/testing/selftests/vm/test_vmalloc.sh stress During this test "vmap_lazy_nr" pages will go far beyond acceptable lazy_max_pages() threshold, that will lead to enormous busy list size and other problems including allocation time and so on. Link: http://lkml.kernel.org/r/20190124115648.9433-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tejun Heo <tj@kernel.org> Cc: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: Add flag for freeing of special permsissionsRick Edgecombe2019-04-301-19/+94
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new flag VM_FLUSH_RESET_PERMS, for enabling vfree operations to immediately clear executable TLB entries before freeing pages, and handle resetting permissions on the directmap. This flag is useful for any kind of memory with elevated permissions, or where there can be related permissions changes on the directmap. Today this is RO+X and RO memory. Although this enables directly vfreeing non-writeable memory now, non-writable memory cannot be freed in an interrupt because the allocation itself is used as a node on deferred free list. So when RO memory needs to be freed in an interrupt the code doing the vfree needs to have its own work queue, as was the case before the deferred vfree list was added to vmalloc. For architectures with set_direct_map_ implementations this whole operation can be done with one TLB flush when centralized like this. For others with directmap permissions, currently only arm64, a backup method using set_memory functions is used to reset the directmap. When arm64 adds set_direct_map_ functions, this backup can be removed. When the TLB is flushed to both remove TLB entries for the vmalloc range mapping and the direct map permissions, the lazy purge operation could be done to try to save a TLB flush later. However today vm_unmap_aliases could flush a TLB range that does not include the directmap. So a helper is added with extra parameters that can allow both the vmalloc address and the direct mapping to be flushed during this operation. The behavior of the normal vm_unmap_aliases function is unchanged. Suggested-by: Dave Hansen <dave.hansen@intel.com> Suggested-by: Andy Lutomirski <luto@kernel.org> Suggested-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <akpm@linux-foundation.org> Cc: <ard.biesheuvel@linaro.org> Cc: <deneen.t.dock@intel.com> Cc: <kernel-hardening@lists.openwall.com> Cc: <kristen@linux.intel.com> Cc: <linux_dti@icloud.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190426001143.4983-17-namit@vmware.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* docs/core-api/mm: fix return value descriptions in mm/Mike Rapoport2019-03-061-10/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | Many kernel-doc comments in mm/ have the return value descriptions either misformatted or omitted at all which makes kernel-doc script unhappy: $ make V=1 htmldocs ... ./mm/util.c:36: info: Scanning doc for kstrdup ./mm/util.c:41: warning: No description found for return value of 'kstrdup' ./mm/util.c:57: info: Scanning doc for kstrdup_const ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const' ./mm/util.c:75: info: Scanning doc for kstrndup ./mm/util.c:83: warning: No description found for return value of 'kstrndup' ... Fixing the formatting and adding the missing return value descriptions eliminates ~100 such warnings. Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* docs/mm: vmalloc: re-indent kernel-doc comemntsMike Rapoport2019-03-061-185/+182Star
| | | | | | | | | | | | | | | | | | | Some kernel-doc comments in mm/vmalloc.c have leading tab in indentation. This leads to excessive indentation in the generated HTML and to the inconsistency of its layout ([1] vs [2]). Besides, multi-line Note: sections are not handled properly with extra indentation. [1] https://www.kernel.org/doc/html/v4.20/core-api/mm-api.html?#c.vm_map_ram [2] https://www.kernel.org/doc/html/v4.20/core-api/mm-api.html?#c.vfree Link: http://lkml.kernel.org/r/1549549644-4903-2-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512!Uladzislau Rezki (Sony)2019-03-061-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One of the vmalloc stress test case triggers the kernel BUG(): <snip> [60.562151] ------------[ cut here ]------------ [60.562154] kernel BUG at mm/vmalloc.c:512! [60.562206] invalid opcode: 0000 [#1] PREEMPT SMP PTI [60.562247] CPU: 0 PID: 430 Comm: vmalloc_test/0 Not tainted 4.20.0+ #161 [60.562293] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [60.562351] RIP: 0010:alloc_vmap_area+0x36f/0x390 <snip> it can happen due to big align request resulting in overflowing of calculated address, i.e. it becomes 0 after ALIGN()'s fixup. Fix it by checking if calculated address is within vstart/vend range. Link: http://lkml.kernel.org/r/20190124115648.9433-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joel Fernandes <joelaf@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmalloc: export __vmalloc_node_range for CONFIG_TEST_VMALLOC_MODULEUladzislau Rezki (Sony)2019-03-061-0/+9
| | | | | | | | | | | | | | | | | | | | | | Export __vmaloc_node_range() function if CONFIG_TEST_VMALLOC_MODULE is enabled. Some test cases in vmalloc test suite module require and make use of that function. Please note, that it is not supposed to be used for other purposes. We need it only for performance analysis, stressing and stability check of vmalloc allocator. Link: http://lkml.kernel.org/r/20190103142108.20744-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: pass VM_USERMAP flags directly to __vmalloc_node_range()Roman Penyaev2019-03-061-23/+8Star
| | | | | | | | | | | | | | | | | vmalloc_user*() calls differ from normal vmalloc() only in that they set VM_USERMAP flags for the area. During the whole history of vmalloc.c changes now it is possible simply to pass VM_USERMAP flags directly to __vmalloc_node_range() call instead of finding the area (which obviously takes time) after the allocation. Link: http://lkml.kernel.org/r/20190103145954.16942-4-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joe Perches <joe@perches.com> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: do not call kmemleak_free() on not yet accounted memoryRoman Penyaev2019-03-061-5/+11
| | | | | | | | | | | | | | | | __vmalloc_area_node() calls vfree() on error path, which in turn calls kmemleak_free(), but area is not yet accounted by kmemleak_vmalloc(). Link: http://lkml.kernel.org/r/20190103145954.16942-3-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joe Perches <joe@perches.com> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: fix size check for remap_vmalloc_range_partial()Roman Penyaev2019-03-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When VM_NO_GUARD is not set area->size includes adjacent guard page, thus for correct size checking get_vm_area_size() should be used, but not area->size. This fixes possible kernel oops when userspace tries to mmap an area on 1 page bigger than was allocated by vmalloc_user() call: the size check inside remap_vmalloc_range_partial() accounts non-existing guard page also, so check successfully passes but vmalloc_to_page() returns NULL (guard page does not physically exist). The following code pattern example should trigger an oops: static int oops_mmap(struct file *file, struct vm_area_struct *vma) { void *mem; mem = vmalloc_user(4096); BUG_ON(!mem); /* Do not care about mem leak */ return remap_vmalloc_range(vma, mem, 0); } And userspace simply mmaps size + PAGE_SIZE: mmap(NULL, 8192, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0); Possible candidates for oops which do not have any explicit size checks: *** drivers/media/usb/stkwebcam/stk-webcam.c: v4l_stk_mmap[789] ret = remap_vmalloc_range(vma, sbuf->buffer, 0); Or the following one: *** drivers/video/fbdev/core/fbmem.c static int fb_mmap(struct file *file, struct vm_area_struct * vma) ... res = fb->fb_mmap(info, vma); Where fb_mmap callback calls remap_vmalloc_range() directly without any explicit checks: *** drivers/video/fbdev/vfb.c static int vfb_mmap(struct fb_info *info, struct vm_area_struct *vma) { return remap_vmalloc_range(vma, (void *)info->fix.smem_start, vma->vm_pgoff); } Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joe Perches <joe@perches.com> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: make vmalloc_32_user() align base kernel virtual address to SHMLBARoman Penyaev2019-03-061-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch repeats the original one from David S Miller: 2dca6999eed5 ("mm, perf_event: Make vmalloc_user() align base kernel virtual address to SHMLBA") but for missed vmalloc_32_user() case, which also requires correct alignment of virtual address on kernel side to avoid D-caches aliases. A bit of copy-paste from original patch to recover in memory of what is all about: When a vmalloc'd area is mmap'd into userspace, some kind of co-ordination is necessary for this to work on platforms with cpu D-caches which can have aliases. Otherwise kernel side writes won't be seen properly in userspace and vice versa. If the kernel side mapping and the user side one have the same alignment, modulo SHMLBA, this can work as long as VM_SHARED is shared of VMA and for all current users this is true. VM_SHARED will force SHMLBA alignment of the user side mmap on platforms with D-cache aliasing matters. David S. Miller > What are the user-visible runtime effects of this change? In simple words: proper alignment avoids possible difference in data, seen by different virtual mapings: userspace and kernel in our case. I.e. userspace reads cache line A, kernel writes to cache line B. Both cache lines correspond to the same physical memory (thus aliases). So this should fix data corruption for archs with vivt and vipt caches, e.g. armv6. Personally I've never worked with this archs, I just spotted the strange difference in code: for one case we do alignment, for another - not. I have a strong feeling that David simply missed vmalloc_32_user() case. > > Is a -stable backport needed? No, I do not think so. The only one user of vmalloc_32_user() is virtual frame buffer device drivers/video/fbdev/vfb.c, which has in the description "The main use of this frame buffer device is testing and debugging the frame buffer subsystem. Do NOT enable it for normal systems!". And it seems to me that this vfb.c does not need 32bit addressable pages (vmalloc_32_user() case), because it is virtual device and should not care about things like dma32 zones, etc. Probably is better to clean the code and switch vfb.c from vmalloc_32_user() to vmalloc_user() case and wipe out vmalloc_32_user() from vmalloc.c completely. But I'm not very much sure that this is worth to do, that's so minor, so we can leave it as is. Link: http://lkml.kernel.org/r/20190108110944.23591-1-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Michal Hocko <mhocko@suse.com> Cc: David S. Miller <davem@davemloft.net> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: don't dereference possible NULL pointer in __vunmap()Liviu Dudau2019-03-061-1/+1
| | | | | | | | | | | | | | | | find_vmap_area() can return a NULL pointer and we're going to dereference it without checking it first. Use the existing find_vm_area() function which does exactly what we want and checks for the NULL pointer. Link: http://lkml.kernel.org/r/20181228171009.22269-1-liviu@dudau.co.uk Fixes: f3c01d2f3ade ("mm: vmalloc: avoid racy handling of debugobjects in vunmap") Signed-off-by: Liviu Dudau <liviu@dudau.co.uk> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Chintan Pandya <cpandya@codeaurora.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: convert totalram_pages and totalhigh_pages variables to atomicArun KS2018-12-281-2/+2
| | | | | | | | | | | | | | | | | | | | | | | totalram_pages and totalhigh_pages are made static inline function. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vfree: add debug might_sleep()Andrey Ryabinin2018-10-271-0/+2
| | | | | | | | | | | | | Add might_sleep() call to vfree() to catch potential sleep-in-atomic bugs earlier. [aryabinin@virtuozzo.com: drop might_sleep_if() from kvfree()] Link: http://lkml.kernel.org/r/7e19e4df-b1a6-29bd-9ae7-0266d50bef1d@virtuozzo.com Link: http://lkml.kernel.org/r/20180914130512.10394-3-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: improve vfree() kerneldocAndrey Ryabinin2018-10-271-0/+2
| | | | | | | | | | | vfree() might sleep if called not in interrupt context. Explain that in the comment. Link: http://lkml.kernel.org/r/20180914130512.10394-2-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: provide a fallback for PAGE_KERNEL_EXEC for architecturesLuis R. Rodriguez2018-08-181-4/+0Star
| | | | | | | | | | | | | | | | Some architectures just don't have PAGE_KERNEL_EXEC. The mm/nommu.c and mm/vmalloc.c code have been using PAGE_KERNEL as a fallback for years. Move this fallback to asm-generic. Link: http://lkml.kernel.org/r/20180510185507.2439-3-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: use octal not symbolic permissionsJoe Perches2018-06-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mm/*.c files use symbolic and octal styles for permissions. Using octal and not symbolic permissions is preferred by many as more readable. https://lkml.org/lkml/2016/8/2/1945 Prefer the direct use of octal for permissions. Done using $ scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace mm/*.c and some typing. Before: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l 44 After: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l 86 Miscellanea: o Whitespace neatening around these conversions. Link: http://lkml.kernel.org/r/2e032ef111eebcd4c5952bae86763b541d373469.1522102887.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmalloc: pass proper vm_start into debugobjectsChintan Pandya2018-06-081-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | Client can call vunmap with some intermediate 'addr' which may not be the start of the VM area. Entire unmap code works with vm->vm_start which is proper but debug object API is called with 'addr'. This could be a problem within debug objects. Pass proper start address into debug object API. [akpm@linux-foundation.org: fix warning] Link: http://lkml.kernel.org/r/1523961828-9485-3-git-send-email-cpandya@codeaurora.org Signed-off-by: Chintan Pandya <cpandya@codeaurora.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yisheng Xie <xieyisheng1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmalloc: avoid racy handling of debugobjects in vunmapChintan Pandya2018-06-081-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, __vunmap flow is, 1) Release the VM area 2) Free the debug objects corresponding to that vm area. This leave some race window open. 1) Release the VM area 1.5) Some other client gets the same vm area 1.6) This client allocates new debug objects on the same vm area 2) Free the debug objects corresponding to this vm area. Here, we actually free 'other' client's debug objects. Fix this by freeing the debug objects first and then releasing the VM area. Link: http://lkml.kernel.org/r/1523961828-9485-2-git-send-email-cpandya@codeaurora.org Signed-off-by: Chintan Pandya <cpandya@codeaurora.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yisheng Xie <xieyisheng1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmalloc: clean up vunmap to avoid pgtable ops twiceChintan Pandya2018-06-081-22/+7Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vunmap does page table clear operations twice in the case when DEBUG_PAGEALLOC_ENABLE_DEFAULT is enabled. So, clean up the code as that is unintended. As a perf gain, we save few us. Below ftrace data was obtained while doing 1 MB of vmalloc/vfree on ARM64 based SoC *without* this patch applied. After this patch, we can save ~3 us (on 1 extra vunmap_page_range). CPU DURATION FUNCTION CALLS | | | | | | | 6) | __vunmap() { 6) | vmap_debug_free_range() { 6) 3.281 us | vunmap_page_range(); 6) + 45.468 us | } 6) 2.760 us | vunmap_page_range(); 6) ! 505.105 us | } [cpandya@codeaurora.org: v3] Link: http://lkml.kernel.org/r/1525176960-18408-1-git-send-email-cpandya@codeaurora.org Link: http://lkml.kernel.org/r/1523876342-10545-1-git-send-email-cpandya@codeaurora.org Signed-off-by: Chintan Pandya <cpandya@codeaurora.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Laura Abbott <labbott@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Yisheng Xie <xieyisheng1@huawei.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: introduce proc_create_seq_privateChristoph Hellwig2018-05-161-15/+3Star
| | | | | | | | | | Variant of proc_create_data that directly take a struct seq_operations argument + a private state size and drastically reduces the boilerplate code in the callers. All trivial callers converted over. Signed-off-by: Christoph Hellwig <hch@lst.de>
* proc: introduce proc_create_seq{,_data}Christoph Hellwig2018-05-161-5/+6
| | | | | | | | | Variants of proc_create{,_data} that directly take a struct seq_operations argument and drastically reduces the boilerplate code in the callers. All trivial callers converted over. Signed-off-by: Christoph Hellwig <hch@lst.de>
* vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systemsMichal Hocko2018-02-221-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kai Heng Feng has noticed that BUG_ON(PageHighMem(pg)) triggers in drivers/media/common/saa7146/saa7146_core.c since 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly"). saa7146_vmalloc_build_pgtable uses vmalloc_32 and it is reasonable to expect that the resulting page is not in highmem. The above commit aimed to add __GFP_HIGHMEM only for those requests which do not specify any zone modifier gfp flag. vmalloc_32 relies on GFP_VMALLOC32 which should do the right thing. Except it has been missed that GFP_VMALLOC32 is an alias for GFP_KERNEL on 32b architectures. Thanks to Matthew to notice this. Fix the problem by unconditionally setting GFP_DMA32 in GFP_VMALLOC32 for !64b arches (as a bailout). This should do the right thing and use ZONE_NORMAL which should be always below 4G on 32b systems. Debugged by Matthew Wilcox. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20180212095019.GX21609@dhcp22.suse.cz Fixes: 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly”) Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Kai Heng Feng <kai.heng.feng@canonical.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Laura Abbott <labbott@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Revert "vmalloc: back off when the current task is killed"Johannes Weiner2017-10-141-6/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commits 5d17a73a2ebe ("vmalloc: back off when the current task is killed") and 171012f56127 ("mm: don't warn when vmalloc() fails due to a fatal signal"). Commit 5d17a73a2ebe ("vmalloc: back off when the current task is killed") made all vmalloc allocations from a signal-killed task fail. We have seen crashes in the tty driver from this, where a killed task exiting tries to switch back to N_TTY, fails n_tty_open because of the vmalloc failing, and later crashes when dereferencing tty->disc_data. Arguably, relying on a vmalloc() call to succeed in order to properly exit a task is not the most robust way of doing things. There will be a follow-up patch to the tty code to fall back to the N_NULL ldisc. But the justification to make that vmalloc() call fail like this isn't convincing, either. The patch mentions an OOM victim exhausting the memory reserves and thus deadlocking the machine. But the OOM killer is only one, improbable source of fatal signals. It doesn't make sense to fail allocations preemptively with plenty of memory in most cases. The patch doesn't mention real-life instances where vmalloc sites would exhaust memory, which makes it sound more like a theoretical issue to begin with. But just in case, the OOM access to memory reserves has been restricted on the allocator side in cd04ae1e2dc8 ("mm, oom: do not rely on TIF_MEMDIE for memory reserves access"), which should take care of any theoretical concerns on that front. Revert this patch, and the follow-up that suppresses the allocation warnings when we fail the allocations due to a signal. Link: http://lkml.kernel.org/r/20171004185906.GB2136@cmpxchg.org Fixes: 171012f56127 ("mm: don't warn when vmalloc() fails due to a fatal signal") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alan Cox <alan@llwyncelyn.cymru> Cc: Christoph Hellwig <hch@lst.de> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: don't reinvent the wheel but use existing llist APIByungchul Park2017-09-071-6/+4Star
| | | | | | | | | | | | | | Although llist provides proper APIs, they are not used. Make them used. Link: http://lkml.kernel.org/r/1502095374-16112-1-git-send-email-byungchul.park@lge.com Signed-off-by: Byungchul Park <byungchul.park@lge.com> Cc: zijun_hu <zijun_hu@htc.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joel Fernandes <joelaf@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()Wei Yang2017-09-071-7/+3Star
| | | | | | | | | | | | | | | | | | | | In pcpu_get_vm_areas(), it checks each range is not overlapped. To make sure it is, only (N^2)/2 comparison is necessary, while current code does N^2 times. By starting from the next range, it achieves the goal and the continue could be removed. Also, - the overlap check of two ranges could be done with one clause - one typo in comment is fixed. Link: http://lkml.kernel.org/r/20170803063822.48702-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: don't unconditonally use __GFP_HIGHMEMLaura Abbott2017-08-191-5/+8
| | | | | | | | | | | | | | | | | | | | Commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly") added use of __GFP_HIGHMEM for allocations. vmalloc_32 may use GFP_DMA/GFP_DMA32 which does not play nice with __GFP_HIGHMEM and will trigger a BUG in gfp_zone. Only add __GFP_HIGHMEM if we aren't using GFP_DMA/GFP_DMA32. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1482249 Link: http://lkml.kernel.org/r/20170816220705.31374-1-labbott@redhat.com Fixes: 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly") Signed-off-by: Laura Abbott <labbott@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful ↵Michal Hocko2017-07-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | semantic __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocations requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail so they might end up looping in the page allocator for ever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of __GFP_REPEAT flag has been removed for !costly requests we can give the original flag a better name and more importantly a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that the allocator would try really hard but there is no promise of a success. This will work independent of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most light weight mode which even doesn't kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic allocation without any attempt to free memory from the current context but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior so failures have to be checked properly by callers (e.g. OOM killer victim is allowed to fail currently). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator. [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c] [mhocko@suse.com: semantic fix] Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz [mhocko@kernel.org: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmalloc: show lazy-purged vma info in vmallocinfoYisheng Xie2017-07-111-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When ioremap a 67112960 bytes vm_area with the vmallocinfo: [..] 0xec79b000-0xec7fa000 389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc 0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap we get the result: 0xf1000000-0xf5001000 67112960 devm_ioremap+0x38/0x7c phys=40000000 ioremap For the align for ioremap must be less than '1 << IOREMAP_MAX_ORDER': if (flags & VM_IOREMAP) align = 1ul << clamp_t(int, get_count_order_long(size), PAGE_SHIFT, IOREMAP_MAX_ORDER); So it makes idiot like me a litte puzzled why this was a jump the vm_area from 0xec800000-0xecbe1000 to 0xf1000000-0xf5001000, and leaving 0xed000000-0xf1000000 as a big hole. This patch is to show all of vm_area, including vmas which are freeing but still in the vmap_area_list, to make it more clear about why we will get 0xf1000000-0xf5001000 in the above case. And we will get a vmallocinfo like: [..] 0xec79b000-0xec7fa000 389120 ftl_add_mtd+0x4d0/0x754 pages=94 vmalloc 0xec800000-0xecbe1000 4067328 kbox_proc_mem_write+0x104/0x1c4 phys=8b520000 ioremap [..] 0xece7c000-0xece7e000 8192 unpurged vm_area 0xece7e000-0xece83000 20480 vm_map_ram 0xf0099000-0xf00aa000 69632 vm_map_ram after this patch. Link: http://lkml.kernel.org/r/1496649682-20710-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: zijun_hu <zijun_hu@htc.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: kmemleak: treat vm_struct as alternative reference to vmalloc'ed objectsCatalin Marinas2017-07-071-6/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | Kmemleak requires that vmalloc'ed objects have a minimum reference count of 2: one in the corresponding vm_struct object and the other owned by the vmalloc() caller. There are cases, however, where the original vmalloc() returned pointer is lost and, instead, a pointer to vm_struct is stored (see free_thread_stack()). Kmemleak currently reports such objects as leaks. This patch adds support for treating any surplus references to an object as additional references to a specified object. It introduces the kmemleak_vmalloc() API function which takes a vm_struct pointer and sets its surplus reference passing to the actual vmalloc() returned pointer. The __vmalloc_node_range() calling site has been modified accordingly. Link: http://lkml.kernel.org/r/1495726937-23557-4-git-send-email-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: "Luis R. Rodriguez" <mcgrof@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: huge-vmap: fail gracefully on unexpected huge vmap mappingsArd Biesheuvel2017-06-241-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Existing code that uses vmalloc_to_page() may assume that any address for which is_vmalloc_addr() returns true may be passed into vmalloc_to_page() to retrieve the associated struct page. This is not un unreasonable assumption to make, but on architectures that have CONFIG_HAVE_ARCH_HUGE_VMAP=y, it no longer holds, and we need to ensure that vmalloc_to_page() does not go off into the weeds trying to dereference huge PUDs or PMDs as table entries. Given that vmalloc() and vmap() themselves never create huge mappings or deal with compound pages at all, there is no correct answer in this case, so return NULL instead, and issue a warning. When reading /proc/kcore on arm64, you will hit an oops as soon as you hit the huge mappings used for the various segments that make up the mapping of vmlinux. With this patch applied, you will no longer hit the oops, but the kcore contents willl be incorrect (these regions will be zeroed out) We are fixing this for kcore specifically, so it avoids vread() for those regions. At least one other problematic user exists, i.e., /dev/kmem, but that is currently broken on arm64 for other reasons. Link: http://lkml.kernel.org/r/20170609082226.26152-1-ard.biesheuvel@linaro.org Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Laura Abbott <labbott@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>