summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* memcg: fix/change behavior of shared anon at moving taskKAMEZAWA Hiroyuki2012-05-304-53/+18Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch changes memcg's behavior at task_move(). At task_move(), the kernel scans a task's page table and move the changes for mapped pages from source cgroup to target cgroup. There has been a bug at handling shared anonymous pages for a long time. Before patch: - The spec says 'shared anonymous pages are not moved.' - The implementation was 'shared anonymoys pages may be moved'. If page_mapcount <=2, shared anonymous pages's charge were moved. After patch: - The spec says 'all anonymous pages are moved'. - The implementation is 'all anonymous pages are moved'. Considering usage of memcg, this will not affect user's experience. 'shared anonymous' pages only exists between a tree of processes which don't do exec(). Moving one of process without exec() seems not sane. For example, libcgroup will not be affected by this change. (Anyway, no one noticed the implementation for a long time...) Below is a discussion log: - current spec/implementation are complex - Now, shared file caches are moved - It adds unclear check as page_mapcount(). To do correct check, we should check swap users, etc. - No one notice this implementation behavior. So, no one get benefit from the design. - In general, once task is moved to a cgroup for running, it will not be moved.... - Finally, we have control knob as memory.move_charge_at_immigrate. Here is a patch to allow moving shared pages, completely. This makes memcg simpler and fix current broken code. Suggested-by: Hugh Dickins <hughd@google.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memblock: fix memory leak on extending regionsGavin Shan2012-05-301-13/+24
| | | | | | | | | | | | | | | | | | | | | The overall memblock has been organized into the memory regions and reserved regions. Initially, the memory regions and reserved regions are stored in the predetermined arrays of "struct memblock _region". It's possible for the arrays to be enlarged when we have newly added regions, but no free space left there. The policy here is to create double-sized array either by slab allocator or memblock allocator. Unfortunately, we didn't free the old array, which might be allocated through slab allocator before. That would cause memory leak. The patch introduces 2 variables to trace where (slab or memblock) the memory and reserved regions come from. The memory for the memory or reserved regions will be deallocated by kfree() if that was allocated by slab allocator. Thus to fix the memory leak issue. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memblock: cleanup on duplicate VA/PA conversionGavin Shan2012-05-301-2/+3
| | | | | | | | | | | | | | | | | | | The overall memblock has been organized into the memory regions and reserved regions. Initially, the memory regions and reserved regions are stored in the predetermined arrays of "struct memblock _region". It's possible for the arrays to be enlarged when we have newly added regions for them, but no enough space there. Under the situation, We will created double-sized array to meet the requirement. However, the original implementation converted the VA (Virtual Address) of the newly allocated array of regions to PA (Physical Address), then translate back when we allocates the new array from slab. That's actually unnecessary. The patch removes the duplicate VA/PA conversion. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: fix slab->page flags corruptionPravin B Shelar2012-05-302-2/+37
| | | | | | | | | | | | | | | | | | | Transparent huge pages can change page->flags (PG_compound_lock) without taking Slab lock. Since THP can not break slab pages we can safely access compound page without taking compound lock. Specifically this patch fixes a race between compound_unlock() and slab functions which perform page-flags updates. This can occur when get_page()/put_page() is called on a page from slab. [akpm@linux-foundation.org: tweak comment text, fix comment layout, fix label indenting] Reported-by: Amey Bhide <abhide@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: fix faulty initialization in vmalloc_init()KyongHo2012-05-301-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | The transfer of ->flags causes some of the static mapping virtual addresses to be prematurely freed (before the mapping is removed) because VM_LAZY_FREE gets "set" if tmp->flags has VM_IOREMAP set. This might cause subsequent vmalloc/ioremap calls to fail because it might allocate one of the freed virtual address ranges that aren't unmapped. va->flags has different types of flags from tmp->flags. If a region with VM_IOREMAP set is registered with vm_area_add_early(), it will be removed by __purge_vmap_area_lazy(). Fix vmalloc_init() to correctly initialize vmap_area for the given vm_struct. Also initialise va->vm. If it is not set, find_vm_area() for the early vm regions will always fail. Signed-off-by: KyongHo Cho <pullip.cho@samsung.com> Cc: "Olav Haugan" <ohaugan@codeaurora.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race conditionAndrea Arcangeli2012-05-302-2/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When holding the mmap_sem for reading, pmd_offset_map_lock should only run on a pmd_t that has been read atomically from the pmdp pointer, otherwise we may read only half of it leading to this crash. PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic" #0 [f06a9dd8] crash_kexec at c049b5ec #1 [f06a9e2c] oops_end at c083d1c2 #2 [f06a9e40] no_context at c0433ded #3 [f06a9e64] bad_area_nosemaphore at c043401a #4 [f06a9e6c] __do_page_fault at c0434493 #5 [f06a9eec] do_page_fault at c083eb45 #6 [f06a9f04] error_code (via page_fault) at c083c5d5 EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP: 00000000 DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0 CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246 #7 [f06a9f38] _spin_lock at c083bc14 #8 [f06a9f44] sys_mincore at c0507b7d #9 [f06a9fb0] system_call at c083becd start len EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00 SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033 CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286 This should be a longstanding bug affecting x86 32bit PAE without THP. Only archs with 64bit large pmd_t and 32bit unsigned long should be affected. With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad() would partly hide the bug when the pmd transition from none to stable, by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is enabled a new set of problem arises by the fact could then transition freely in any of the none, pmd_trans_huge or pmd_trans_stable states. So making the barrier in pmd_none_or_trans_huge_or_clear_bad() unconditional isn't good idea and it would be a flakey solution. This should be fully fixed by introducing a pmd_read_atomic that reads the pmd in order with THP disabled, or by reading the pmd atomically with cmpxchg8b with THP enabled. Luckily this new race condition only triggers in the places that must already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix is localized there but this bug is not related to THP. NOTE: this can trigger on x86 32bit systems with PAE enabled with more than 4G of ram, otherwise the high part of the pmd will never risk to be truncated because it would be zero at all times, in turn so hiding the SMP race. This bug was discovered and fully debugged by Ulrich, quote: ---- [..] pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and eax. 496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd) 497 { 498 /* depend on compiler for an atomic pmd read */ 499 pmd_t pmdval = *pmd; // edi = pmd pointer 0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi ... // edx = PTE page table high address 0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx ... // eax = PTE page table low address 0xc0507a8e <sys_mincore+574>: mov (%edi),%eax [..] Please note that the PMD is not read atomically. These are two "mov" instructions where the high order bits of the PMD entry are fetched first. Hence, the above machine code is prone to the following race. - The PMD entry {high|low} is 0x0000000000000000. The "mov" at 0xc0507a84 loads 0x00000000 into edx. - A page fault (on another CPU) sneaks in between the two "mov" instructions and instantiates the PMD. - The PMD entry {high|low} is now 0x00000003fda38067. The "mov" at 0xc0507a8e loads 0xfda38067 into eax. ---- Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, oom: normalize oom scores to oom_score_adj scale only for userspaceDavid Rientjes2012-05-303-32/+22Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The oom_score_adj scale ranges from -1000 to 1000 and represents the proportion of memory available to the process at allocation time. This means an oom_score_adj value of 300, for example, will bias a process as though it was using an extra 30.0% of available memory and a value of -350 will discount 35.0% of available memory from its usage. The oom killer badness heuristic also uses this scale to report the oom score for each eligible process in determining the "best" process to kill. Thus, it can only differentiate each process's memory usage by 0.1% of system RAM. On large systems, this can end up being a large amount of memory: 256MB on 256GB systems, for example. This can be fixed by having the badness heuristic to use the actual memory usage in scoring threads and then normalizing it to the oom_score_adj scale for userspace. This results in better comparison between eligible threads for kill and no change from the userspace perspective. Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: avoid swapping out with swappiness==0Satoru Moriya2012-05-301-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | Sometimes we'd like to avoid swapping out anonymous memory. In particular, avoid swapping out pages of important process or process groups while there is a reasonable amount of pagecache on RAM so that we can satisfy our customers' requirements. OTOH, we can control how aggressive the kernel will swap memory pages with /proc/sys/vm/swappiness for global and /sys/fs/cgroup/memory/memory.swappiness for each memcg. But with current reclaim implementation, the kernel may swap out even if we set swappiness=0 and there is pagecache in RAM. This patch changes the behavior with swappiness==0. If we set swappiness==0, the kernel does not swap out completely (for global reclaim until the amount of free pages and filebacked pages in a zone has been reduced to something very very small (nr_free + nr_filebacked < high watermark)). Signed-off-by: Satoru Moriya <satoru.moriya@hds.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: fix resv_map leak in error pathDave Hansen2012-05-301-6/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When called for anonymous (non-shared) mappings, hugetlb_reserve_pages() does a resv_map_alloc(). It depends on code in hugetlbfs's vm_ops->close() to release that allocation. However, in the mmap() failure path, we do a plain unmap_region() without the remove_vma() which actually calls vm_ops->close(). This is a decent fix. This leak could get reintroduced if new code (say, after hugetlb_reserve_pages() in hugetlbfs_file_mmap()) decides to return an error. But, I think it would have to unroll the reservation anyway. Christoph's test case: http://marc.info/?l=linux-mm&m=133728900729735 This patch applies to 3.4 and later. A version for earlier kernels is at https://lkml.org/lkml/2012/5/22/418. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: Christoph Lameter <cl@linux.com> Tested-by: Christoph Lameter <cl@linux.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [2.6.32+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/bootmem.c: cleanup on addition to bootmem data listGavin Shan2012-05-301-8/+8
| | | | | | | | | | | | | | | | | | | | The objects of "struct bootmem_data_t" are linked together to form double-linked list sequentially based on its minimal page frame number. The current implementation implicitly supports the following cases, which means the inserting point for current bootmem data depends on how "list_for_each" works. That makes the code a little hard to read. Besides, "list_for_each" and "list_entry" can be replaced with "list_for_each_entry". - The linked list is empty. - There has no entry in the linked list, whose minimal page frame number is bigger than current one. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: consider all swapped back pages in used-once logicMichal Hocko2012-05-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 645747462435 ("vmscan: detect mapped file pages used only once") made mapped pages have another round in inactive list because they might be just short lived and so we could consider them again next time. This heuristic helps to reduce pressure on the active list with a streaming IO worklods. This patch fixes a regression introduced by this commit for heavy shmem based workloads because unlike Anon pages, which are excluded from this heuristic because they are usually long lived, shmem pages are handled as a regular page cache. This doesn't work quite well, unfortunately, if the workload is mostly backed by shmem (in memory database sitting on 80% of memory) with a streaming IO in the background (backup - up to 20% of memory). Anon inactive list is full of (dirty) shmem pages when watermarks are hit. Shmem pages are kept in the inactive list (they are referenced) in the first round and it is hard to reclaim anything else so we reach lower scanning priorities very quickly which leads to an excessive swap out. Let's fix this by excluding all swap backed pages (they tend to be long lived wrt. the regular page cache anyway) from used-once heuristic and rather activate them if they are referenced. The customer's workload is shmem backed database (80% of RAM) and they are measuring transactions/s with an IO in the background (20%). Transactions touch more or less random rows in the table. The transaction rate fell by a factor of 3 (in the worst case) because of commit 64574746. This patch restores the previous numbers. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [2.6.34+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: document the meminfo and vmstat fields of relevance to transparent hugepagesMel Gorman2012-05-302-0/+64
| | | | | | | | | | | Update Documentation/vm/transhuge.txt and Documentation/filesystems/proc.txt with some information on monitoring transparent huge page usage and the associated overhead. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_alloc.c: cleanupsAndrew Morton2012-05-301-4/+3Star
| | | | | | | | | | | - make pageflag_names[] const - remove null termination of pageflag_names[] Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: page_alloc: catch out-of-date list of page flag namesJohannes Weiner2012-05-301-0/+2
| | | | | | | | | | | String tables with names of enum items are always prone to go out of sync with the enums themselves. Ensure during compile time that the name table of page flags has the same size as the page flags enum. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/buddy: dump PG_compound_lock page flagGavin Shan2012-05-301-0/+3
| | | | | | | | | | | | | | | | | The array pageflag_names[] does conversion from page flags into their corresponding names so that a meaningful representation of the corresponding page flag can be printed. This mechanism is used while dumping page frames. However, the array missed PG_compound_lock. So the PG_compound_lock page flag would be printed as a digital number instead of a meaningful string. The patch fixes that and prints "compound_lock" for the PG_compound_lock page flag. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: move readahead syscall to mm/readahead.cCong Wang2012-05-302-39/+40
| | | | | | | | | | | It is better to define readahead(2) in mm/readahead.c than in mm/filemap.c. Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* tmpfs: support SEEK_DATA and SEEK_HOLEHugh Dickins2012-05-301-1/+93
| | | | | | | | | | | | | | | | | | | | | | | | | It's quite easy for tmpfs to scan the radix_tree to support llseek's new SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are still on my mind (in particular, the !PageUptodate-ness of pages fallocated but still unwritten). But I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it would be of any use to them on tmpfs. This code adds 92 lines and 752 bytes on x86_64 - is that bloat or worthwhile? [akpm@linux-foundation.org: fix warning with CONFIG_TMPFS=n] Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Josef Bacik <josef@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andreas Dilger <adilger@dilger.ca> Cc: Dave Chinner <david@fromorbit.com> Cc: Marco Stornelli <marco.stornelli@gmail.com> Cc: Jeff liu <jeff.liu@oracle.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* tmpfs: quit when fallocate fills memoryHugh Dickins2012-05-301-2/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As it stands, a large fallocate() on tmpfs is liable to fill memory with pages, freed on failure except when they run into swap, at which point they become fixed into the file despite the failure. That feels quite wrong, to be consuming resources precisely when they're in short supply. Go the other way instead: shmem_fallocate() indicate the range it has fallocated to shmem_writepage(), keeping count of pages it's allocating; shmem_writepage() reactivate instead of swapping out pages fallocated by this syscall (but happily swap out those from earlier occasions), keeping count; shmem_fallocate() compare counts and give up once the reactivated pages have started to coming back to writepage (approximately: some zones would in fact recycle faster than others). This is a little unusual, but works well: although we could consider the failure to swap as a bug, and fix it later with SWAP_MAP_FALLOC handling added in swapfile.c and memcontrol.c, I doubt that we shall ever want to. (If there's no swap, an over-large fallocate() on tmpfs is limited in the same way as writing: stopped by rlimit, or by tmpfs mount size if that was set sensibly, or by __vm_enough_memory() heuristics if OVERCOMMIT_GUESS or OVERCOMMIT_NEVER. If OVERCOMMIT_ALWAYS, then it is liable to OOM-kill others as writing would, but stops and frees if interrupted.) Now that everything is freed on failure, we can then skip updating ctime. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Cong Wang <amwang@redhat.com> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* tmpfs: undo fallocation on failureHugh Dickins2012-05-301-33/+72
| | | | | | | | | | | | | | | | | | | | | | | | | | In the previous episode, we left the already-fallocated pages attached to the file when shmem_fallocate() fails part way through. Now try to do better, by extending the earlier optimization of !Uptodate pages (then always under page lock) to !Uptodate pages (outside of page lock), representing fallocated pages. And don't waste time clearing them at the time of fallocate(), leave that until later if necessary. Adapt shmem_truncate_range() to shmem_undo_range(), so that a failing fallocate can recognize and remove precisely those !Uptodate allocations which it added (and were not independently allocated by racing tasks). But unless we start playing with swapfile.c and memcontrol.c too, once one of our fallocated pages reaches shmem_writepage(), we do then have to instantiate it as an ordinarily allocated page, before swapping out. This is unsatisfactory, but improved in the next episode. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Cong Wang <amwang@redhat.com> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* tmpfs: support fallocate preallocationHugh Dickins2012-05-301-1/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The systemd plumbers expressed a wish that tmpfs support preallocation. Cong Wang wrote a patch, but several kernel guys expressed scepticism: https://lkml.org/lkml/2011/11/18/137 Christoph Hellwig: What for exactly? Please explain why preallocating on tmpfs would make any sense. Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on files on the /dev/shm filesystem. The glibc fallback loop for -ENOSYS [or -EOPNOTSUPP] on fallocate is just ugly. Hugh Dickins: If tmpfs is going to support fallocate(FALLOC_FL_PUNCH_HOLE), it would seem perverse to permit the deallocation but fail the allocation. Christoph Hellwig: Agreed. Now that we do have shmem_fallocate() for hole-punching, plumb in basic support for preallocation mode too. It's fairly straightforward (though quite a few details needed attention), except for when it fails part way through. What a pity that fallocate(2) was not specified to return the length allocated, permitting short fallocations! As it is, when it fails part way through, we ought to free what has just been allocated by this system call; but must be very sure not to free any allocated earlier, or any allocated by racing accesses (not all excluded by i_mutex). But we cannot distinguish them: so in this patch simply leak allocations on partial failure (they will be freed later if the file is removed). An attractive alternative approach would have been for fallocate() not to allocate pages at all, but note reservations by entries in the radix-tree. But that would give less assurance, and, critically, would be hard to fit with mem cgroups (who owns the reservations?): allocating pages lets fallocate() behave in just the same way as write(). Based-on-patch-by: Cong Wang <amwang@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Cong Wang <amwang@redhat.com> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/fs: remove truncate_rangeHugh Dickins2012-05-307-39/+8Star
| | | | | | | | | | | | | | | | | | Remove vmtruncate_range(), and remove the truncate_range method from struct inode_operations: only tmpfs ever supported it, and tmpfs has now converted over to using the fallocate method of file_operations. Update Documentation accordingly, adding (setlease and) fallocate lines. And while we're in mm.h, remove duplicate declarations of shmem_lock() and shmem_file_setup(): everyone is now using the ones in shmem_fs.h. Based-on-patch-by: Cong Wang <amwang@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Cong Wang <amwang@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLEHugh Dickins2012-05-302-11/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Now tmpfs supports hole-punching via fallocate(), switch madvise_remove() to use do_fallocate() instead of vmtruncate_range(): which extends madvise(,,MADV_REMOVE) support from tmpfs to ext4, ocfs2 and xfs. There is one more user of vmtruncate_range() in our tree, staging/android's ashmem_shrink(): convert it to use do_fallocate() too (but if its unpinned areas are already unmapped - I don't know - then it would do better to use shmem_truncate_range() directly). Based-on-patch-by: Cong Wang <amwang@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Colin Cross <ccross@android.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger@dilger.ca> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Ben Myers <bpm@sgi.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* tmpfs: support fallocate FALLOC_FL_PUNCH_HOLEHugh Dickins2012-05-301-11/+57
| | | | | | | | | | | | | | | | | | | tmpfs has supported hole-punching since 2.6.16, via madvise(,,MADV_REMOVE). But nowadays fallocate(,FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,,) is the agreed way to punch holes. So add shmem_fallocate() to support that, and tweak shmem_truncate_range() to support partial pages at both the beginning and end of range (never needed for madvise, which demands rounded addr and rounds up length). Based-on-patch-by: Cong Wang <amwang@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Cong Wang <amwang@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* tmpfs: optimize clearing when writingHugh Dickins2012-05-301-3/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nick proposed years ago that tmpfs should avoid clearing its pages where write will overwrite them with new data, as ramfs has long done. But I messed it up and just got bad data. Tried again recently, it works fine. Here's time output for writing 4GiB 16 times on this Core i5 laptop: before: real 0m21.169s user 0m0.028s sys 0m21.057s real 0m21.382s user 0m0.016s sys 0m21.289s real 0m21.311s user 0m0.020s sys 0m21.217s after: real 0m18.273s user 0m0.032s sys 0m18.165s real 0m18.354s user 0m0.020s sys 0m18.265s real 0m18.440s user 0m0.032s sys 0m18.337s ramfs: real 0m16.860s user 0m0.028s sys 0m16.765s real 0m17.382s user 0m0.040s sys 0m17.273s real 0m17.133s user 0m0.044s sys 0m17.021s Yes, I have done perf reports, but they need more explanation than they deserve: in summary, clear_page vanishes, its cache loading shifts into copy_user_generic_unrolled; shmem_getpage_gfp goes down, and surprisingly mark_page_accessed goes way up - I think because they are respectively where the cache gets to be reloaded after being purged by clear or copy. Suggested-by: Nick Piggin <npiggin@gmail.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* tmpfs: enable NOSEC optimizationHugh Dickins2012-05-301-0/+1
| | | | | | | | | | | | | Let tmpfs into the NOSEC optimization (avoiding file_remove_suid() overhead on most common writes): set MS_NOSEC on its superblocks. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* shmem: replace page if mapping excludes its zoneHugh Dickins2012-05-304-24/+142
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The GMA500 GPU driver uses GEM shmem objects, but with a new twist: the backing RAM has to be below 4GB. Not a problem while the boards supported only 4GB: but now Intel's D2700MUD boards support 8GB, and their GMA3600 is managed by the GMA500 driver. shmem/tmpfs has never pretended to support hardware restrictions on the backing memory, but it might have appeared to do so before v3.1, and even now it works fine until a page is swapped out then back in. When read_cache_page_gfp() supplied a freshly allocated page for copy, that compensated for whatever choice might have been made by earlier swapin readahead; but swapoff was likely to destroy the illusion. We'd like to continue to support GMA500, so now add a new shmem_should_replace_page() check on the zone when about to move a page from swapcache to filecache (in swapin and swapoff cases), with shmem_replace_page() to allocate and substitute a suitable page (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). This does involve a minor extension to mem_cgroup_replace_page_cache() (the page may or may not have already been charged); and I've removed a comment and call to mem_cgroup_uncharge_cache_page(), which in fact is always a no-op while PageSwapCache. Also removed optimization of an unlikely path in shmem_getpage_gfp(), now that we need to check PageSwapCache more carefully (a racing caller might already have made the copy). And at one point shmem_unuse_inode() needs to use the hitherto private page_swapcount(), to guard against racing with inode eviction. It would make sense to extend shmem_should_replace_page(), to cover cpuset and NUMA mempolicy restrictions too, but set that aside for now: needs a cleanup of shmem mempolicy handling, and more testing, and ought to handle swap faults in do_swap_page() as well as shmem. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Stephane Marchesin <marcheu@chromium.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Rob Clark <rob.clark@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocksBartlomiej Zolnierkiewicz2012-05-304-28/+150
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When MIGRATE_UNMOVABLE pages are freed from MIGRATE_UNMOVABLE type pageblock (and some MIGRATE_MOVABLE pages are left in it) waiting until an allocation takes ownership of the block may take too long. The type of the pageblock remains unchanged so the pageblock cannot be used as a migration target during compaction. Fix it by: * Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE], and COMPACT_SYNC) and then converting sync field in struct compact_control to use it. * Adding nr_pageblocks_skipped field to struct compact_control and tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type. If COMPACT_ASYNC_MOVABLE mode compaction ran fully in try_to_compact_pages() (COMPACT_COMPLETE) it implies that there is not a suitable page for allocation. In this case then check how if there were enough MIGRATE_UNMOVABLE pageblocks to try a second pass in COMPACT_ASYNC_UNMOVABLE mode. * Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on finding PageBuddy pages, page_count(page) == 0 or PageLRU pages. If all pages within the MIGRATE_UNMOVABLE pageblock are in one of those three sets change the whole pageblock type to MIGRATE_MOVABLE. My particular test case (on a ARM EXYNOS4 device with 512 MiB, which means 131072 standard 4KiB pages in 'Normal' zone) is to: - allocate 120000 pages for kernel's usage - free every second page (60000 pages) of memory just allocated - allocate and use 60000 pages from user space - free remaining 60000 pages of kernel memory (now we have fragmented memory occupied mostly by user space pages) - try to allocate 100 order-9 (2048 KiB) pages for kernel's usage The results: - with compaction disabled I get 11 successful allocations - with compaction enabled - 14 successful allocations - with this patch I'm able to get all 100 successful allocations NOTE: If we can make kswapd aware of order-0 request during compaction, we can enhance kswapd with changing mode to COMPACT_ASYNC_FULL (COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE). Please see the following thread: http://marc.info/?l=linux-mm&m=133552069417068&w=2 [minchan@kernel.org: minor cleanups] Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: remove sparsemem allocation details from the bootmem allocatorJohannes Weiner2012-05-304-60/+12Star
| | | | | | | | | | | | | | | | | | | alloc_bootmem_section() derives allocation area constraints from the specified sparsemem section. This is a bit specific for a generic memory allocator like bootmem, though, so move it over to sparsemem. As __alloc_bootmem_node_nopanic() already retries failed allocations with relaxed area constraints, the fallback code in sparsemem.c can be removed and the code becomes a bit more compact overall. [akpm@linux-foundation.org: fix build] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: bootmem: pass pgdat instead of pgdat->bdata down the stackJohannes Weiner2012-05-301-10/+10
| | | | | | | | | | | | | | Pass down the node descriptor instead of the more specific bootmem node descriptor down the call stack, like nobootmem does, when there is no good reason for the two to be different. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: nobootmem: unify allocation policy of (non-)panicking node allocationsJohannes Weiner2012-05-301-52/+54
| | | | | | | | | | | | | | | | | | While the panicking node-specific allocation function tries to satisfy node+goal, goal, node, anywhere, the non-panicking function still does node+goal, goal, anywhere. Make it simpler: define the panicking version in terms of the non-panicking one, like the node-agnostic interface, so they always behave the same way apart from how to deal with allocation failure. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: nobootmem: panic on node-specific allocation failureJohannes Weiner2012-05-301-4/+16
| | | | | | | | | | | | | __alloc_bootmem_node and __alloc_bootmem_low_node documentation claims the functions panic on allocation failure. Do it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: bootmem: unify allocation policy of (non-)panicking node allocationsJohannes Weiner2012-05-301-20/+24
| | | | | | | | | | | | | | | | | | While the panicking node-specific allocation function tries to satisfy node+goal, goal, node, anywhere, the non-panicking function still does node+goal, goal, anywhere. Make it simpler: define the panicking version in terms of the non-panicking one, like the node-agnostic interface, so they always behave the same way apart from how to deal with allocation failure. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: bootmem: allocate in order node+goal, goal, node, anywhereJohannes Weiner2012-05-301-1/+13
| | | | | | | | | | | | | | Match the nobootmem version of __alloc_bootmem_node. Try to satisfy both the node and the goal, then just the goal, then just the node, then allocate anywhere before panicking. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: bootmem: split out goal-to-node mapping from goal droppingJohannes Weiner2012-05-301-2/+15
| | | | | | | | | | | | | | | Matching the desired goal to the right node is one thing, dropping the goal when it can not be satisfied is another. Split this into separate functions so that subsequent patches can use the node-finding but drop and handle the goal fallback on their own terms. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: bootmem: rename alloc_bootmem_core to alloc_bootmem_bdataJohannes Weiner2012-05-301-7/+7
| | | | | | | | | | | | | Callsites need to provide a bootmem_data_t *, make the naming more descriptive. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: bootmem: remove redundant offset check when finally freeing bootmemJohannes Weiner2012-05-301-1/+1
| | | | | | | | | | | | | | | | | | When bootmem releases an unaligned BITS_PER_LONG pages chunk of memory to the page allocator, it checks the bitmap if there are still unreserved pages in the chunk (set bits), but also if the offset in the chunk indicates BITS_PER_LONG loop iterations already. But since the consulted bitmap is only a one-word-excerpt of the full per-node bitmap, there can not be more than BITS_PER_LONG bits set in it. The additional offset check is unnecessary. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: bootmem: fix checking the bitmap when finally freeing bootmemGavin Shan2012-05-301-0/+1
| | | | | | | | | | | | | | | | | | | | | When bootmem releases an unaligned chunk of memory at the beginning of a node to the page allocator, it iterates from that unaligned PFN but checks an aligned word of the page bitmap. The checked bits do not correspond to the PFNs and, as a result, reserved pages can be freed. Properly shift the bitmap word so that the lowest bit corresponds to the starting PFN before entering the freeing loop. This bug has been around since commit 41546c17418f ("bootmem: clean up free_all_bootmem_core") (2.6.27) without known reports. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_alloc.c: remove pageblock_default_order()Andrew Morton2012-05-301-18/+15Star
| | | | | | | | | | | | | | | | | This has always been broken: one version takes an unsigned int and the other version takes no arguments. This bug was hidden because one version of set_pageblock_order() was a macro which doesn't evaluate its argument. Simplify it all and remove pageblock_default_order() altogether. Reported-by: rajman mekaco <rajman.mekaco@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: move is_vma_temporary_stack() declaration to huge_mm.hAlex Shi2012-05-302-2/+2
| | | | | | | | | | | | | | | | | | | | | | | When transparent_hugepage_enabled() is used outside mm/, such as in arch/x86/xx/tlb.c: + if (!cpu_has_invlpg || vma->vm_flags & VM_HUGETLB + || transparent_hugepage_enabled(vma)) { + flush_tlb_mm(vma->vm_mm); is_vma_temporary_stack() isn't referenced in huge_mm.h, so it has compile errors: arch/x86/mm/tlb.c: In function `flush_tlb_range': arch/x86/mm/tlb.c:324:4: error: implicit declaration of function `is_vma_temporary_stack' [-Werror=implicit-function-declaration] Since is_vma_temporay_stack() is just used in rmap.c and huge_memory.c, it is better to move it to huge_mm.h from rmap.h to avoid such errors. Signed-off-by: Alex Shi <alex.shi@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* tools/vm/page-types.c: cleanupsUlrich Drepper2012-05-301-11/+11
| | | | | | | | | | | | | | | | Compiling page-type.c with a recent compiler produces many warnings, mostly related to signed/unsigned comparisons. This patch cleans up most of them. One remaining warning is about an unused parameter. The <compiler.h> file doesn't define a __unused macro (or the like) yet. This can be addressed later. Signed-off-by: Ulrich Drepper <drepper@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kbuild: install kernel-page-flags.hUlrich Drepper2012-05-303-27/+6Star
| | | | | | | | | | | | | | | | Programs using /proc/kpageflags need to know about the various flags. The <linux/kernel-page-flags.h> provides them and the comments in the file indicate that it is supposed to be used by user-level code. But the file is not installed. Install the headers and mark the unstable flags as out-of-bounds. The page-type tool is also adjusted to not duplicate the definitions Signed-off-by: Ulrich Drepper <drepper@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: print physical addresses consistently with other parts of kernelBjorn Helgaas2012-05-302-14/+19
| | | | | | | | | | | | | | | | | | | | | Print physical address info in a style consistent with the %pR style used elsewhere in the kernel. For example: -Zone PFN ranges: +Zone ranges: - DMA32 0x00000010 -> 0x00100000 + DMA32 [mem 0x00010000-0xffffffff] - Normal 0x00100000 -> 0x01080000 + Normal [mem 0x100000000-0x107fffffff] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* swiotlb: print physical addresses consistently with other parts of kernelBjorn Helgaas2012-05-301-5/+3Star
| | | | | | | | | | | | | | | | | | Print swiotlb info in a style consistent with the %pR style used elsewhere in the kernel. For example: -Placing 64MB software IO TLB between ffff88007a662000 - ffff88007e662000 -software IO TLB at phys 0x7a662000 - 0x7e662000 +software IO TLB [mem 0x7a662000-0x7e661fff] (64MB) mapped at [ffff88007a662000-ffff88007e661fff] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* x86: print physical addresses consistently with other parts of kernelBjorn Helgaas2012-05-307-62/+63
| | | | | | | | | | | | | | | | | | | | | | | Print physical address info in a style consistent with the %pR style used elsewhere in the kernel. For example: -found SMP MP-table at [ffff8800000fce90] fce90 +found SMP MP-table at [mem 0x000fce90-0x000fce9f] mapped at [ffff8800000fce90] -initial memory mapped : 0 - 20000000 +initial memory mapped: [mem 0x00000000-0x1fffffff] -Base memory trampoline at [ffff88000009c000] 9c000 size 8192 +Base memory trampoline [mem 0x0009c000-0x0009dfff] mapped at [ffff88000009c000] -SRAT: Node 0 PXM 0 0-80000000 +SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* x86: print e820 physical addresses consistently with other parts of kernelBjorn Helgaas2012-05-301-26/+27
| | | | | | | | | | | | | | | | | | | | | | | Print physical address info in a style consistent with the %pR style used elsewhere in the kernel. For example: -BIOS-provided physical RAM map: +e820: BIOS-provided physical RAM map: - BIOS-e820: 0000000000000100 - 000000000009e000 (usable) +BIOS-e820: [mem 0x0000000000000100-0x000000000009dfff] usable -Allocating PCI resources starting at 90000000 (gap: 90000000:6ed1c000) +e820: [mem 0x90000000-0xfed1bfff] available for PCI devices -reserve RAM buffer: 000000000009e000 - 000000000009ffff +e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bug: completely remove code generated by disabled VM_BUG_ON()Konstantin Khlebnikov2012-05-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Even if CONFIG_DEBUG_VM=n gcc genereates code for some VM_BUG_ON() for example VM_BUG_ON(!PageCompound(page) || !PageHead(page)); in do_huge_pmd_wp_page() generates 114 bytes of code. But they mostly disappears when I split this VM_BUG_ON into two: -VM_BUG_ON(!PageCompound(page) || !PageHead(page)); +VM_BUG_ON(!PageCompound(page)); +VM_BUG_ON(!PageHead(page)); weird... but anyway after this patch code disappears completely. add/remove: 0/0 grow/shrink: 7/97 up/down: 135/-1784 (-1649) Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bug: introduce BUILD_BUG_ON_INVALID() macroKonstantin Khlebnikov2012-05-301-0/+7
| | | | | | | | | | | | | | | | | | | | | | Sometimes we want to check some expressions correctness at compile time. "(void)(e);" or "if (e);" can be dangerous if the expression has side-effects, and gcc sometimes generates a lot of code, even if the expression has no effect. This patch introduces macro BUILD_BUG_ON_INVALID() for such checks, it forces a compilation error if expression is invalid without any extra code. [Cast to "long" required because sizeof does not work for bit-fields.] Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Cross Memory Attach: make it KconfigurableChristopher Yeoh2012-05-302-2/+15
| | | | | | | | | Add a Kconfig option to allow people who don't want cross memory attach to not have it included in their build. Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Documentation: memcg: future proof hierarchical statistics documentationJohannes Weiner2012-05-301-11/+4Star
| | | | | | | | | | | | | | | | | The hierarchical versions of per-memcg counters in memory.stat are all calculated the same way and are all named total_<counter>. Documenting the pattern is easier for maintenance than listing each counter twice. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Ying Han <yinghan@google.com> Randy Dunlap <rdunlap@xenotime.net> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, thp: drop page_table_lock to uncharge memcg pagesDavid Rientjes2012-05-301-0/+2
| | | | | | | | | | | | | mm->page_table_lock is hotly contested for page fault tests and isn't necessary to do mem_cgroup_uncharge_page() in do_huge_pmd_wp_page(). Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>