summaryrefslogtreecommitdiffstats
path: root/mm
Commit message (Collapse)AuthorAgeFilesLines
* mm/vmscan.c: avoid possible deadlock caused by too_many_isolated()Fengguang Wu2012-12-191-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Neil found that if too_many_isolated() returns true while performing direct reclaim we can end up waiting for other threads to complete their direct reclaim. If those threads are allowed to enter the FS or IO to free memory, but this thread is not, then it is possible that those threads will be waiting on this thread and so we get a circular deadlock. some task enters direct reclaim with GFP_KERNEL => too_many_isolated() false => vmscan and run into dirty pages => pageout() => take some FS lock => fs/block code does GFP_NOIO allocation => enter direct reclaim again => too_many_isolated() true => waiting for others to progress, however the other tasks may be circular waiting for the FS lock.. The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher priority than normal ones, by lowering the throttle threshold for the latter. Allowing ~1/8 isolated pages in normal is large enough. For example, for a 1GB LRU list, that's ~128MB isolated pages, or 1k blocked tasks (each isolates 32 4KB pages), or 64 blocked tasks per logical CPU (assuming 16 logical CPUs per NUMA node). So it's not likely some CPU goes idle waiting (when it could make progress) because of this limit: there are much more sleeping reclaim tasks than the number of CPU, so the task may well be blocked by some low level queue/lock anyway. Now !GFP_IOFS reclaims won't be waiting for GFP_IOFS reclaims to progress. They will be blocked only when there are too many concurrent !GFP_IOFS reclaims, however that's very unlikely because the IO-less direct reclaims is able to progress much more faster, and they won't deadlock each other. The threshold is raised high enough for them, so that there can be sufficient parallel progress of !GFP_IOFS reclaims. [akpm@linux-foundation.org: tweak comment] Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Torsten Kaiser <just.for.lkml@googlemail.com> Tested-by: NeilBrown <neilb@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vmscan: comment too_many_isolated()Fengguang Wu2012-12-191-1/+5
| | | | | | | | | | | | | Comment "Why it's doing so" rather than "What it does" as proposed by Andrew Morton. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/kmemleak.c: remove obsolete simple_strtoulAbhijit Pawar2012-12-191-1/+2
| | | | | | | | | Replace the obsolete simple_strtoul() with kstrtoul(). Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memory_hotplug.c: improve commentsTang Chen2012-12-191-6/+12
| | | | | | | | | | Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/hugetlb: create hugetlb cgroup file in hugetlb_initJianguo Wu2012-12-192-12/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Build kernel with CONFIG_HUGETLBFS=y,CONFIG_HUGETLB_PAGE=y and CONFIG_CGROUP_HUGETLB=y, then specify hugepagesz=xx boot option, system will fail to boot. This failure is caused by following code path: setup_hugepagesz hugetlb_add_hstate hugetlb_cgroup_file_init cgroup_add_cftypes kzalloc <--slab is *not available* yet For this path, slab is not available yet, so memory allocated will be failed, and cause WARN_ON() in hugetlb_cgroup_file_init(). So I move hugetlb_cgroup_file_init() into hugetlb_init(). [akpm@linux-foundation.org: tweak coding-style, remove pointless __init on inlined function] [akpm@linux-foundation.org: fix warning] Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/mprotect.c: coding-style cleanupsAndrew Morton2012-12-191-14/+16
| | | | | | | A few gremlins have recently crept in. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* slub: drop mutex before deleting sysfs entryGlauber Costa2012-12-191-1/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sasha Levin recently reported a lockdep problem resulting from the new attribute propagation introduced by kmemcg series. In short, slab_mutex will be called from within the sysfs attribute store function. This will create a dependency, that will later be held backwards when a cache is destroyed - since destruction occurs with the slab_mutex held, and then calls in to the sysfs directory removal function. In this patch, I propose to adopt a strategy close to what __kmem_cache_create does before calling sysfs_slab_add, and release the lock before the call to sysfs_slab_remove. This is pretty much the last operation in the kmem_cache_shutdown() path, so we could do better by splitting this and moving this call alone to later on. This will fit nicely when sysfs handling is consistent between all caches, but will look weird now. Lockdep info: ====================================================== [ INFO: possible circular locking dependency detected ] 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Tainted: G W ------------------------------------------------------- trinity-child13/6961 is trying to acquire lock: (s_active#43){++++.+}, at: sysfs_addrm_finish+0x31/0x60 but task is already holding lock: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x22/0xe0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (slab_mutex){+.+.+.}: lock_acquire+0x1aa/0x240 __mutex_lock_common+0x59/0x5a0 mutex_lock_nested+0x3f/0x50 slab_attr_store+0xde/0x110 sysfs_write_file+0xfa/0x150 vfs_write+0xb0/0x180 sys_pwrite64+0x60/0xb0 tracesys+0xe1/0xe6 -> #0 (s_active#43){++++.+}: __lock_acquire+0x14df/0x1ca0 lock_acquire+0x1aa/0x240 sysfs_deactivate+0x122/0x1a0 sysfs_addrm_finish+0x31/0x60 sysfs_remove_dir+0x89/0xd0 kobject_del+0x16/0x40 __kmem_cache_shutdown+0x40/0x60 kmem_cache_destroy+0x40/0xe0 mon_text_release+0x78/0xe0 __fput+0x122/0x2d0 ____fput+0x9/0x10 task_work_run+0xbe/0x100 do_exit+0x432/0xbd0 do_group_exit+0x84/0xd0 get_signal_to_deliver+0x81d/0x930 do_signal+0x3a/0x950 do_notify_resume+0x3e/0x90 int_signal+0x12/0x17 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(slab_mutex); lock(s_active#43); lock(slab_mutex); lock(s_active#43); *** DEADLOCK *** 2 locks held by trinity-child13/6961: #0: (mon_lock){+.+.+.}, at: mon_text_release+0x25/0xe0 #1: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x22/0xe0 stack backtrace: Pid: 6961, comm: trinity-child13 Tainted: G W 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Call Trace: print_circular_bug+0x1fb/0x20c __lock_acquire+0x14df/0x1ca0 lock_acquire+0x1aa/0x240 sysfs_deactivate+0x122/0x1a0 sysfs_addrm_finish+0x31/0x60 sysfs_remove_dir+0x89/0xd0 kobject_del+0x16/0x40 __kmem_cache_shutdown+0x40/0x60 kmem_cache_destroy+0x40/0xe0 mon_text_release+0x78/0xe0 __fput+0x122/0x2d0 ____fput+0x9/0x10 task_work_run+0xbe/0x100 do_exit+0x432/0xbd0 do_group_exit+0x84/0xd0 get_signal_to_deliver+0x81d/0x930 do_signal+0x3a/0x950 do_notify_resume+0x3e/0x90 int_signal+0x12/0x17 Signed-off-by: Glauber Costa <glommer@parallels.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: add comments clarifying aspects of cache attribute propagationGlauber Costa2012-12-192-4/+18
| | | | | | | | | | | | | | | | | | | | | | This patch clarifies two aspects of cache attribute propagation. First, the expected context for the for_each_memcg_cache macro in memcontrol.h. The usages already in the codebase are safe. In mm/slub.c, it is trivially safe because the lock is acquired right before the loop. In mm/slab.c, it is less so: the lock is acquired by an outer function a few steps back in the stack, so a VM_BUG_ON() is added to make sure it is indeed safe. A comment is also added to detail why we are returning the value of the parent cache and ignoring the children's when we propagate the attributes. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* slub: slub-specific propagation changesGlauber Costa2012-12-191-1/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SLUB allows us to tune a particular cache behavior with sysfs-based tunables. When creating a new memcg cache copy, we'd like to preserve any tunables the parent cache already had. This can be done by tapping into the store attribute function provided by the allocator. We of course don't need to mess with read-only fields. Since the attributes can have multiple types and are stored internally by sysfs, the best strategy is to issue a ->show() in the root cache, and then ->store() in the memcg cache. The drawback of that, is that sysfs can allocate up to a page in buffering for show(), that we are likely not to need, but also can't guarantee. To avoid always allocating a page for that, we can update the caches at store time with the maximum attribute size ever stored to the root cache. We will then get a buffer big enough to hold it. The corolary to this, is that if no stores happened, nothing will be propagated. It can also happen that a root cache has its tunables updated during normal system operation. In this case, we will propagate the change to all caches that are already active. [akpm@linux-foundation.org: tweak code to avoid __maybe_unused] Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* slab: propagate tunable valuesGlauber Costa2012-12-194-10/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SLAB allows us to tune a particular cache behavior with tunables. When creating a new memcg cache copy, we'd like to preserve any tunables the parent cache already had. This could be done by an explicit call to do_tune_cpucache() after the cache is created. But this is not very convenient now that the caches are created from common code, since this function is SLAB-specific. Another method of doing that is taking advantage of the fact that do_tune_cpucache() is always called from enable_cpucache(), which is called at cache initialization. We can just preset the values, and then things work as expected. It can also happen that a root cache has its tunables updated during normal system operation. In this case, we will propagate the change to all caches that are already active. This change will require us to move the assignment of root_cache in memcg_params a bit earlier. We need this to be already set - which memcg_kmem_register_cache will do - when we reach __kmem_cache_create() Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: aggregate memcg cache values in slabinfoGlauber Costa2012-12-193-5/+96
| | | | | | | | | | | | | | | | | | | | | | | | | | When we create caches in memcgs, we need to display their usage information somewhere. We'll adopt a scheme similar to /proc/meminfo, with aggregate totals shown in the global file, and per-group information stored in the group itself. For the time being, only reads are allowed in the per-group cache. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg/sl[au]b: shrink dead cachesGlauber Costa2012-12-191-3/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This means that when we destroy a memcg cache that happened to be empty, those caches may take a lot of time to go away: removing the memcg reference won't destroy them - because there are pending references, and the empty pages will stay there, until a shrinker is called upon for any reason. In this patch, we will call kmem_cache_shrink() for all dead caches that cannot be destroyed because of remaining pages. After shrinking, it is possible that it could be freed. If this is not the case, we'll schedule a lazy worker to keep trying. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg/sl[au]b: track all the memcg children of a kmem_cacheGlauber Costa2012-12-192-2/+50
| | | | | | | | | | | | | | | | | | | | | | | | This enables us to remove all the children of a kmem_cache being destroyed, if for example the kernel module it's being used in gets unloaded. Otherwise, the children will still point to the destroyed parent. Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: destroy memcg cachesGlauber Costa2012-12-194-1/+95
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement destruction of memcg caches. Right now, only caches where our reference counter is the last remaining are deleted. If there are any other reference counters around, we just leave the caches lying around until they go away. When that happens, a destruction function is called from the cache code. Caches are only destroyed in process context, so we queue them up for later processing in the general case. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sl[au]b: allocate objects from memcg cacheGlauber Costa2012-12-193-4/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | We are able to match a cache allocation to a particular memcg. If the task doesn't change groups during the allocation itself - a rare event, this will give us a good picture about who is the first group to touch a cache page. This patch uses the now available infrastructure by calling memcg_kmem_get_cache() before all the cache allocations. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sl[au]b: always get the cache from its page in kmem_cache_free()Glauber Costa2012-12-194-14/+48
| | | | | | | | | | | | | | | | | | | | | | | struct page already has this information. If we start chaining caches, this information will always be more trustworthy than whatever is passed into the function. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: skip memcg kmem allocations in specified code regionsGlauber Costa2012-12-191-3/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | Create a mechanism that skip memcg allocations during certain pieces of our core code. It basically works in the same way as preempt_disable()/preempt_enable(): By marking a region under which all allocations will be accounted to the root memcg. We need this to prevent races in early cache creation, when we allocate data using caches that are not necessarily created already. Signed-off-by: Glauber Costa <glommer@parallels.com> yCc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: infrastructure to match an allocation to the right cacheGlauber Costa2012-12-191-0/+217
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The page allocator is able to bind a page to a memcg when it is allocated. But for the caches, we'd like to have as many objects as possible in a page belonging to the same cache. This is done in this patch by calling memcg_kmem_get_cache in the beginning of every allocation function. This function is patched out by static branches when kernel memory controller is not being used. It assumes that the task allocating, which determines the memcg in the page allocator, belongs to the same cgroup throughout the whole process. Misaccounting can happen if the task calls memcg_kmem_get_cache() while belonging to a cgroup, and later on changes. This is considered acceptable, and should only happen upon task migration. Before the cache is created by the memcg core, there is also a possible imbalance: the task belongs to a memcg, but the cache being allocated from is the global cache, since the child cache is not yet guaranteed to be ready. This case is also fine, since in this case the GFP_KMEMCG will not be passed and the page allocator will not attempt any cgroup accounting. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: allocate memory for memcg caches whenever a new memcg appearsGlauber Costa2012-12-192-16/+219
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Every cache that is considered a root cache (basically the "original" caches, tied to the root memcg/no-memcg) will have an array that should be large enough to store a cache pointer per each memcg in the system. Theoreticaly, this is as high as 1 << sizeof(css_id), which is currently in the 64k pointers range. Most of the time, we won't be using that much. What goes in this patch, is a simple scheme to dynamically allocate such an array, in order to minimize memory usage for memcg caches. Because we would also like to avoid allocations all the time, at least for now, the array will only grow. It will tend to be big enough to hold the maximum number of kmem-limited memcgs ever achieved. We'll allocate it to be a minimum of 64 kmem-limited memcgs. When we have more than that, we'll start doubling the size of this array every time the limit is reached. Because we are only considering kmem limited memcgs, a natural point for this to happen is when we write to the limit. At that point, we already have set_limit_mutex held, so that will become our natural synchronization mechanism. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* slab/slub: consider a memcg parameter in kmem_create_cacheGlauber Costa2012-12-194-17/+118
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow a memcg parameter to be passed during cache creation. When the slub allocator is being used, it will only merge caches that belong to the same memcg. We'll do this by scanning the global list, and then translating the cache to a memcg-specific cache Default function is created as a wrapper, passing NULL to the memcg version. We only merge caches that belong to the same memcg. A helper is provided, memcg_css_id: because slub needs a unique cache name for sysfs. Since this is visible, but not the canonical location for slab data, the cache name is not used, the css_id should suffice. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* slab: annotate on-slab caches nodelist locksGlauber Costa2012-12-191-1/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently provide lockdep annotation for kmalloc caches, and also caches that have SLAB_DEBUG_OBJECTS enabled. The reason for this is that we can quite frequently nest in the l3->list_lock lock, which is not something trivial to avoid. My proposal with this patch, is to extend this to caches whose slab management object lives within the slab as well ("on_slab"). The need for this arose in the context of testing kmemcg-slab patches. With such patchset, we can have per-memcg kmalloc caches. So the same path that led to nesting between kmalloc caches will could then lead to in-memcg nesting. Because they are not annotated, lockdep will trigger. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* slab/slub: struct memcg_paramsGlauber Costa2012-12-191-0/+13
| | | | | | | | | | | | | | | | | | | | | | For the kmem slab controller, we need to record some extra information in the kmem_cache structure. Signed-off-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Suleiman Souhlal <suleiman@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: execute the whole memcg freeing in free_worker()Glauber Costa2012-12-191-32/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A lot of the initialization we do in mem_cgroup_create() is done with softirqs enabled. This include grabbing a css id, which holds &ss->id_lock->rlock, and the per-zone trees, which holds rtpz->lock->rlock. All of those signal to the lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This means that the freeing of memcg structure must happen in a compatible context, otherwise we'll get a deadlock, like the one below, caught by lockdep: free_accounted_pages+0x47/0x4c free_task+0x31/0x5c __put_task_struct+0xc2/0xdb put_task_struct+0x1e/0x22 delayed_put_task_struct+0x7a/0x98 __rcu_process_callbacks+0x269/0x3df rcu_process_callbacks+0x31/0x5b __do_softirq+0x122/0x277 This usage pattern could not be triggered before kmem came into play. With the introduction of kmem stack handling, it is possible that we call the last mem_cgroup_put() from the task destructor, which is run in an rcu callback. Such callbacks are run with softirqs disabled, leading to the offensive usage pattern. In general, we have little, if any, means to guarantee in which context the last memcg_put will happen. The best we can do is test it and try to make sure no invalid context releases are happening. But as we add more code to memcg, the possible interactions grow in number and expose more ways to get context conflicts. One thing to keep in mind, is that part of the freeing process is already deferred to a worker, such as vfree(), that can only be called from process context. For the moment, the only two functions we really need moved away are: * free_css_id(), and * mem_cgroup_remove_from_trees(). But because the later accesses per-zone info, free_mem_cgroup_per_zone_info() needs to be moved as well. With that, we are left with the per_cpu stats only. Better move it all. Signed-off-by: Glauber Costa <glommer@parallels.com> Tested-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: allow a memcg with kmem charges to be destructedGlauber Costa2012-12-191-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because the ultimate goal of the kmem tracking in memcg is to track slab pages as well, we can't guarantee that we'll always be able to point a page to a particular process, and migrate the charges along with it - since in the common case, a page will contain data belonging to multiple processes. Because of that, when we destroy a memcg, we only make sure the destruction will succeed by discounting the kmem charges from the user charges when we try to empty the cgroup. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: use static branches when code not in useGlauber Costa2012-12-191-4/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can use static branches to patch the code in or out when not used. Because the _ACTIVE bit on kmem_accounted is only set after the increment is done, we guarantee that the root memcg will always be selected for kmem charges until all call sites are patched (see memcg_kmem_enabled). This guarantees that no mischarges are applied. Static branch decrement happens when the last reference count from the kmem accounting in memcg dies. This will only happen when the charges drop down to 0. When that happens, we need to disable the static branch only on those memcgs that enabled it. To achieve this, we would be forced to complicate the code by keeping track of which memcgs were the ones that actually enabled limits, and which ones got it from its parents. It is a lot simpler just to do static_key_slow_inc() on every child that is accounted. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: kmem accounting lifecycle managementGlauber Costa2012-12-191-7/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because kmem charges can outlive the cgroup, we need to make sure that we won't free the memcg structure while charges are still in flight. For reviewing simplicity, the charge functions will issue mem_cgroup_get() at every charge, and mem_cgroup_put() at every uncharge. This can get expensive, however, and we can do better. mem_cgroup_get() only really needs to be issued once: when the first limit is set. In the same spirit, we only need to issue mem_cgroup_put() when the last charge is gone. We'll need an extra bit in kmem_account_flags for that: KMEM_ACCOUNTED_DEAD. it will be set when the cgroup dies, if there are charges in the group. If there aren't, we can proceed right away. Our uncharge function will have to test that bit every time the charges drop to 0. Because that is not the likely output of res_counter_uncharge, this should not impose a big hit on us: it is certainly much better than a reference count decrease at every operation. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: allocate kernel pages to the right memcgGlauber Costa2012-12-191-0/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a process tries to allocate a page with the __GFP_KMEMCG flag, the page allocator will call the corresponding memcg functions to validate the allocation. Tasks in the root memcg can always proceed. To avoid adding markers to the page - and a kmem flag that would necessarily follow, as much as doing page_cgroup lookups for no reason, whoever is marking its allocations with __GFP_KMEMCG flag is responsible for telling the page allocator that this is such an allocation at free_pages() time. This is done by the invocation of __free_accounted_pages() and free_accounted_pages(). Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: kmem controller infrastructureGlauber Costa2012-12-191-0/+170
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce infrastructure for tracking kernel memory pages to a given memcg. This will happen whenever the caller includes the flag __GFP_KMEMCG flag, and the task belong to a memcg other than the root. In memcontrol.h those functions are wrapped in inline acessors. The idea is to later on, patch those with static branches, so we don't incur any overhead when no mem cgroups with limited kmem are being used. Users of this functionality shall interact with the memcg core code through the following functions: memcg_kmem_newpage_charge: will return true if the group can handle the allocation. At this point, struct page is not yet allocated. memcg_kmem_commit_charge: will either revert the charge, if struct page allocation failed, or embed memcg information into page_cgroup. memcg_kmem_uncharge_page: called at free time, will revert the charge. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: kmem accounting basic infrastructureGlauber Costa2012-12-191-3/+123
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the basic infrastructure for the accounting of kernel memory. To control that, the following files are created: * memory.kmem.usage_in_bytes * memory.kmem.limit_in_bytes * memory.kmem.failcnt * memory.kmem.max_usage_in_bytes They have the same meaning of their user memory counterparts. They reflect the state of the "kmem" res_counter. Per cgroup kmem memory accounting is not enabled until a limit is set for the group. Once the limit is set the accounting cannot be disabled for that group. This means that after the patch is applied, no behavioral changes exists for whoever is still using memcg to control their memory usage, until memory.kmem.limit_in_bytes is set for the first time. We always account to both user and kernel resource_counters. This effectively means that an independent kernel limit is in place when the limit is set to a lower value than the user memory. A equal or higher value means that the user limit will always hit first, meaning that kmem is effectively unlimited. People who want to track kernel memory but not limit it, can set this limit to a very high number (like RESOURCE_MAX - 1page - that no one will ever hit, or equal to the user memory) [akpm@linux-foundation.org: MEMCG_MMEM only works with slab and slub] Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: change defines to an enumGlauber Costa2012-12-191-10/+16
| | | | | | | | | | | | | | | | | | | | | | This is just a cleanup patch for clarity of expression. In earlier submissions, people asked it to be in a separate patch, so here it is. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: reclaim when more than one page neededSuleiman Souhlal2012-12-191-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mem_cgroup_do_charge() was written before kmem accounting, and expects three cases: being called for 1 page, being called for a stock of 32 pages, or being called for a hugepage. If we call for 2 or 3 pages (and both the stack and several slabs used in process creation are such, at least with the debug options I had), it assumed it's being called for stock and just retried without reclaiming. Fix that by passing down a minsize argument in addition to the csize. And what to do about that (csize == PAGE_SIZE && ret) retry? If it's needed at all (and presumably is since it's there, perhaps to handle races), then it should be extended to more than PAGE_SIZE, yet how far? And should there be a retry count limit, of what? For now retry up to COSTLY_ORDER (as page_alloc.c does) and make sure not to do it if __GFP_NORETRY. v4: fixed nr pages calculation pointed out by Christoph Lameter. Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: make it possible to use the stock for more than one pageSuleiman Souhlal2012-12-191-10/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently have a percpu stock cache scheme that charges one page at a time from memcg->res, the user counter. When the kernel memory controller comes into play, we'll need to charge more than that. This is because kernel memory allocations will also draw from the user counter, and can be bigger than a single page, as it is the case with the stack (usually 2 pages) or some higher order slabs. [glommer@parallels.com: added a changelog ] Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memory-hotplug: document and enable CONFIG_MOVABLE_NODETang Chen2012-12-191-1/+12
| | | | | | | | | | | | | | | | | | Add help info for CONFIG_MOVABLE_NODE and permit its selection. This option allows the user to online all memory of a node as movable memory. So that the whole node can be hotplugged. Users who don't use the hotplug feature are also fine with this option on since they won't online memory as movable. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> [akpm@linux-foundation.org: tweak help text] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_alloc.c: remove duplicate checkGavin Shan2012-12-191-2/+1Star
| | | | | | | | | | | | | | | | | While allocating pages using buddy allocator, the compound page is probably split up to free pages. Under these circumstances, the compound page should be destroyed by destroy_compound_page(). However, there is a duplicate check to judge if the page is compound. Remove the duplicate check since the compound_order() returns 0 when the page doesn't have PG_head set in destroy_compound_page(). That is to say, destroy_compound_page() needn't check PageHead(). Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'slab/for-linus' of ↵Linus Torvalds2012-12-185-484/+387Star
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull SLAB changes from Pekka Enberg: "This contains preparational work from Christoph Lameter and Glauber Costa for SLAB memcg and cleanups and improvements from Ezequiel Garcia and Joonsoo Kim. Please note that the SLOB cleanup commit from Arnd Bergmann already appears in your tree but I had also merged it myself which is why it shows up in the shortlog." * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: mm/sl[aou]b: Common alignment code slab: Use the new create_boot_cache function to simplify bootstrap slub: Use statically allocated kmem_cache boot structure for bootstrap mm, sl[au]b: create common functions for boot slab creation slab: Simplify bootstrap slub: Use correct cpu_slab on dead cpu mm: fix slab.c kernel-doc warnings mm/slob: use min_t() to compare ARCH_SLAB_MINALIGN slab: Ignore internal flags in cache creation mm/slob: Use free_page instead of put_page for page-size kmalloc allocations mm/sl[aou]b: Move common kmem_cache_size() to slab.h mm/slob: Use object_size field in kmem_cache_size() mm/slob: Drop usage of page->private for storing page-sized allocations slub: Commonize slab_cache field in struct page sl[au]b: Process slabinfo_show in common code mm/sl[au]b: Move print_slabinfo_header to slab_common.c mm/sl[au]b: Move slabinfo processing to slab_common.c slub: remove one code path and reduce lock contention in __slab_free()
| * Merge branch 'slab/next' into slab/for-linusPekka Enberg2012-12-185-326/+221Star
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix up a trivial merge conflict with commit baaf1dd ("mm/slob: use min_t() to compare ARCH_SLAB_MINALIGN") that did not go through the slab tree. Conflicts: mm/slob.c Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * mm/sl[aou]b: Common alignment codeChristoph Lameter2012-12-115-69/+34Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extract the code to do object alignment from the allocators. Do the alignment calculations in slab_common so that the __kmem_cache_create functions of the allocators do not have to deal with alignment. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * slab: Use the new create_boot_cache function to simplify bootstrapChristoph Lameter2012-12-111-33/+16Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simplify setup and reduce code in kmem_cache_init(). This allows us to get rid of initarray_cache as well as the manual setup code for the kmem_cache and kmem_cache_node arrays during bootstrap. We introduce a new bootstrap state "PARTIAL" for slab that signals the creation of a kmem_cache boot cache. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * slub: Use statically allocated kmem_cache boot structure for bootstrapChristoph Lameter2012-12-111-47/+20Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simplify bootstrap by statically allocated two kmem_cache structures. These are freed after bootup is complete. Allows us to no longer worry about calculations of sizes of kmem_cache structures during bootstrap. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * mm, sl[au]b: create common functions for boot slab creationChristoph Lameter2012-12-114-66/+60Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use a special function to create kmalloc caches and use that function in SLAB and SLUB. Acked-by: Joonsoo Kim <js1304@gmail.com> Reviewed-by: Glauber Costa <glommer@parallels.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * slab: Simplify bootstrapChristoph Lameter2012-12-111-8/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The nodelists field in kmem_cache is pointing to the first unused object in the array field when bootstrap is complete. A problem with the current approach is that the statically sized kmem_cache structure use on boot can only contain NR_CPUS entries. If the number of nodes plus the number of cpus is greater then we would overwrite memory following the kmem_cache_boot definition. Increase the size of the array field to ensure that also the node pointers fit into the array field. Once we do that we no longer need the kmem_cache_nodelists array and we can then also use that structure elsewhere. Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * slub: Use correct cpu_slab on dead cpuChristoph Lameter2012-12-111-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | Pass a kmem_cache_cpu pointer into unfreeze partials so that a different kmem_cache_cpu structure than the local one can be specified. Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * mm: fix slab.c kernel-doc warningsRandy Dunlap2012-11-151-4/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix new kernel-doc warnings in mm/slab.c: Warning(mm/slab.c:2358): No description found for parameter 'cachep' Warning(mm/slab.c:2358): Excess function parameter 'name' description in '__kmem_cache_create' Warning(mm/slab.c:2358): Excess function parameter 'size' description in '__kmem_cache_create' Warning(mm/slab.c:2358): Excess function parameter 'align' description in '__kmem_cache_create' Warning(mm/slab.c:2358): Excess function parameter 'ctor' description in '__kmem_cache_create' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * mm/slob: use min_t() to compare ARCH_SLAB_MINALIGNArnd Bergmann2012-10-311-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The definition of ARCH_SLAB_MINALIGN is architecture dependent and can be either of type size_t or int. Comparing that value with ARCH_KMALLOC_MINALIGN can cause harmless warnings on platforms where they are different. Since both are always small positive integer numbers, using the size_t type to compare them is safe and gets rid of the warning. Without this patch, building ARM collie_defconfig results in: mm/slob.c: In function '__kmalloc_node': mm/slob.c:431:152: warning: comparison of distinct pointer types lacks a cast [enabled by default] mm/slob.c: In function 'kfree': mm/slob.c:484:153: warning: comparison of distinct pointer types lacks a cast [enabled by default] mm/slob.c: In function 'ksize': mm/slob.c:503:153: warning: comparison of distinct pointer types lacks a cast [enabled by default] Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> [ penberg@kernel.org: updates for master ] Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * slab: Ignore internal flags in cache creationGlauber Costa2012-10-314-25/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some flags are used internally by the allocators for management purposes. One example of that is the CFLGS_OFF_SLAB flag that slab uses to mark that the metadata for that cache is stored outside of the slab. No cache should ever pass those as a creation flags. We can just ignore this bit if it happens to be passed (such as when duplicating a cache in the kmem memcg patches). Because such flags can vary from allocator to allocator, we allow them to make their own decisions on that, defining SLAB_AVAILABLE_FLAGS with all flags that are valid at creation time. Allocators that doesn't have any specific flag requirement should define that to mean all flags. Common code will mask out all flags not belonging to that set. Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * mm/slob: Use free_page instead of put_page for page-size kmalloc allocationsEzequiel Garcia2012-10-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When freeing objects, the slob allocator currently free empty pages calling __free_pages(). However, page-size kmallocs are disposed using put_page() instead. It makes no sense to call put_page() for kernel pages that are provided by the object allocator, so we shouldn't be doing this ourselves. This is based on: commit d9b7f22623b5fa9cc189581dcdfb2ac605933bf4 Author: Glauber Costa <glommer@parallels.com> slub: use free_page instead of put_page for freeing kmalloc allocation Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * mm/sl[aou]b: Move common kmem_cache_size() to slab.hEzequiel Garcia2012-10-313-21/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function is identically defined in all three allocators and it's trivial to move it to slab.h Since now it's static, inline, header-defined function this patch also drops the EXPORT_SYMBOL tag. Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * mm/slob: Use object_size field in kmem_cache_size()Ezequiel Garcia2012-10-311-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fields object_size and size are not the same: the latter might include slab metadata. Return object_size field in kmem_cache_size(). Also, improve trace accuracy by correctly tracing reported size. Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * mm/slob: Drop usage of page->private for storing page-sized allocationsEzequiel Garcia2012-10-311-14/+10Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This field was being used to store size allocation so it could be retrieved by ksize(). However, it is a bad practice to not mark a page as a slab page and then use fields for special purposes. There is no need to store the allocated size and ksize() can simply return PAGE_SIZE << compound_order(page). Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * slub: Commonize slab_cache field in struct pageGlauber Costa2012-10-241-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now, slab and slub have fields in struct page to derive which cache a page belongs to, but they do it slightly differently. slab uses a field called slab_cache, that lives in the third double word. slub, uses a field called "slab", living outside of the doublewords area. Ideally, we could use the same field for this. Since slub heavily makes use of the doubleword region, there isn't really much room to move slub's slab_cache field around. Since slab does not have such strict placement restrictions, we can move it outside the doubleword area. The naming used by slab, "slab_cache", is less confusing, and it is preferred over slub's generic "slab". Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> CC: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>