summaryrefslogtreecommitdiffstats
path: root/fs/dcache.c
Commit message (Collapse)AuthorAgeFilesLines
...
* | vfs: don't copy things to user space holding the rcu readlockLinus Torvalds2013-09-121-1/+2
| | | | | | | | | | | | | | | | | | | | | | Oops. That wasn't very smart. We don't actually need the RCU lock any more by the time we copy the cwd string to user space, but I had stupidly surrounded the whole thing with it. Introduced by commit 8b19e34188a3 ("vfs: make getcwd() get the root and pwd path under rcu") Is-a-big-hairy-idiot: Linus Torvalds <torvalds@linux-foundation.org>
* | vfs: make getcwd() get the root and pwd path under rcuLinus Torvalds2013-09-121-11/+12
| | | | | | | | | | | | | | | | | | | | | | | | This allows us to skip all the crazy spinlocks and reference count updates, and instead use the fs sequence read-lock to get an atomic snapshot of the root and cwd information. We might want to make the rule that "prepend_path()" is always called with the RCU lock held, but the RCU lock nests fine and this is the minimal fix. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | vfs: move get_fs_root_and_pwd() to single callerLinus Torvalds2013-09-121-0/+11
| | | | | | | | | | | | | | | | Let's not pollute the include files with inline functions that are only used in a single place. Especially not if we decide we might want to change the semantics of said function to make it more efficient.. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | dcache: get/release read lock in read_seqbegin_or_lock() & friendWaiman Long2013-09-121-15/+16
|/ | | | | | | | | | | | | | | | | | | This patch modifies read_seqbegin_or_lock() and need_seqretry() to use newly introduced read_seqlock_excl() and read_sequnlock_excl() primitives so that they won't change the sequence number even if they fall back to take the lock. This is OK as no change to the protected data structure is being made. It will prevent one fallback to lock taking from cascading into a series of lock taking reducing performance because of the sequence number change. It will also allow other sequence readers to go forward while an exclusive reader lock is taken. This patch also updates some of the inaccurate comments in the code. Signed-off-by: Waiman Long <Waiman.Long@hp.com> To: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2013-09-101-76/+144
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 3 (of many) from Al Viro: "Waiman's conversion of d_path() and bits related to it, kern_path_mountpoint(), several cleanups and fixes (exportfs one is -stable fodder, IMO). There definitely will be more... ;-/" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: split read_seqretry_or_unlock(), convert d_walk() to resulting primitives dcache: Translating dentry into pathname without taking rename_lock autofs4 - fix device ioctl mount lookup introduce kern_path_mountpoint() rename user_path_umountat() to user_path_mountpoint_at() take unlazy_walk() into umount_lookup_last() Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c prune_super(): sb->s_op is never NULL exportfs: don't assume that ->iterate() won't feed us too long entries afs: get rid of redundant ->d_name.len checks
| * split read_seqretry_or_unlock(), convert d_walk() to resulting primitivesAl Viro2013-09-091-33/+31Star
| | | | | | | | | | | | | | | | | | | | Separate "check if we need to retry" from "unlock if we are done and had seq_writelock"; that allows to use these guys in d_walk(), where we need to recheck every time we ascend back to parent, but do *not* want to unlock until the very end. Lift rcu_read_lock/rcu_read_unlock out into callers. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * dcache: Translating dentry into pathname without taking rename_lockWaiman Long2013-09-091-63/+133
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running the AIM7's short workload, Linus' lockref patch eliminated most of the spinlock contention. However, there were still some left: 8.46% reaim [kernel.kallsyms] [k] _raw_spin_lock |--42.21%-- d_path | proc_pid_readlink | SyS_readlinkat | SyS_readlink | system_call | __GI___readlink | |--40.97%-- sys_getcwd | system_call | __getcwd The big one here is the rename_lock (seqlock) contention in d_path() and the getcwd system call. This patch will eliminate the need to take the rename_lock while translating dentries into the full pathnames. The need to take the rename_lock is to make sure that no rename operation can be ongoing while the translation is in progress. However, only one thread can take the rename_lock thus blocking all the other threads that need it even though the translation process won't make any change to the dentries. This patch will replace the writer's write_seqlock/write_sequnlock sequence of the rename_lock of the callers of the prepend_path() and __dentry_path() functions with the reader's read_seqbegin/read_seqretry sequence within these 2 functions. As a result, the code will have to retry if one or more rename operations had been performed. In addition, RCU read lock will be taken during the translation process to make sure that no dentries will go away. To prevent live-lock from happening, the code will switch back to take the rename_lock if read_seqretry() fails for three times. To further reduce spinlock contention, this patch does not take the dentry's d_lock when copying the filename from the dentries. Instead, it treats the name pointer and length as unreliable and just copy the string byte-by-byte over until it hits a null byte or the end of string as specified by the length. This should avoid stepping into invalid memory address. The error cases are left to be handled by the sequence number check. The following code re-factoring are also made: 1. Move prepend('/') into prepend_name() to remove one conditional check. 2. Move the global root check in prepend_path() back to the top of the while loop. With this patch, the _raw_spin_lock will now account for only 1.2% of the total CPU cycles for the short workload. This patch also has the effect of reducing the effect of running perf on its profile since the perf command itself can be a heavy user of the d_path() function depending on the complexity of the workload. When taking the perf profile of the high-systime workload, the amount of spinlock contention contributed by running perf without this patch was about 16%. With this patch, the spinlock contention caused by the running of perf will go away and we will have a more accurate perf profile. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | vfs: use lockred "dead" flag to mark unrecoverably dead dentriesLinus Torvalds2013-09-081-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This simplifies the RCU to refcounting code in particular. I was originally intending to leave this for later, but walking through all the dput() logic (see previous commit), I realized that the dput() "might_sleep()" check was misleadingly weak. And I removed it as misleading, both for performance profiling and for debugging. However, the might_sleep() debugging case is actually true: the final dput() can indeed sleep, if the inode of the dentry that you are releasing ends up sleeping at iput time (see dentry_iput()). So the problem with the might_sleep() in dput() wasn't that it wasn't true, it was that it wasn't actually testing and triggering on the interesting case. In particular, just about *any* dput() can indeed sleep, if you happen to race with another thread deleting the file in question, and you then lose the race to the be the last dput() for that file. But because it's a very rare race, the debugging code would never trigger it in practice. Why is this problematic? The new d_rcu_to_refcount() (see commit 15570086b590: "vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock()") does a dput() for the failure case, and it does it under the RCU lock. So potentially sleeping really is a bug. But there's no way I'm going to fix this with the previous complicated "lockref_get_or_lock()" interface. And rather than revert to the old and crufty nested dentry locking code (which did get this right by delaying the reference count updates until they were verified to be safe), let's make forward progress. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | vfs: reorganize dput() memory accessesLinus Torvalds2013-09-081-10/+10
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is me being a bit OCD after all the dentry optimization work this merge window: profiles end up showing 'dput()' as a rather expensive operation, and there were two unrelated bad reasons for that. The first reason was reading d_lockref.count for debugging purposes, which touches the lockref cacheline (for reads) before really need to. More importantly, the debugging test in question is _wrong_, and has hidden bugs. It's true that we can only sleep when the count goes down to zero, but the test as-is hides the much more subtle bug that happens if we race with somebody else deleting the file. Anyway we _will_ touch that cacheline, but let's do it for a write and in the right routine (ie in "lockref_put_or_lock()") which annotates the costs better. So remove the misleading debug code. The other was an unnecessary access to the cacheline that contains the d_lru list, just to check whether we already were on the LRU list or not. This is exactly what we have d_flags for, so that we can avoid touching extra cache lines for the common case. So just add another bit for "is this dentry on the LRU". Finally, mark the tests properly likely/unlikely, so that the common fast-paths are dense in the instruction stream. This makes the profiles look much saner. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vfs: check unlinked ancestors before mountMiklos Szeredi2013-09-051-0/+33
| | | | | | | | | | | | | | | | We check submounts before doing d_drop() on a non-empty directory dentry in NFS (have_submounts()), but we do not exclude a racing mount. Nor do we prevent mounts to be added to the disconnected subtree using relative paths after the d_drop(). This patch fixes these issues by checking for unlinked (unhashed, non-root) ancestors before proceeding with the mount. This is done with rename seqlock taken for write and with ->d_lock grabbed on each ancestor in turn, including our dentry itself. This ensures that the only one of check_submounts_and_drop() or has_unlinked_ancestor() can succeed. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: check submounts and drop atomicallyMiklos Szeredi2013-09-051-0/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | We check submounts before doing d_drop() on a non-empty directory dentry in NFS (have_submounts()), but we do not exclude a racing mount. Process A: have_submounts() -> returns false Process B: mount() -> success Process A: d_drop() This patch prepares the ground for the fix by doing the following operations all under the same rename lock: have_submounts() shrink_dcache_parent() d_drop() This is actually an optimization since have_submounts() and shrink_dcache_parent() both traverse the same dentry tree separately. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> CC: David Howells <dhowells@redhat.com> CC: Steven Whitehouse <swhiteho@redhat.com> CC: Trond Myklebust <Trond.Myklebust@netapp.com> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: add d_walk()Miklos Szeredi2013-09-051-161/+148Star
| | | | | | | | | | | This one replaces three instances open coded tree walking (have_submounts, select_parent, d_genocide) with a common helper. In addition to slightly reducing the kernel size, this simplifies the callers and makes them less bug prone. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: restructure d_genocide()Miklos Szeredi2013-09-051-8/+4Star
| | | | | | | | | | | It shouldn't matter when we decrement the refcount during the walk as long as we do it exactly once. Restructure d_genocide() to do the killing on entering the dentry instead of when leaving it. This helps creating a common helper for tree walking. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of ↵Linus Torvalds2013-09-051-2/+11
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "Unfortunately, this merge window it'll have a be a lot of small piles - my fault, actually, for not keeping #for-next in anything that would resemble a sane shape ;-/ This pile: assorted fixes (the first 3 are -stable fodder, IMO) and cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last components) + several long-standing patches from various folks. There definitely will be a lot more (starting with Miklos' check_submount_and_drop() series)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) direct-io: Handle O_(D)SYNC AIO direct-io: Implement generic deferred AIO completions add formats for dentry/file pathnames kvm eventfd: switch to fdget powerpc kvm: use fdget switch fchmod() to fdget switch epoll_ctl() to fdget switch copy_module_from_fd() to fdget git simplify nilfs check for busy subtree ibmasmfs: don't bother passing superblock when not needed don't pass superblock to hypfs_{mkdir,create*} don't pass superblock to hypfs_diag_create_files don't pass superblock to hypfs_vm_create_files() oprofile: get rid of pointless forward declarations of struct super_block oprofilefs_create_...() do not need superblock argument oprofilefs_mkdir() doesn't need superblock argument don't bother with passing superblock to oprofile_create_stats_files() oprofile: don't bother with passing superblock to ->create_files() don't bother passing sb to oprofile_create_files() coh901318: don't open-code simple_read_from_buffer() ...
| * vfs: call d_op->d_prune() before unhashing dentryYan, Zheng2013-09-041-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The d_prune dentry operation is used to notify filesystem when VFS about to prune a hashed dentry from the dcache. There are three code paths that prune dentries: shrink_dcache_for_umount_subtree(), prune_dcache_sb() and d_prune_aliases(). For the d_prune_aliases() case, VFS unhashes the dentry first, then call the d_prune dentry operation. This confuses ceph_d_prune() (ceph uses the d_prune dentry operation to maintain a flag indicating whether the complete contents of a directory are in the dcache, pruning unhashed dentry does not affect dir's completeness) This patch fixes the issue by calling the d_prune dentry operation in d_prune_aliases(), before unhashing the dentry. Also make VFS only call the d_prune dentry operation for hashed dentry, to avoid calling the d_prune dentry operation twice when dentry is pruned by d_prune_aliases(). Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock()Linus Torvalds2013-09-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This moves __d_rcu_to_refcount() from <linux/dcache.h> into fs/namei.c and re-implements it using the lockref infrastructure instead. It also adds a lot of comments about what is actually going on, because turning a dentry that was looked up using RCU into a long-lived reference counted entry is one of the more subtle parts of the rcu walk. We also used to be _particularly_ subtle in unlazy_walk() where we re-validate both the dentry and its parent using the same sequence count. We used to do it by nesting the locks and then verifying the sequence count just once. That was silly, because nested locking is expensive, but the sequence count check is not. So this just re-validates the dentry and the parent separately, avoiding the nested locking, and making the lockref lookup possible. Acked-by: Waiman Long <waiman.long@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | vfs: use lockref_get_not_zero() for optimistic lockless dget_parent()Waiman Long2013-09-021-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A valid parent pointer is always going to have a non-zero reference count, but if we look up the parent optimistically without locking, we have to protect against the (very unlikely) race against renaming changing the parent from under us. We do that by using lockref_get_not_zero(), and then re-checking the parent pointer after getting a valid reference. [ This is a re-implementation of a chunk from the original patch by Waiman Long: "dcache: Enable lockless update of dentry's refcount". I've completely rewritten the patch-series and split it up, but I'm attributing this part to Waiman as it's close enough to his earlier patch - Linus ] Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | vfs: make the dentry cache use the lockref infrastructureWaiman Long2013-08-291-34/+23Star
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | This just replaces the dentry count/lock combination with the lockref structure that contains both a count and a spinlock, and does the mechanical conversion to use the lockref infrastructure. There are no semantic changes here, it's purely syntactic. The reference lockref implementation uses the spinlock exactly the same way that the old dcache code did, and the bulk of this patch is just expanding the internal "d_count" use in the dcache code to use "d_lockref.count" instead. This is purely preparation for the real change to make the reference count updates be lockless during the 3.12 merge window. [ As with the previous commit, this is a rewritten version of a concept originally from Waiman, so credit goes to him, blame for any errors goes to me. Waiman's patch had some semantic differences for taking advantage of the lockless update in dget_parent(), while this patch is intentionally a pure search-and-replace change with no semantic changes. - Linus ] Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cope with potentially long ->d_dname() output for shmem/hugetlbAl Viro2013-08-241-0/+11
| | | | | | | | | | dynamic_dname() is both too much and too little for those - the output may be well in excess of 64 bytes dynamic_dname() assumes to be enough (thanks to ashmem feeding really long names to shmem_file_setup()) and vsnprintf() is an overkill for those guys. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of ↵Linus Torvalds2013-07-031-43/+23Star
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second set of VFS changes from Al Viro: "Assorted f_pos race fixes, making do_splice_direct() safe to call with i_mutex on parent, O_TMPFILE support, Jeff's locks.c series, ->d_hash/->d_compare calling conventions changes from Linus, misc stuff all over the place." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) Document ->tmpfile() ext4: ->tmpfile() support vfs: export lseek_execute() to modules lseek_execute() doesn't need an inode passed to it block_dev: switch to fixed_size_llseek() cpqphp_sysfs: switch to fixed_size_llseek() tile-srom: switch to fixed_size_llseek() proc_powerpc: switch to fixed_size_llseek() ubi/cdev: switch to fixed_size_llseek() pci/proc: switch to fixed_size_llseek() isapnp: switch to fixed_size_llseek() lpfc: switch to fixed_size_llseek() locks: give the blocked_hash its own spinlock locks: add a new "lm_owner_key" lock operation locks: turn the blocked_list into a hashtable locks: convert fl_link to a hlist_node locks: avoid taking global lock if possible when waking up blocked waiters locks: protect most of the file_lock handling with i_lock locks: encapsulate the fl_link list handling locks: make "added" in __posix_lock_file a bool ...
| * Don't pass inode to ->d_hash() and ->d_compare()Linus Torvalds2013-06-291-17/+10Star
| | | | | | | | | | | | | | | | | | | | | | | | Instances either don't look at it at all (the majority of cases) or only want it to find the superblock (which can be had as dentry->d_sb). A few cases that want more are actually safe with dentry->d_inode - the only precaution needed is the check that it hadn't been replaced with NULL by rmdir() or by overwriting rename(), which case should be simply treated as cache miss. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * kill find_inode_number()Al Viro2013-06-291-29/+0Star
| | | | | | | | | | | | | | | | the only remaining caller (in ncpfs) is guaranteed to return 0 - we only hit it if we'd just checked that there's no dentry with such name. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * [O_TMPFILE] it's still short a few helpers, but infrastructure should be OK ↵Al Viro2013-06-291-0/+16
| | | | | | | | | | | | now... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | GFS2: Add atomic_open supportSteven Whitehouse2013-06-141-2/+9
|/ | | | | | | | | | | | | | | | | | | | | I've restricted atomic_open to only operate on regular files, although I still don't understand why atomic_open should not be possible also for directories on GFS2. That can always be added in later though, if it makes sense. The ->atomic_open function can be passed negative dentries, which in most cases means either ENOENT (->lookup) or a call to d_instantiate (->create). In the GFS2 case though, we need to actually perform the look up, since we do not know whether there has been a new inode created on another node. The look up calls d_splice_alias which then tries to rehash the dentry - so the solution here is to simply check for that in d_splice_alias. The same issue is likely to affect any other cluster filesystem implementing ->atomic_open Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "J. Bruce Fields" <bfields fieldses org> Cc: Jeff Layton <jlayton@redhat.com>
* vfs: use list_move instead of list_del/list_addWei Yongjun2013-05-041-2/+1Star
| | | | | | | Using list_move() instead of list_del() + list_add(). Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: remove dentry_lru_prune()Yan, Zheng2013-05-041-22/+9Star
| | | | | | | | | | | | | When pruning a dentry, its ancestor dentry can also be pruned. But the ancestor dentry does not go through dput(), so it does not get put on the dentry LRU. Hence associating d_prune with removing the dentry from the LRU is the wrong. The fix is remove dentry_lru_prune(). Call file system's d_prune() callback directly when pruning dentries. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs/dcache.c: add cond_resched() to shrink_dcache_parent()Greg Thelen2013-05-011-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Call cond_resched() in shrink_dcache_parent() to maintain interactivity. Before this patch: void shrink_dcache_parent(struct dentry * parent) { while ((found = select_parent(parent, &dispose)) != 0) shrink_dentry_list(&dispose); } select_parent() populates the dispose list with dentries which shrink_dentry_list() then deletes. select_parent() carefully uses need_resched() to avoid doing too much work at once. But neither shrink_dcache_parent() nor its called functions call cond_resched(). So once need_resched() is set select_parent() will return single dentry dispose list which is then deleted by shrink_dentry_list(). This is inefficient when there are a lot of dentry to process. This can cause softlockup and hurts interactivity on non preemptable kernels. This change adds cond_resched() in shrink_dcache_parent(). The benefit of this is that need_resched() is quickly cleared so that future calls to select_parent() are able to efficiently return a big batch of dentry. These additional cond_resched() do not seem to impact performance, at least for the workload below. Here is a program which can cause soft lockup if other system activity sets need_resched(). int main() { struct rlimit rlim; int i; int f[100000]; char buf[20]; struct timeval t1, t2; double diff; /* cleanup past run */ system("rm -rf x"); /* boost nfile rlimit */ rlim.rlim_cur = 200000; rlim.rlim_max = 200000; if (setrlimit(RLIMIT_NOFILE, &rlim)) err(1, "setrlimit"); /* make directory for files */ if (mkdir("x", 0700)) err(1, "mkdir"); if (gettimeofday(&t1, NULL)) err(1, "gettimeofday"); /* populate directory with open files */ for (i = 0; i < 100000; i++) { snprintf(buf, sizeof(buf), "x/%d", i); f[i] = open(buf, O_CREAT); if (f[i] == -1) err(1, "open"); } /* close some of the files */ for (i = 0; i < 85000; i++) close(f[i]); /* unlink all files, even open ones */ system("rm -rf x"); if (gettimeofday(&t2, NULL)) err(1, "gettimeofday"); diff = (((double)t2.tv_sec * 1000000 + t2.tv_usec) - ((double)t1.tv_sec * 1000000 + t1.tv_usec)); printf("done: %g elapsed\n", diff/1e6); return 0; } Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Nest rename_lock inside vfsmount_lockAl Viro2013-03-261-5/+11
| | | | | | | | | | | | | | | | | ... lest we get livelocks between path_is_under() and d_path() and friends. The thing is, wrt fairness lglocks are more similar to rwsems than to rwlocks; it is possible to have thread B spin on attempt to take lock shared while thread A is already holding it shared, if B is on lower-numbered CPU than A and there's a thread C spinning on attempt to take the same lock exclusive. As the result, we need consistent ordering between vfsmount_lock (lglock) and rename_lock (seq_lock), even though everything that takes both is going to take vfsmount_lock only shared. Spotted-by: Brad Spengler <spender@grsecurity.net> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* hlist: drop the node parameter from iteratorsSasha Levin2013-02-281-6/+3Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry opJeff Layton2013-02-261-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following set of operations on a NFS client and server will cause server# mkdir a client# cd a server# mv a a.bak client# sleep 30 # (or whatever the dir attrcache timeout is) client# stat . stat: cannot stat `.': Stale NFS file handle Obviously, we should not be getting an ESTALE error back there since the inode still exists on the server. The problem is that the lookup code will call d_revalidate on the dentry that "." refers to, because NFS has FS_REVAL_DOT set. nfs_lookup_revalidate will see that the parent directory has changed and will try to reverify the dentry by redoing a LOOKUP. That of course fails, so the lookup code returns ESTALE. The problem here is that d_revalidate is really a bad fit for this case. What we really want to know at this point is whether the inode is still good or not, but we don't really care what name it goes by or whether the dcache is still valid. Add a new d_op->d_weak_revalidate operation and have complete_walk call that instead of d_revalidate. The intent there is to allow for a "weaker" d_revalidate that just checks to see whether the inode is still good. This is also gives us an opportunity to kill off the FS_REVAL_DOT special casing. [AV: changed method name, added note in porting, fixed confusion re having it possibly called from RCU mode (it won't be)] Cc: NeilBrown <neilb@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* d_hash_and_lookup(): export, switch open-coded instancesAl Viro2013-02-261-12/+11Star
| | | | | | | | * calling conventions change - ERR_PTR() is returned on ->d_hash() errors; NULL is just for dcache miss now. * exported, open-coded instances in ncpfs and cifs converted. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* __d_materialise_unique() is too genericAl Viro2013-02-231-14/+5Star
| | | | | | | Its first argument is always non-root, while the second one is always root. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* constify d_lookup() argumentsAl Viro2013-02-231-1/+1
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* constify __d_lookup() argumentsAl Viro2013-02-231-1/+1
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: remove d_path_with_unreachableJeff Layton2013-02-231-31/+0Star
| | | | | | | The last caller was removed >2 years ago in commit 7b2a69ba7. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: d_obtain_alias() needs to use "/" as default name.NeilBrown2012-12-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | NFS appears to use d_obtain_alias() to create the root dentry rather than d_make_root. This can cause 'prepend_path()' to complain that the root has a weird name if an NFS filesystem is lazily unmounted. e.g. if "/mnt" is an NFS mount then { cd /mnt; umount -l /mnt ; ls -l /proc/self/cwd; } will cause a WARN message like WARNING: at /home/git/linux/fs/dcache.c:2624 prepend_path+0x1d7/0x1e0() ... Root dentry has weird name <> to appear in kernel logs. So change d_obtain_alias() to use "/" rather than "" as the anonymous name. Signed-off-by: NeilBrown <neilb@suse.de> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: remove DCACHE_NEED_LOOKUPJeff Layton2012-12-201-32/+1Star
| | | | | | | | | | | The code that relied on that flag was ripped out of btrfs quite some time ago, and never added back. Josef indicated that he was going to take a different approach to the problem in btrfs, and that we could just eliminate this flag. Cc: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of ↵Linus Torvalds2012-10-031-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs update from Al Viro: - big one - consolidation of descriptor-related logics; almost all of that is moved to fs/file.c (BTW, I'm seriously tempted to rename the result to fd.c. As it is, we have a situation when file_table.c is about handling of struct file and file.c is about handling of descriptor tables; the reasons are historical - file_table.c used to be about a static array of struct file we used to have way back). A lot of stray ends got cleaned up and converted to saner primitives, disgusting mess in android/binder.c is still disgusting, but at least doesn't poke so much in descriptor table guts anymore. A bunch of relatively minor races got fixed in process, plus an ext4 struct file leak. - related thing - fget_light() partially unuglified; see fdget() in there (and yes, it generates the code as good as we used to have). - also related - bits of Cyrill's procfs stuff that got entangled into that work; _not_ all of it, just the initial move to fs/proc/fd.c and switch of fdinfo to seq_file. - Alex's fs/coredump.c spiltoff - the same story, had been easier to take that commit than mess with conflicts. The rest is a separate pile, this was just a mechanical code movement. - a few misc patches all over the place. Not all for this cycle, there'll be more (and quite a few currently sit in akpm's tree)." Fix up trivial conflicts in the android binder driver, and some fairly simple conflicts due to two different changes to the sock_alloc_file() interface ("take descriptor handling from sock_alloc_file() to callers" vs "net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries" adding a dentry name to the socket) * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits) MAX_LFS_FILESIZE should be a loff_t compat: fs: Generic compat_sys_sendfile implementation fs: push rcu_barrier() from deactivate_locked_super() to filesystems btrfs: reada_extent doesn't need kref for refcount coredump: move core dump functionality into its own file coredump: prevent double-free on an error path in core dumper usb/gadget: fix misannotations fcntl: fix misannotations ceph: don't abuse d_delete() on failure exits hypfs: ->d_parent is never NULL or negative vfs: delete surplus inode NULL check switch simple cases of fget_light to fdget new helpers: fdget()/fdput() switch o2hb_region_dev_write() to fget_light() proc_map_files_readdir(): don't bother with grabbing files make get_file() return its argument vhost_set_vring(): turn pollstart/pollstop into bool switch prctl_set_mm_exe_file() to fget_light() switch xfs_find_handle() to fget_light() switch xfs_swapext() to fget_light() ...
| * vfs: delete surplus inode NULL checkAlan Cox2012-09-271-1/+1
| | | | | | | | | | | | | | | | | | Each iteration of d_delete we reload inode from dentry->d_inode and then call S_ISDIR(inode-i_mode), so inode cannot possibly be NULL shortly afterwards unless something went horribly wrong. Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | vfs: dcache: fix deadlock in tree traversalMiklos Szeredi2012-09-301-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IBM reported a deadlock in select_parent(). This was found to be caused by taking rename_lock when already locked when restarting the tree traversal. There are two cases when the traversal needs to be restarted: 1) concurrent d_move(); this can only happen when not already locked, since taking rename_lock protects against concurrent d_move(). 2) racing with final d_put() on child just at the moment of ascending to parent; rename_lock doesn't protect against this rare race, so it can happen when already locked. Because of case 2, we need to be able to handle restarting the traversal when rename_lock is already held. This patch fixes all three callers of try_to_ascend(). IBM reported that the deadlock is gone with this patch. [ I rewrote the patch to be smaller and just do the "goto again" if the lock was already held, but credit goes to Miklos for the real work. - Linus ] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | trivial select_parent documentation fixJ. Bruce Fields2012-09-281-1/+1
|/ | | | | | | "Search list for X" sounds like you're trying to find X on a list. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vfs: dcache: use DCACHE_DENTRY_KILLED instead of DCACHE_DISCONNECTED in d_kill()Miklos Szeredi2012-09-181-2/+2
| | | | | | | | | | | | | | | | | | | | | IBM reported a soft lockup after applying the fix for the rename_lock deadlock. Commit c83ce989cb5f ("VFS: Fix the nfs sillyrename regression in kernel 2.6.38") was found to be the culprit. The nfs sillyrename fix used DCACHE_DISCONNECTED to indicate that the dentry was killed. This flag can be set on non-killed dentries too, which results in infinite retries when trying to traverse the dentry tree. This patch introduces a separate flag: DCACHE_DENTRY_KILLED, which is only set in d_kill() and makes try_to_ascend() test only this flag. IBM reported successful test results with this patch. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* __d_unalias() should refuse to move mountpointsAl Viro2012-07-141-4/+5
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: switch i_dentry/d_alias to hlistAl Viro2012-07-141-15/+18
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* get rid of ->mnt_longtermAl Viro2012-07-141-1/+1
| | | | | | | | | it's enough to set ->mnt_ns of internal vfsmounts to something distinct from all struct mnt_namespace out there; then we can just use the check for ->mnt_ns != NULL in the fast path of mntput_no_expire() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Revert "vfs: stop d_splice_alias creating directory aliases"Linus Torvalds2012-06-081-6/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 7732a557b1342c6e6966efb5f07effcf99f56167 (and commit 3f50fff4dace23d3cfeb195d5cd4ee813cee68b7, which was a follow-up cleanup). We're chasing an elusive bug that Dave Jones can apparently reproduce using his system call fuzzer tool, and that looks like some kind of locking ordering problem on the directory i_mutex chain. Our i_mutex locking is rather complex, and depends on the topological ordering of the directories, which is why we have been very wary of splicing directory entries around. Of course, we really don't want to ever see aliased unconnected directories anyway, so none of this should ever happen, but this revert aims to basically get us back to a known older state. Bruce points to some of the previous discussion at http://marc.info/?i=<20110310105821.GE22723@ZenIV.linux.org.uk> and in particular a long post from Neil: http://marc.info/?i=<20110311150749.2fa2be66@notabene.brown> It should be noted that it's possible that Dave's problems come from other changes altohgether, including possibly just the fact that Dave constantly is teachning his fuzzer new tricks. So what appears to be a new bug could in fact be an old one that just gets newly triggered, but reverting these patches as "still under heavy discussion" is the right thing regardless. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vfs: remove unused __d_splice_alias argumentJ. Bruce Fields2012-05-311-8/+5Star
| | | | | | | | Nobody sets want_disconn any more. Reported-by: Peng Tao <bergwolf@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: stop d_splice_alias creating directory aliasesJ. Bruce Fields2012-05-311-2/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A directory should never have more than one dentry pointing to it. But d_splice_alias() will add one if it finds a directory with an already-existing non-DISCONNECTED dentry. I can't find an obvious reproducer, but I also can't see what prevents d_splice_alias() from encountering such a case. It therefore seems safest to allow d_splice_alias to use any dentry it finds. (Prior to the removal of dentry_unhash() from vfs_rmdir(), around v3.0, this could cause an nfsd deadlock like this: - Somebody attempts to remove a non-empty directory. - The dentry_unhash() in vfs_rmdir() unhashes the dentry pointing to the non-empty directory. - ->rmdir() then fails with -ENOTEMPTY - Before the vfs_rmdir() caller reaches dput(), an nfsd process in rename looks up the directory by filehandle; at the end of that lookup, this dentry is found by d_alloc_anon(), and a reference is taken on it, preventing dput() from removing it. - A regular lookup of the directory calls d_splice_alias(), finds only an unhashed (not a DISCONNECTED) dentry, and insteads adds a new one, so the directory now has two dentries. - The nfsd process in rename, which was previously looking up the source directory of the rename, now looks up the target directory (which is the same), and gets the dentry newly created by the previous lookup. - The rename, seeing two different dentries, assumes this is a cross-directory rename and attempts to take the i_mutex on the directory twice. That reproducer no longer exists, but I don't think there was anything fundamentally incorrect about the vfs_rmdir() behavior there, so I think the real fault was here in d_splice_alias().) Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* brlocks/lglocks: API cleanupsAndi Kleen2012-05-301-2/+2
| | | | | | | | | | | | | | | | | | | | | lglocks and brlocks are currently generated with some complicated macros in lglock.h. But there's no reason to not just use common utility functions and put all the data into a common data structure. In preparation, this patch changes the API to look more like normal function calls with pointers, not magic macros. The patch is rather large because I move over all users in one go to keep it bisectable. This impacts the VFS somewhat in terms of lines changed. But no actual behaviour change. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* mm: add a low limit to alloc_large_system_hashTim Bird2012-05-241-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | UDP stack needs a minimum hash size value for proper operation and also uses alloc_large_system_hash() for proper NUMA distribution of its hash tables and automatic sizing depending on available system memory. On some low memory situations, udp_table_init() must ignore the alloc_large_system_hash() result and reallocs a bigger memory area. As we cannot easily free old hash table, we leak it and kmemleak can issue a warning. This patch adds a low limit parameter to alloc_large_system_hash() to solve this problem. We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table allocation. Reported-by: Mark Asselstine <mark.asselstine@windriver.com> Reported-by: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>