summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* nfs: prepare to share nfs_set_portJ. Bruce Fields2008-10-082-19/+20
| | | | | | | We plan to use this function elsewhere. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* nfs: replace while loop by for loops in nfs_follow_referralJ. Bruce Fields2008-10-081-12/+5Star
| | | | | | | Whoever wrote this had a bizarre allergy to for loops. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* nfs: break up nfs_follow_referralJ. Bruce Fields2008-10-081-38/+46
| | | | | | | This function is a little longer and more deeply nested than necessary. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* nfs: authenticated deep mountingEG Keizer2008-10-082-4/+26
| | | | | | | | | | | | | | | | | | | | | | | Allow mount to do authenticated mounts below the root of the exported tree. The wording in RFC 2623, sec 2.3.2. allows fsinfo with UNIX authentication on the root of the export. Mounts are not always done on the root of the exported tree. Especially autoumounts often mount below the root of the exported tree. Some server implementations (justly) require full authentication for the so-called deep mounts. The old code used AUTH_SYS only. This caused deep mounts to fail on systems requiring stronger authentication.. The client should try both authentication types and use the first one that succeeds. This method was already partially implemented. This patch completes the implementation for NFS2 and NFS3. This patch was developed to allow Debian systems to automount home directories on Solaris servers with krb5 authentication. Tested on kernel 2.6.24-etchnhalf.1 Signed-off-by: E.G. Keizer <keie@few.vu.nl> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* NFS: missing nfs_fattr_init in nfs3_proc_getacl and nfs3_proc_setacls ↵Jeff Layton2008-10-081-0/+2
| | | | | | | | | | | | | | (resend #2) The fattrs used in the NFSv3 getacl/setacl calls are not being properly initialized. This occasionally causes nfs_update_inode to fall into NFSv4 specific codepaths when handling post-op attrs from these calls. Thanks to Cai Qian for noticing the spurious NFSv4 messages in debug output from a v3 mount... Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* nfs: remove an obsolete nfs_flock commentJ. Bruce Fields2008-10-081-7/+0Star
| | | | | | | We *do* now allow bsd flocks over nfs. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* nfs: BUG_ON in nfs_follow_mountpointDenis V. Lunev2008-10-081-1/+4
| | | | | | | | | | | | | | Unfortunately, BUG_ON(IS_ROOT(dentry)) can happen inside nfs_follow_mountpoint with NFS running Fedora 8 using a specific setup. https://bugzilla.redhat.com/show_bug.cgi?id=458622 So, the situation should be handled on NFS client gracefully. Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Trond Myklebust <Trond.Myklebust@netapp.com> CC: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* nfs: ERR_PTR is expected on failure from nfs_do_clone_mountDenis V. Lunev2008-10-081-1/+1
| | | | | | | Replace NULL with ERR_PTR(-EINVAL). Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* fix fs/nfs/nfsroot.c compilationAdrian Bunk2008-10-071-1/+1
| | | | | | | | | | | | | | | | | | This patch fixes the following compile error caused by commit f9247273cb69ba101877e946d2d83044409cc8c5 (UFS: add const to parser token tabl): <-- snip --> ... CC fs/nfs/nfsroot.o /home/bunk/linux/kernel-2.6/git/linux-2.6/fs/nfs/nfsroot.c:130: error: tokens causes a section type conflict make[3]: *** [fs/nfs/nfsroot.o] Error 1 <-- snip --> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* NFS: Allow concurrent inode revalidationTrond Myklebust2008-10-071-41/+2Star
| | | | | | | | | Currently, if two processes are both trying to revalidate metadata for the same inode, they will find themselves being serialised. There is no good justification for this now that we have improved our ability to detect stale attribute data, so we should remove that serialisation. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* NFS: Fix up nfs_setattr_update_inode()Trond Myklebust2008-10-071-1/+1
| | | | | | Ensure that it sets the inode metadata under the correct spinlock. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* NFS: Don't clear nfsi->cache_validity in nfs_check_inode_attributes()Trond Myklebust2008-10-071-7/+0Star
| | | | | | | | | If we're merely checking the inode attributes because we suspect that the 'updated' attributes returned by the RPC call are stale, then we shouldn't be doing weak cache consistency updates or clearing the cache_validity flags. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* NFS: Convert __nfs_revalidate_inode() to use nfs_refresh_inode()Trond Myklebust2008-10-071-4/+1Star
| | | | | | | | | | | | | In the case where there are parallel RPC calls to the same inode, we may receive stale metadata due to the lack of ordering, hence the sanity checking of metadata in nfs_refresh_inode(). Currently, __nfs_revalidate_inode() is calling nfs_update_inode() directly, without any further sanity checks, and hence may end up setting the inode up with stale metadata. Fix is to use nfs_refresh_inode() instead of nfs_update_inode(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* NFS: Fix nfs_post_op_update_inode_force_wcc()Trond Myklebust2008-10-071-8/+27
| | | | | | | | | | If we believe that the attributes are old (see nfs_refresh_inode()), then we shouldn't force an update. Also ensure that we hold the inode->i_lock across attribute checks and the call to nfs_refresh_inode_locked() to ensure that we don't race with other attribute updates. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* NFS: Fix the NFS attribute updateTrond Myklebust2008-10-071-3/+41
| | | | | | | | | | | | | | | | Currently nfs_refresh_inode() will only update the inode metadata if it sees that the RPC call that returned the nfs_fattr was started after the last update of the inode. This means that if we have parallel RPC calls to the same inode (when sending WRITE calls, for instance), we may often miss updates. This patch attempts to recover those missed updates by also accepting them if the ctime in the nfs_fattr is more recent than the inode's cached ctime. It also recovers the case where the file size has increased, but the ctime has not been updated due to limited ctime resolution. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* NFS: Clean up nfs_refresh_inode() and nfs_post_op_update_inode()Trond Myklebust2008-10-071-7/+14
| | | | | | | | | Try to avoid taking and dropping the inode->i_lock more than once. Do so by moving the code in nfs_refresh_inode() that needs to be done under the spinlock into a function nfs_refresh_inode_locked(), and then having both nfs_refresh_inode() and nfs_post_op_update_inode() call it directly. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* NFS: Add mount options for controlling the lookup cacheTrond Myklebust2008-10-071-0/+43
| | | | | | | | | | | Add the following NFS-specific mount options to the parser. -o lookupcache=all /* Default: cache positive & negative dentries */ -o lookupcache=pos[itive] /* Don't cache negative dentries */ -o lookupcache=none /* Strict revalidation of all dentries */ Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* NFS: Don't apply NFS_MOUNT_FLAGMASK to text-based mountsTrond Myklebust2008-10-072-3/+3
| | | | | | | | | | The point of introducing text-based mounts was to allow us to add functionality without having to worry about legacy binary mount formats. The mask should be there in order to ensure that binary formats don't start enabling features that they cannot support. There is no justification for applying it to the text mount path. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* NFS: Add options for finer control of the lookup cacheTrond Myklebust2008-10-071-0/+4
| | | | | | | | | | | | Add the flag NFS_MOUNT_LOOKUP_CACHE_NONEG to turn off the caching of negative dentries. In reality what we do is to force nfs_lookup_revalidate() to always discard negative dentries. Add the flag NFS_MOUNT_LOOKUP_CACHE_NONE for enforcing stricter revalidation of dentries. It forces the revalidate code to always do a lookup instead of just checking the cached mtime of the parent directory. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* NFS: Clean up nfs_sb_active/nfs_sb_deactiveTrond Myklebust2008-10-074-21/+13Star
| | | | | | | | | Instead of causing umount requests to block on server->active_wq while the asynchronous sillyrename deletes are executing, we can use the sb->s_active counter to obtain a reference to the super_block, and then release that reference in nfs_async_unlink_release(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* NFS: Fix nfs_file_llseek()Trond Myklebust2008-10-071-4/+7
| | | | | | | | | | | | | | After the BKL removal patches were applied to the rest of the NFS code, the BKL protection in nfs_file_llseek() is no longer sufficient to ensure that inode->i_size is read safely in generic_file_llseek_unlocked(). In order to fix the situation, we either have to replace the naked read of inode->i_size in generic_file_llseek_unlocked() with i_size_read(), or the whole thing needs to be executed under the inode->i_lock; In order to avoid disrupting other filesystems, avoid touching generic_file_llseek_unlocked() for now... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* mm: tiny-shmem nommu fixNick Piggin2008-10-031-1/+1
| | | | | | | | | | | | | | | | | | | | The previous patch db203d53d474aa068984e409d807628f5841da1b ("mm: tiny-shmem fix lock ordering: mmap_sem vs i_mutex") to fix the lock ordering in tiny-shmem breaks shared anonymous and IPC memory on NOMMU architectures because it was using the expanding truncate to signal ramfs to allocate a physically contiguous RAM backing the inode (otherwise it is unusable for "memory mapping" it to userspace). However do_truncate is what caused the lock ordering error, due to it taking i_mutex. In this case, we can actually just call ramfs directly to allocate memory for the mapping, rather than go via truncate. Acked-by: David Howells <dhowells@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* inotify: fix lock ordering wrt do_page_fault's mmap_semNick Piggin2008-10-031-7/+20
| | | | | | | | | | | | | Fix inotify lock order reversal with mmap_sem due to holding locks over copy_to_user. Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-by: "Daniel J Blueman" <daniel.blueman@gmail.com> Tested-by: "Daniel J Blueman" <daniel.blueman@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm owner: fix race between swapoff and exitBalbir Singh2008-09-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a race between mm->owner assignment and swapoff, more easily seen when task slab poisoning is turned on. The condition occurs when try_to_unuse() runs in parallel with an exiting task. A similar race can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats> or ptrace or page migration. CPU0 CPU1 try_to_unuse looks at mm = task0->mm increments mm->mm_users task 0 exits mm->owner needs to be updated, but no new owner is found (mm_users > 1, but no other task has task->mm = task0->mm) mm_update_next_owner() leaves mmput(mm) decrements mm->mm_users task0 freed dereferencing mm->owner fails The fix is to notify the subsystem via mm_owner_changed callback(), if no new owner is found, by specifying the new task as NULL. Jiri Slaby: mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but must be set after that, so as not to pass NULL as old owner causing oops. Daisuke Nishimura: mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task() and its callers need to take account of this situation to avoid oops. Hugh Dickins: Lockdep warning and hang below exec_mmap() when testing these patches. exit_mm() up_reads mmap_sem before calling mm_update_next_owner(), so exec_mmap() now needs to do the same. And with that repositioning, there's now no point in mm_need_new_owner() allowing for NULL mm. Reported-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Fix NULL pointer dereference in proc_sys_compareLinus Torvalds2008-09-291-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The VFS interface for the 'd_compare()' is a bit special (read: 'odd'), because it really just essentially replaces a memcmp(). The filesystem is supposed to just compare the two names with whatever case-independent or other function. And when I say 'is supposed to', I obviously mean that 'procfs does odd things, and actually looks at the dentry that we don't even pass down, rather than just the name'. Which results in problems, because we actually call d_compare before we have even verified that the dentry is still hashed at all. And that causes a problm since the inode that procfs looks at may have been free'd and the d_inode pointer is NULL. procfs just assumes that all dentries are positive, since procfs itself never generates a negative one. But memory pressure will still result in the dentry getting torn down, and as it is removed by RCU, it still remains visible on some lists - and to d_compare. If the filesystem just did a name comparison, we wouldn't care. And we could just fix procfs to know about negative dentries too. But rather than have the low-level filesystems know about internal VFS details, just move the check for a unhashed dentry up a bit, so that we will only call d_compare on dentries that are still active. The actual oops this caused didn't look like a NULL pointer dereference because procfs did a 'container_of(inode, struct proc_inode, vfs_inode)' to get at its internal proc_inode information from the inode pointer, and accessed a field below the inode. So the oops would look something like BUG: unable to handle kernel paging request at fffffffffffffff0 IP: [<ffffffff802bc6c6>] proc_sys_compare+0x36/0x50 and was seen on both x86-64 (Alexey Dobriyan and Hugh Dickins) and ppc64 (Hugh Dickins). Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Hugh Dickins <hugh@veritas.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-of-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://oss.sgi.com:8090/xfs/linux-2.6Linus Torvalds2008-09-261-91/+3Star
|\ | | | | | | | | | | * git://oss.sgi.com:8090/xfs/linux-2.6: [XFS] Remove xfs_iext_irec_compact_full() [XFS] Fix extent list corruption in xfs_iext_irec_compact_full().
| * [XFS] Remove xfs_iext_irec_compact_full()Lachlan McIlroy2008-09-261-92/+3Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Yet another bug was found in xfs_iext_irec_compact_full() and while the source of the bug was found it wasn't an easy task to track it down because the conditions are very difficult to reproduce. A HUGE thank-you goes to Russell Cattelan and Eric Sandeen for their significant effort in tracking down the source of this corruption. xfs_iext_irec_compact_full() and xfs_iext_irec_compact_pages() are almost identical - they both compact indirect extent lists by moving extents from subsequent buffers into earlier ones. xfs_iext_irec_compact_pages() only moves extents if all of the extents in the next buffer will fit into the empty space in the buffer before it. xfs_iext_irec_compact_full() will go a step further and move part of the next buffer if all the extents wont fit. It will then shift the remaining extents in the next buffer up to the start of the buffer. The bug here was that we did not update er_extoff and this caused extent list corruption. It does not appear that this extra functionality gains us much. Calling xfs_iext_irec_compact_pages() instead will do a good enough job at compacting the indirect list and will be quicker too. For the case in xfs_iext_indirect_to_direct() the total number of extents in the indirect list will fit into one buffer so we will never need the extra functionality of xfs_iext_irec_compact_full() there. Also xfs_iext_irec_compact_pages() doesn't need to do a memmove() (the buffers will never overlap) so we don't want the performance hit that can incur. SGI-PV: 987159 SGI-Modid: xfs-linux-melb:xfs-kern:32166a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
| * [XFS] Fix extent list corruption in xfs_iext_irec_compact_full().Lachlan McIlroy2008-09-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If we don't move all the records from the next buffer into the current buffer then we need to update the er_extoff field of the next buffer as we shift the remaining records to the start of the buffer. SGI-PV: 987159 SGI-Modid: xfs-linux-melb:xfs-kern:32165a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Russell Cattelan <cattelan@thebarn.com>
* | Merge branch 'linux-next' of git://git.infradead.org/~dedekind/ubifs-2.6Linus Torvalds2008-09-266-9/+15
|\ \ | |/ |/| | | | | | | | | | | * 'linux-next' of git://git.infradead.org/~dedekind/ubifs-2.6: UBIFS: fix printk format warnings UBIFS: remove incorrect assert UBIFS: TNC / GC race fixes UBIFS: create the name of the background thread in every case
| * UBIFS: fix printk format warningsAlexander Beregalov2008-09-182-2/+2
| | | | | | | | | | | | | | | | | | | | | | fs/ubifs/dir.c:428: warning: format '%llu' expects type 'long long unsigned int', but argument 5 has type 'long unsigned int' fs/ubifs/debug.c:541: warning: format '%llu' expects type 'long long unsigned int', but argument 2 has type 'long unsigned int' Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
| * UBIFS: remove incorrect assertAdrian Hunter2008-09-171-1/+0Star
| | | | | | | | | | | | | | | | | | The assert was not valid because one of the variables 'taken_empty_lebs' has transient values out of sync with the other variables. Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
| * UBIFS: TNC / GC race fixesAdrian Hunter2008-09-172-4/+12
| | | | | | | | | | | | | | | | | | - update GC sequence number if any nodes may have been moved even if GC did not finish the LEB - don't ignore error return when reading Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
| * UBIFS: create the name of the background thread in every caseSebastian Siewior2008-09-171-2/+1Star
| | | | | | | | | | | | | | | | | | If the ubifs partition is mounted RO and then remounted RW we end up with no thread name in ubifs_remount_rw() and the thread appears nameless. Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
* | 9p: use an IS_ERR test rather than a NULL testJulien Brunel2008-09-241-2/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of error, the function p9_client_walk returns an ERR pointer, but never returns a NULL pointer. So a NULL test that comes after an IS_ERR test should be deleted. The semantic match that finds this problem is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @match_bad_null_test@ expression x, E; statement S1,S2; @@ x = p9_client_walk(...) ... when != x = E * if (x != NULL) S1 else S2 // </smpl> Signed-off-by: Julien Brunel <brunel@diku.dk> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | [XFS] Don't do I/O beyond eof when unreserving spaceLachlan McIlroy2008-09-171-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When unreserving space with boundaries that are not block aligned we round up the start and round down the end boundaries and then use this function, xfs_zero_remaining_bytes(), to zero the parts of the blocks that got dropped during the rounding. The problem is we don't consider if these blocks are beyond eof. Worse still is if we encounter delayed allocations beyond eof we will try to use the magic delayed allocation block number as a real block number. If the file size is ever extended to expose these blocks then we'll go through xfs_zero_eof() to zero them anyway. SGI-PV: 983683 SGI-Modid: xfs-linux-melb:xfs-kern:32055a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org>
* | [XFS] Fix use-after-free with buffersLachlan McIlroy2008-09-171-24/+20Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have a use-after-free issue where log completions access buffers via the buffer log item and the buffer has already been freed. Fix this by taking a reference on the buffer when attaching the buffer log item and release the hold when the buffer log item is detached and we no longer need the buffer. Also create a new function xfs_buf_item_free() to combine some common code. SGI-PV: 985757 SGI-Modid: xfs-linux-melb:xfs-kern:32025a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org>
* | [XFS] Prevent lockdep false positives when locking two inodes.David Chinner2008-09-172-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we call xfs_lock_two_inodes() to grab both the iolock and the ilock, then drop the ilocks on both inodes, then grab them again (as xfs_swap_extents() does) then lockdep will report a locking order problem. This is a false positive. To avoid this, disallow xfs_lock_two_inodes() fom locking both inode locks at once - force calers to make two separate calls. This means that nested dropping and regaining of the ilocks will retain the same lockdep subclass and so lockdep will not see anything wrong with this code. SGI-PV: 986238 SGI-Modid: xfs-linux-melb:xfs-kern:31999a Signed-off-by: David Chinner <david@fromorbit.com> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Peter Leckie <pleckie@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* | [XFS] Fix barrier status change detection.David Chinner2008-09-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The current code in xlog_iodone() uses the wrong macro to check if the barrier has been cleared due to an EOPNOTSUPP error form the lower layer. SGI-PV: 986143 SGI-Modid: xfs-linux-melb:xfs-kern:31984a Signed-off-by: David Chinner <david@fromorbit.com> Signed-off-by: Nathaniel W. Turner <nate@houseofnate.net> Signed-off-by: Peter Leckie <pleckie@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* | [XFS] Prevent direct I/O from mapping extents beyond eofLachlan McIlroy2008-09-171-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the help from some tracing I found that we try to map extents beyond eof when doing a direct I/O read. It appears that the way to inform the generic direct I/O path (ie do_direct_IO()) that we have breached eof is to return an unmapped buffer from xfs_get_blocks_direct(). This will cause do_direct_IO() to jump to the hole handling code where is will check for eof and then abort. This problem was found because a direct I/O read was trying to map beyond eof and was encountering delayed allocations. The delayed allocations beyond eof are speculative allocations and they didn't get converted when the direct I/O flushed the file because there was only enough space in the current AG to convert and write out the dirty pages within eof. Note that xfs_iomap_write_allocate() wont necessarily convert all the delayed allocation passed to it - it will return after allocating the first extent - so if the delayed allocation extends beyond eof then it will stay that way. SGI-PV: 983683 SGI-Modid: xfs-linux-melb:xfs-kern:31929a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org>
* | [XFS] Fix regression introduced by remount fixupChristoph Hellwig2008-09-171-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Logically we would return an error in xfs_fs_remount code to prevent users from believing they might have changed mount options using remount which can't be changed. But unfortunately mount(8) adds all options from mtab and fstab to the mount arguments in some cases so we can't blindly reject options, but have to check for each specified option if it actually differs from the currently set option and only reject it if that's the case. Until that is implemented we return success for every remount request, and silently ignore all options that we can't actually change. SGI-PV: 985710 SGI-Modid: xfs-linux-melb:xfs-kern:31908a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* | [XFS] Move memory allocations for log tracing out of the critical pathLachlan McIlroy2008-09-172-21/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Memory allocations for log->l_grant_trace and iclog->ic_trace are done on demand when the first event is logged. In xlog_state_get_iclog_space() we call xlog_trace_iclog() under a spinlock and allocating memory here can cause us to sleep with a spinlock held and deadlock the system. For the log grant tracing we use KM_NOSLEEP but that means we can lose trace entries. Since there is no locking to serialize the log grant tracing we could race and have multiple allocations and leak memory. So move the allocations to where we initialize the log/iclog structures. Use KM_NOFS to avoid recursing into the filesystem and drop log->l_trace since it's not even used. SGI-PV: 983738 SGI-Modid: xfs-linux-melb:xfs-kern:31896a Signed-off-by: Lachlan McIlroy <lachlan@sgi.com> Signed-off-by: Christoph Hellwig <hch@infradead.org>
* | rescan_partitions(): make device capacity errors non-fatalAndrew Morton2008-09-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Herton Krzesinski reports that the error-checking changes in 04ebd4aee52b06a2c38127d9208546e5b96f3a19 ("block/ioctl.c and fs/partition/check.c: check value returned by add_partition") cause his buggy USB camera to no longer mount. "The camera is an Olympus X-840. The original issue comes from the camera itself: its format program creates a partition with an off by one error". Buggy devices happen. It is better for the kernel to warn and to proceed with the mount. Reported-by: Herton Ronaldo Krzesinski <herton@mandriva.com.br> Cc: Abdel Benamrouche <draconux@gmail.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: David Brownell <david-b@pacbell.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: ifdef Quicklists in /proc/meminfoHugh Dickins2008-09-131-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A "Quicklists: 0 kB" line has just started appearing in /proc/meminfo, but most architectures (including x86) don't have them configured, so #ifdef it, like the highmem lines. And those architectures which do have quicklists configured are using them for page tables: so let's place it next to PageTables. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | bfs: fix Lockdep warningEric Sesterhenn2008-09-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes: ============================================= [ INFO: possible recursive locking detected ] 2.6.27-rc5-00283-g70bb089 #68 --------------------------------------------- touch/6855 is trying to acquire lock: (&info->bfs_lock){--..}, at: [<c02262f5>] bfs_delete_inode+0x9e/0x18c but task is already holding lock: (&info->bfs_lock){--..}, at: [<c0226c00>] bfs_create+0x45/0x187 other info that might help us debug this: 2 locks held by touch/6855: #0: (&type->i_mutex_dir_key#5){--..}, at: [<c018ad13>] do_filp_open+0x10b/0x62f #1: (&info->bfs_lock){--..}, at: [<c0226c00>] bfs_create+0x45/0x187 stack backtrace: Pid: 6855, comm: touch Not tainted 2.6.27-rc5-00283-g70bb089 #68 [<c013e769>] validate_chain+0x458/0x9f4 [<c013bece>] ? trace_hardirqs_off+0xb/0xd [<c013f36b>] __lock_acquire+0x666/0x6e0 [<c013f440>] lock_acquire+0x5b/0x77 [<c02262f5>] ? bfs_delete_inode+0x9e/0x18c [<c06aab74>] mutex_lock_nested+0xbc/0x234 [<c02262f5>] ? bfs_delete_inode+0x9e/0x18c [<c02262f5>] ? bfs_delete_inode+0x9e/0x18c [<c02262f5>] bfs_delete_inode+0x9e/0x18c [<c0226257>] ? bfs_delete_inode+0x0/0x18c [<c01925e1>] generic_delete_inode+0x94/0xfe [<c019265d>] generic_drop_inode+0x12/0x12f [<c0191b7e>] iput+0x4b/0x4e [<c0226d1e>] bfs_create+0x163/0x187 [<c0188b42>] vfs_create+0xa6/0x114 [<c018adb5>] do_filp_open+0x1ad/0x62f [<c0107cdc>] ? native_sched_clock+0x82/0x96 [<c06ac309>] ? _spin_unlock+0x27/0x3c [<c019379e>] ? alloc_fd+0xbf/0xc9 [<c06ae2f4>] ? sub_preempt_count+0x9d/0xab [<c019379e>] ? alloc_fd+0xbf/0xc9 [<c0180391>] do_sys_open+0x42/0xb8 [<c041d564>] ? trace_hardirqs_on_thunk+0xc/0x10 [<c0180449>] sys_open+0x1e/0x26 [<c01038bd>] sysenter_do_call+0x12/0x31 ======================= The problem is that we don't unlock the bfs->lock mutex before calling iput (we do in the other cases). Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | proc: more debugging for "already registered" caseAlexey Dobriyan2008-09-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | Print parent directory name as well. The aim is to catch non-creation of parent directory when proc_mkdir will return NULL and all subsequent registrations go directly in /proc instead of intended directory. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Fixed insane printk string while at it. - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'for_linus' of ↵Linus Torvalds2008-09-112-25/+20Star
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6: udf: add llseek method udf: Fix error paths in udf_new_inode() udf: Fix lock inversion between iprune_mutex and alloc_mutex (v2)
| * | udf: add llseek methodChristoph Hellwig2008-09-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UDF currently doesn't set a llseek method for regular files, which means it will fall back to default_llseek. This means no one can seek beyond 2 Gigabytes on udf, and that there's not protection vs the i_size updates from writers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
| * | udf: Fix error paths in udf_new_inode()Jan Kara2008-08-191-23/+18Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I case we failed to allocate memory for inode when creating it, we did not properly free block already allocated for this inode. Move memory allocation before the block allocation which fixes this issue (thanks for the idea go to Ingo Oeser <ioe-lkml@rameria.de>). Also remove a few superfluous initializations already done in udf_alloc_inode(). Reviewed-by: Ingo Oeser <ioe-lkml@rameria.de> Signed-off-by: Jan Kara <jack@suse.cz>
| * | udf: Fix lock inversion between iprune_mutex and alloc_mutex (v2)Jan Kara2008-08-191-2/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A memory allocation inside alloc_mutex must not recurse back into the filesystem itself because that leads to lock inversion between iprune_mutex and alloc_mutex (and thus to deadlocks - see traces below). alloc_mutex is actually needed only to update allocation statistics in the superblock so we can drop it before we start allocating memory for the inode. tar D ffff81015b9c8c90 0 6614 6612 ffff8100d5a21a20 0000000000000086 0000000000000000 00000000ffff0000 ffff81015b9c8c90 ffff81015b8f0cd0 ffff81015b9c8ee0 0000000000000000 0000000000000003 0000000000000000 0000000000000000 0000000000000000 Call Trace: [<ffffffff803c1d8a>] __mutex_lock_slowpath+0x64/0x9b [<ffffffff803c1bef>] mutex_lock+0xa/0xb [<ffffffff8027f8c2>] shrink_icache_memory+0x38/0x200 [<ffffffff80257742>] shrink_slab+0xe3/0x15b [<ffffffff802579db>] try_to_free_pages+0x221/0x30d [<ffffffff8025657e>] isolate_pages_global+0x0/0x31 [<ffffffff8025324b>] __alloc_pages_internal+0x252/0x3ab [<ffffffff8026b08b>] cache_alloc_refill+0x22e/0x47b [<ffffffff8026ae37>] kmem_cache_alloc+0x3b/0x61 [<ffffffff8026b15b>] cache_alloc_refill+0x2fe/0x47b [<ffffffff8026b34e>] __kmalloc+0x76/0x9c [<ffffffffa00751f2>] :udf:udf_new_inode+0x202/0x2e2 [<ffffffffa007ae5e>] :udf:udf_create+0x2f/0x16d [<ffffffffa0078f27>] :udf:udf_lookup+0xa6/0xad ... kswapd0 D ffff81015b9d9270 0 125 2 ffff81015b903c28 0000000000000046 ffffffff8028cbb0 00000000fffffffb ffff81015b9d9270 ffff81015b8f0cd0 ffff81015b9d94c0 000000000271b490 ffffe2000271b458 ffffe2000271b420 ffffe20002728dc8 ffffe20002728d90 Call Trace: [<ffffffff8028cbb0>] __set_page_dirty+0xeb/0xf5 [<ffffffff8025403a>] get_dirty_limits+0x1d/0x22f [<ffffffff803c1d8a>] __mutex_lock_slowpath+0x64/0x9b [<ffffffff803c1bef>] mutex_lock+0xa/0xb [<ffffffffa0073f58>] :udf:udf_bitmap_free_blocks+0x47/0x1eb [<ffffffffa007df31>] :udf:udf_discard_prealloc+0xc6/0x172 [<ffffffffa007875a>] :udf:udf_clear_inode+0x1e/0x48 [<ffffffff8027f121>] clear_inode+0x6d/0xc4 [<ffffffff8027f7f2>] dispose_list+0x56/0xee [<ffffffff8027fa5a>] shrink_icache_memory+0x1d0/0x200 [<ffffffff80257742>] shrink_slab+0xe3/0x15b [<ffffffff80257e93>] kswapd+0x346/0x447 ... Reported-by: Tibor Tajti <tibor.tajti@gmail.com> Reviewed-by: Ingo Oeser <ioe-lkml@rameria.de> Signed-off-by: Jan Kara <jack@suse.cz>
* | | ocfs2: Fix a bug in direct IO read.Tao Ma2008-09-101-1/+1
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ocfs2 will become read-only if we try to read the bytes which pass the end of i_size. This can be easily reproduced by following steps: 1. mkfs a ocfs2 volume with bs=4k cs=4k and nosparse. 2. create a small file(say less than 100 bytes) and we will create the file which is allocated 1 cluster. 3. read 8196 bytes from the kernel using O_DIRECT which exceeds the limit. 4. The ocfs2 volume becomes read-only and dmesg shows: OCFS2: ERROR (device sda13): ocfs2_direct_IO_get_blocks: Inode 66010 has a hole at block 1 File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted. So suppress the ERROR message. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>