summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'nfs-for-3.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2014-11-1511-64/+83
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client bugfixes from Trond Myklebust: "Highlights include: - stable patches to fix NFSv4.x delegation reclaim error paths - fix a bug whereby we were advertising NFSv4.1 but using NFSv4.2 features - fix a use-after-free problem with pNFS block layouts - fix a memory leak in the pNFS files O_DIRECT code - replace an intrusive and Oops-prone performance fix in the NFSv4 atomic open code with a safer one-line version and revert the two original patches" * tag 'nfs-for-3.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: sunrpc: fix sleeping under rcu_read_lock in gss_stringify_acceptor NFS: Don't try to reclaim delegation open state if recovery failed NFSv4: Ensure that we call FREE_STATEID when NFSv4.x stateids are revoked NFSv4: Fix races between nfs_remove_bad_delegation() and delegation return NFSv4.1: nfs41_clear_delegation_stateid shouldn't trust NFS_DELEGATED_STATE NFSv4: Ensure that we remove NFSv4.0 delegations when state has expired NFS: SEEK is an NFS v4.2 feature nfs: Fix use of uninitialized variable in nfs_getattr() nfs: Remove bogus assignment nfs: remove spurious WARN_ON_ONCE in write path pnfs/blocklayout: serialize GETDEVICEINFO calls nfs: fix pnfs direct write memory leak Revert "NFS: nfs4_do_open should add negative results to the dcache." Revert "NFS: remove BUG possibility in nfs4_open_and_get_state" NFSv4: Ensure nfs_atomic_open set the dentry verifier on ENOENT
| * NFS: Don't try to reclaim delegation open state if recovery failedTrond Myklebust2014-11-121-0/+2
| | | | | | | | | | | | | | | | | | If state recovery failed, then we should not attempt to reclaim delegated state. http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * NFSv4: Ensure that we call FREE_STATEID when NFSv4.x stateids are revokedTrond Myklebust2014-11-122-11/+0Star
| | | | | | | | | | | | | | | | | | NFSv4.x (x>0) requires us to call TEST_STATEID+FREE_STATEID if a stateid is revoked. We will currently fail to do this if the stateid is a delegation. http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * NFSv4: Fix races between nfs_remove_bad_delegation() and delegation returnTrond Myklebust2014-11-123-3/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Any attempt to call nfs_remove_bad_delegation() while a delegation is being returned is currently a no-op. This means that we can end up looping forever in nfs_end_delegation_return() if something causes the delegation to be revoked. This patch adds a mechanism whereby the state recovery code can communicate to the delegation return code that the delegation is no longer valid and that it should not be used when reclaiming state. It also changes the return value for nfs4_handle_delegation_recall_error() to ensure that nfs_end_delegation_return() does not reattempt the lock reclaim before state recovery is done. http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * NFSv4.1: nfs41_clear_delegation_stateid shouldn't trust NFS_DELEGATED_STATETrond Myklebust2014-11-121-25/+17Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes the assumption made previously, that we only need to check the delegation stateid when it matches the stateid on a cached open. If we believe that we hold a delegation for this file, then we must assume that its stateid may have been revoked or expired too. If we don't test it then our state recovery process may end up caching open/lock state in a situation where it should not. We therefore rename the function nfs41_clear_delegation_stateid as nfs41_check_delegation_stateid, and change it to always run through the delegation stateid test and recovery process as outlined in RFC5661. http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * NFSv4: Ensure that we remove NFSv4.0 delegations when state has expiredTrond Myklebust2014-11-121-1/+23
| | | | | | | | | | | | | | | | | | | | NFSv4.0 does not have TEST_STATEID/FREE_STATEID functionality, so unlike NFSv4.1, the recovery procedure when stateids have expired or have been revoked requires us to just forget the delegation. http://lkml.kernel.org/r/CAN-5tyHwG=Cn2Q9KsHWadewjpTTy_K26ee+UnSvHvG4192p-Xw@mail.gmail.com Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * NFS: SEEK is an NFS v4.2 featureAnna Schumaker2014-11-121-3/+3
| | | | | | | | | | | | | | | | | | Somehow the nfs_v4_1_minor_ops had the NFS_CAP_SEEK flag set, enabling SEEK over v4.1. This is wrong, and can make servers crash. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Tested-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * nfs: Fix use of uninitialized variable in nfs_getattr()Jan Kara2014-11-121-1/+1
| | | | | | | | | | | | | | | | | | | | Variable 'err' needn't be initialized when nfs_getattr() uses it to check whether it should call generic_fillattr() or not. That can result in spurious error returns. Initialize 'err' properly. Signed-off-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * nfs: Remove bogus assignmentJan Kara2014-11-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Commit 3a6fd1f004fc (pnfs/blocklayout: remove read-modify-write handling in bl_write_pagelist) introduced a bogus assignment pg_index = pg_index in variable initialization. AFAICS it's just a typo so remove it. Spotted by Coverity (id 1248711). CC: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * nfs: remove spurious WARN_ON_ONCE in write pathWeston Andros Adamson2014-11-121-2/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This WARN_ON_ONCE was supposed to catch reference counting bugs, but can trigger in inappropriate situations. This was reproducible using NFSv2 on an architecture with 64K pages -- we verified that it was not a reference counting bug and the warning was safe to ignore. Reported-by: Will Deacon <will.deacon@arm.com> Tested-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * pnfs/blocklayout: serialize GETDEVICEINFO callsChristoph Hellwig2014-11-122-5/+10
| | | | | | | | | | | | | | | | | | | | The rpc_pipefs code isn't thread safe, leading to occasional use after frees when running xfstests generic/241 (dbench). Signed-off-by: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/1411740170-18611-2-git-send-email-hch@lst.de Cc: stable@vger.kernel.org # 3.17.x Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * nfs: fix pnfs direct write memory leakPeng Tao2014-11-121-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | For pNFS direct writes, layout driver may dynamically allocate ds_cinfo.buckets. So we need to take care to free them when freeing dreq. Ideally this needs to be done inside layout driver where ds_cinfo.buckets are allocated. But buckets are attached to dreq and reused across LD IO iterations. So I feel it's OK to free them in the generic layer. Cc: stable@vger.kernel.org [v3.4+] Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * Revert "NFS: nfs4_do_open should add negative results to the dcache."Trond Myklebust2014-11-051-8/+1Star
| | | | | | | | This reverts commit 4fa2c54b5198d09607a534e2fd436581064587ed.
| * Revert "NFS: remove BUG possibility in nfs4_open_and_get_state"Trond Myklebust2014-11-051-7/+3Star
| | | | | | | | This reverts commit f39c01047994e66e7f3d89ddb4c6141f23349d8d.
| * NFSv4: Ensure nfs_atomic_open set the dentry verifier on ENOENTTrond Myklebust2014-11-051-0/+1
| | | | | | | | | | | | | | | | | | | | If the OPEN rpc call to the server fails with an ENOENT call, nfs_atomic_open will create a negative dentry for that file, however it currently fails to call nfs_set_verifier(), thus causing the dentry to be immediately revalidated on the next call to nfs_lookup_revalidate() instead of following the usual lookup caching rules. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* | Merge branch 'akpm' (fixes from Andrew Morton)Linus Torvalds2014-11-145-25/+67
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge misc fixes from Andrew Morton: "15 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: MAINTAINERS: add IIO include files kernel/panic.c: update comments for print_tainted mem-hotplug: reset node present pages when hot-adding a new pgdat mem-hotplug: reset node managed pages when hot-adding a new pgdat mm/debug-pagealloc: correct freepage accounting and order resetting fanotify: fix notification of groups with inode & mount marks mm, compaction: prevent infinite loop in compact_zone mm: alloc_contig_range: demote pages busy message from warn to info mm/slab: fix unalignment problem on Malta with EVA due to slab merge mm/page_alloc: restrict max order of merging on isolated pageblock mm/page_alloc: move freepage counting logic to __free_one_page() mm/page_alloc: add freepage on isolate pageblock to correct buddy list mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype mm/compaction: skip the range until proper target pageblock is met zram: avoid kunmap_atomic() of a NULL pointer
| * | fanotify: fix notification of groups with inode & mount marksJan Kara2014-11-145-25/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fsnotify() needs to merge inode and mount marks lists when notifying groups about events so that ignore masks from inode marks are reflected in mount mark notifications and groups are notified in proper order (according to priorities). Currently the sorting of the lists done by fsnotify_add_inode_mark() / fsnotify_add_vfsmount_mark() and fsnotify() differed which resulted ignore masks not being used in some cases. Fix the problem by always using the same comparison function when sorting / merging the mark lists. Thanks to Heinrich Schuchardt for improvements of my patch. Link: https://bugzilla.kernel.org/show_bug.cgi?id=87721 Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | ceph: fix flush tid comparisionYan, Zheng2014-11-131-1/+1
|/ / | | | | | | | | | | | | | | | | | | | | | | | | TID of cap flush ack is 64 bits, but ceph_inode_info::flushing_cap_tid is only 16 bits. 16 bits should be plenty to let the cap flush updates pipeline appropriately, but we need to cast in the proper direction when comparing these differently-sized versions. So downcast the 64-bits one to 16 bits. Reflects ceph.git commit a5184cf46a6e867287e24aeb731634828467cd98. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2014-11-091-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "It's a one liner for an error cleanup path that leads to crashes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup
| * | Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanupChris Mason2014-11-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we hit any errors in btrfs_lookup_csums_range, we'll loop through all the csums we allocate and free them. But the code was using list_entry incorrectly, and ended up trying to free the on-stack list_head instead. This bug came from commit 0678b6185 btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range() Signed-off-by: Chris Mason <clm@fb.com> Reported-by: Erik Berg <btrfs@slipsprogrammoer.no> cc: stable@vger.kernel.org # 3.3 or newer
* | | Merge tag 'xfs-for-linus-3.18-rc3' of ↵Linus Torvalds2014-11-073-196/+142Star
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs fixes from Dave Chinner: "This update fixes a warning in the new pagecache_isize_extended() and updates some related comments, another fix for zero-range misbehaviour, and an unforntuately large set of fixes for regressions in the bulkstat code. The bulkstat fixes are large but necessary. I wouldn't normally push such a rework for a -rcX update, but right now xfsdump can silently create incomplete dumps on 3.17 and it's possible that even xfsrestore won't notice that the dumps were incomplete. Hence we need to get this update into 3.17-stable kernels ASAP. In more detail, the refactoring work I committed in 3.17 has exposed a major hole in our QA coverage. With both xfsdump (the major user of bulkstat) and xfsrestore silently ignoring missing files in the dump/restore process, incomplete dumps were going unnoticed if they were being triggered. Many of the dump/restore filesets were so small that they didn't evenhave a chance of triggering the loop iteration bugs we introduced in 3.17, so we didn't exercise the code sufficiently, either. We have already taken steps to improve QA coverage in xfstests to avoid this happening again, and I've done a lot of manual verification of dump/restore on very large data sets (tens of millions of inodes) of the past week to verify this patch set results in bulkstat behaving the same way as it does on 3.16. Unfortunately, the fixes are not exactly simple - in tracking down the problem historic API warts were discovered (e.g xfsdump has been working around a 20 year old bug in the bulkstat API for the past 10 years) and so that complicated the process of diagnosing and fixing the problems. i.e. we had to fix bugs in the code as well as discover and re-introduce the userspace visible API bugs that we unwittingly "fixed" in 3.17 that xfsdump relied on to work correctly. Summary: - incorrect warnings about i_mutex locking in pagecache_isize_extended() and updates comments to match expected locking - another zero-range bug fix for stray file size updates - a bunch of fixes for regression in the bulkstat code introduced in 3.17" * tag 'xfs-for-linus-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: track bulkstat progress by agino xfs: bulkstat error handling is broken xfs: bulkstat main loop logic is a mess xfs: bulkstat chunk-formatter has issues xfs: bulkstat chunk formatting cursor is broken xfs: bulkstat btree walk doesn't terminate mm: Fix comment before truncate_setsize() xfs: rework zero range to prevent invalid i_size updates mm: Remove false WARN_ON from pagecache_isize_extended() xfs: Check error during inode btree iteration in xfs_bulkstat() xfs: bulkstat doesn't release AGI buffer on error
| * | | xfs: track bulkstat progress by aginoDave Chinner2014-11-061-37/+34Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The bulkstat main loop progress is tracked by the "lastino" variable, which is a full 64 bit inode. However, the loop actually works on agno/agino pairs, and so there's a significant disconnect between the rest of the loop and the main cursor. Convert this to use the agino, and pass the agino into the chunk formatting function and convert it too. This gets rid of the inconsistency in the loop processing, and finally makes it simple for us to skip inodes at any point in the loop simply by incrementing the agino cursor. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | xfs: bulkstat error handling is brokenDave Chinner2014-11-061-10/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The error propagation is a horror - xfs_bulkstat() returns a rval variable which is only set if there are formatter errors. Any sort of btree walk error or corruption will cause the bulkstat walk to terminate but will not pass an error back to userspace. Worse is the fact that formatter errors will also be ignored if any inodes were correctly formatted into the user buffer. Hence bulkstat can fail badly yet still report success to userspace. This causes significant issues with xfsdump not dumping everything in the filesystem yet reporting success. It's not until a restore fails that there is any indication that the dump was bad and tha bulkstat failed. This patch now triggers xfsdump to fail with bulkstat errors rather than silently missing files in the dump. This now causes bulkstat to fail when the lastino cookie does not fall inside an existing inode chunk. The pre-3.17 code tolerated that error by allowing the code to move to the next inode chunk as the agino target is guaranteed to fall into the next btree record. With the fixes up to this point in the series, xfsdump now passes on the troublesome filesystem image that exposes all these bugs. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * | | xfs: bulkstat main loop logic is a messDave Chinner2014-11-061-32/+24Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are a bunch of variables tha tare more wildy scoped than they need to be, obfuscated user buffer checks and tortured "next inode" tracking. This all needs cleaning up to expose the real issues that need fixing. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | xfs: bulkstat chunk-formatter has issuesDave Chinner2014-11-061-34/+24Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The loop construct has issues: - clustidx is completely unused, so remove it. - the loop tries to be smart by terminating when the "freecount" tells it that all inodes are free. Just drop it as in most cases we have to scan all inodes in the chunk anyway. - move the "user buffer left" condition check to the only point where we consume space int eh user buffer. - move the initialisation of agino out of the loop, leaving just a simple loop control logic using the clusteridx. Also, double handling of the user buffer variables leads to problems tracking the current state - use the cursor variables directly rather than keeping local copies and then having to update the cursor before returning. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | xfs: bulkstat chunk formatting cursor is brokenDave Chinner2014-11-062-47/+28Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The xfs_bulkstat_agichunk formatting cursor takes buffer values from the main loop and passes them via the structure to the chunk formatter, and the writes the changed values back into the main loop local variables. Unfortunately, this complex dance is full of corner cases that aren't handled correctly. The biggest problem is that it is double handling the information in both the main loop and the chunk formatting function, leading to inconsistent updates and endless loops where progress is not made. To fix this, push the struct xfs_bulkstat_agichunk outwards to be the primary holder of user buffer information. this removes the double handling in the main loop. Also, pass the last inode processed by the chunk formatter as a separate parameter as it purely an output variable and is not related to the user buffer consumption cursor. Finally, the chunk formatting code is not shared by anyone, so make it local to xfs_itable.c. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | xfs: bulkstat btree walk doesn't terminateDave Chinner2014-11-061-9/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The bulkstat code has several different ways of detecting the end of an AG when doing a walk. They are not consistently detected, and the code that checks for the end of AG conditions is not consistently coded. Hence the are conditions where the walk code can get stuck in an endless loop making no progress and not triggering any termination conditions. Convert all the "tmp/i" status return codes from btree operations to a common name (stat) and apply end-of-ag detection to these operations consistently. cc: <stable@vger.kernel.org> # 3.17 Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | xfs: rework zero range to prevent invalid i_size updatesBrian Foster2014-10-301-52/+20Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The zero range operation is analogous to fallocate with the exception of converting the range to zeroes. E.g., it attempts to allocate zeroed blocks over the range specified by the caller. The XFS implementation kills all delalloc blocks currently over the aligned range, converts the range to allocated zero blocks (unwritten extents) and handles the partial pages at the ends of the range by sending writes through the pagecache. The current implementation suffers from several problems associated with inode size. If the aligned range covers an extending I/O, said I/O is discarded and an inode size update from a previous write never makes it to disk. Further, if an unaligned zero range extends beyond eof, the page write induced for the partial end page can itself increase the inode size, even if the zero range request is not supposed to update i_size (via KEEP_SIZE, similar to an fallocate beyond EOF). The latter behavior not only incorrectly increases the inode size, but can lead to stray delalloc blocks on the inode. Typically, post-eof preallocation blocks are either truncated on release or inode eviction or explicitly written to by xfs_zero_eof() on natural file size extension. If the inode size increases due to zero range, however, associated blocks leak into the address space having never been converted or mapped to pagecache pages. A direct I/O to such an uncovered range cannot convert the extent via writeback and will BUG(). For example: $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file> ... $ xfs_io -d -c "pread 128k 128k" <file> <BUG> If the entire delalloc extent happens to not have page coverage whatsoever (e.g., delalloc conversion couldn't find a large enough free space extent), even a full file writeback won't convert what's left of the extent and we'll assert on inode eviction. Rework xfs_zero_file_space() to avoid buffered I/O for partial pages. Use the existing hole punch and prealloc mechanisms as primitives for zero range. This implementation is not efficient nor ideal as we writeback dirty data over the range and remove existing extents rather than convert to unwrittern. The former writeback, however, is currently the only mechanism available to ensure consistency between pagecache and extent state. Even a pagecache truncate/delalloc punch prior to hole punch has lead to inconsistencies due to racing with writeback. This provides a consistent, correct implementation of zero range that survives fsstress/fsx testing without assert failures. The implementation can be optimized from this point forward once the fundamental issue of pagecache and delalloc extent state consistency is addressed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | xfs: Check error during inode btree iteration in xfs_bulkstat()Jan Kara2014-10-301-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_bulkstat() doesn't check error return from xfs_btree_increment(). In case of specific fs corruption that could result in xfs_bulkstat() entering an infinite loop because we would be looping over the same chunk over and over again. Fix the problem by checking the return value and terminating the loop properly. Coverity-id: 1231338 cc: <stable@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Jie Liu <jeff.u.liu@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | xfs: bulkstat doesn't release AGI buffer on errorDave Chinner2014-10-281-6/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recent refactoring of the bulkstat code left a small landmine in the code. If a inobt read fails, then the tree walk is aborted and returns without releasing the AGI buffer or freeing the cursor. This can lead to a subsequent bulkstat call hanging trying to grab the AGI buffer again. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
* | | | fix breakage in o2net_send_tcp_msg()Al Viro2014-11-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | uninitialized msghdr. Broken in "ocfs2: don't open-code kernel_recvmsg()" by me ;-/ Cc: stable@vger.kernel.org # 3.15+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | ovl: don't poison cursorMiklos Szeredi2014-11-051-1/+1
| |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | ovl_cache_put() can be called from ovl_dir_reset() if the cache needs to be rebuilt. We did list_del() on the cursor, which results in an Oops on the poisoned pointer in ovl_seek_cursor(). Reported-by: Jordi Pujol Palomer <jordipujolp@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Jordi Pujol Palomer <jordipujolp@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2014-11-025-49/+19Star
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS fixes from Al Viro: "A bunch of assorted fixes, most of them followups to overlayfs merge" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ovl: initialize ->is_cursor Return short read or 0 at end of a raw device, not EIO isofs: don't bother with ->d_op for normal case isofs_cmp(): we'll never see a dentry for . or .. overlayfs: fix lockdep misannotation ovl: fix check for cursor overlayfs: barriers for opening upper-layer directory rcu: Provide counterpart to rcu_dereference() for non-RCU situations staging: android: logger: Fix log corruption regression
| * | | ovl: initialize ->is_cursorMiklos Szeredi2014-10-311-0/+1
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | Return short read or 0 at end of a raw device, not EIODavid Jeffery2014-10-311-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Author: David Jeffery <djeffery@redhat.com> Changes to the basic direct I/O code have broken the raw driver when reading to the end of a raw device. Instead of returning a short read for a read that extends partially beyond the device's end or 0 when at the end of the device, these reads now return EIO. The raw driver needs the same end of device handling as was added for normal block devices. Using blkdev_read_iter, which has the needed size checks, prevents the EIO conditions at the end of the device. Signed-off-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | isofs: don't bother with ->d_op for normal caseAl Viro2014-10-312-22/+4Star
| | | | | | | | | | | | | | | | | | | | | | | | we only need it for joliet and case-insensitive mounts Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | isofs_cmp(): we'll never see a dentry for . or ..Al Viro2014-10-281-18/+2Star
| | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | overlayfs: fix lockdep misannotationMiklos Szeredi2014-10-282-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In an overlay directory that shadows an empty lower directory, say /mnt/a/empty102, do: touch /mnt/a/empty102/x unlink /mnt/a/empty102/x rmdir /mnt/a/empty102 It's actually harmless, but needs another level of nesting between I_MUTEX_CHILD and I_MUTEX_NORMAL. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | ovl: fix check for cursorMiklos Szeredi2014-10-281-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ovl_cache_entry.name is now an array not a pointer, so it makes no sense test for it being NULL. Detected by coverity. From: Miklos Szeredi <mszeredi@suse.cz> Fixes: 68bf8611076a ("overlayfs: make ovl_cache_entry->name an array instead of +pointer") Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | overlayfs: barriers for opening upper-layer directoryAl Viro2014-10-281-1/+2
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | make sure that a) all stores done by opening struct file don't leak past storing the reference in od->upperfile b) the lockless side has read dependency barrier Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2014-11-015-39/+27Star
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Filipe is nailing down some problems with our skinny extent variation, and Dave's patch fixes endian problems in the new super block checks" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items Btrfs: properly clean up btrfs_end_io_wq_cache Btrfs: fix invalid leaf slot access in btrfs_lookup_extent() btrfs: use macro accessors in superblock validation checks
| * | Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent itemsFilipe Manana2014-10-281-8/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have a race that can lead us to miss skinny extent items in the function btrfs_lookup_extent_info() when the skinny metadata feature is enabled. So basically the sequence of steps is: 1) We search in the extent tree for the skinny extent, which returns > 0 (not found); 2) We check the previous item in the returned leaf for a non-skinny extent, and we don't find it; 3) Because we didn't find the non-skinny extent in step 2), we release our path to search the extent tree again, but this time for a non-skinny extent key; 4) Right after we released our path in step 3), a skinny extent was inserted in the extent tree (delayed refs were run) - our second extent tree search will miss it, because it's not looking for a skinny extent; 5) After the second search returned (with ret > 0), we look for any delayed ref for our extent's bytenr (and we do it while holding a read lock on the leaf), but we won't find any, as such delayed ref had just run and completed after we released out path in step 3) before doing the second search. Fix this by removing completely the path release and re-search logic. This is safe, because if we seach for a metadata item and we don't find it, we have the guarantee that the returned leaf is the one where the item would be inserted, and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the non-skinny extent item is if it exists. The only case where path->slots[0] is zero is when there are no smaller keys in the tree (i.e. no left siblings for our leaf), in which case the re-search logic isn't needed as well. This race has been present since the introduction of skinny metadata (change 3173a18f70554fe7880bb2d85c7da566e364eb3c). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | Btrfs: properly clean up btrfs_end_io_wq_cacheJosef Bacik2014-10-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In one of Dave's cleanup commits he forgot to call btrfs_end_io_wq_exit on unload, which makes us unable to unload and then re-load the btrfs module. This fixes the problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()Filipe Manana2014-10-273-10/+4Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we couldn't find our extent item, we accessed the current slot (path->slots[0]) to check if it corresponds to an equivalent skinny metadata item. However this slot could be beyond our last item in the leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case we shouldn't process it. Since btrfs_lookup_extent() is only used to find extent items for data extents, fix this by removing completely the logic that looks up for an equivalent skinny metadata item, since it can not exist. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | btrfs: use macro accessors in superblock validation checksDavid Sterba2014-10-271-21/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The initial patch c926093ec516f5d316 (btrfs: add more superblock checks) did not properly use the macro accessors that wrap endianness and the code would not work correctly on big endian machines. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
* | | Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2014-11-018-28/+51
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "A set of miscellaneous ext4 bug fixes for 3.18" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: make ext4_ext_convert_to_initialized() return proper number of blocks ext4: bail early when clearing inode journal flag fails ext4: bail out from make_indexed_dir() on first error jbd2: use a better hash function for the revoke table ext4: prevent bugon on race between write/fcntl ext4: remove extent status procfs files if journal load fails ext4: disallow changing journal_csum option during remount ext4: enable journal checksum when metadata checksum feature enabled ext4: fix oops when loading block bitmap failed ext4: fix overflow when updating superblock backups after resize
| * | | ext4: make ext4_ext_convert_to_initialized() return proper number of blocksJan Kara2014-10-301-5/+4Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ext4_ext_convert_to_initialized() can return more blocks than are actually allocated from map->m_lblk in case where initial part of the on-disk extent is zeroed out. Luckily this doesn't have serious consequences because the caller currently uses the return value only to unmap metadata buffers. Anyway this is a data corruption/exposure problem waiting to happen so fix it. Coverity-id: 1226848 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: bail early when clearing inode journal flag failsJan Kara2014-10-301-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When clearing inode journal flag, we call jbd2_journal_flush() to force all the journalled data to their final locations. Currently we ignore when this fails and continue clearing inode journal flag. This isn't a big problem because when jbd2_journal_flush() fails, journal is likely aborted anyway. But it can still lead to somewhat confusing results so rather bail out early. Coverity-id: 989044 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: bail out from make_indexed_dir() on first errorJan Kara2014-10-301-10/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When ext4_handle_dirty_dx_node() or ext4_handle_dirty_dirent_node() fail, there's really something wrong with the fs and there's no point in continuing further. Just return error from make_indexed_dir() in that case. Also initialize frames array so that if we return early due to error, dx_release() doesn't try to dereference uninitialized memory (which could happen also due to error in do_split()). Coverity-id: 741300 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
| * | | jbd2: use a better hash function for the revoke tableTheodore Ts'o2014-10-301-8/+2Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The old hash function didn't work well for 64-bit block numbers, and used undefined (negative) shift right behavior. Use the generic 64-bit hash function instead. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>