summaryrefslogtreecommitdiffstats
path: root/fs/ext4/super.c
Commit message (Collapse)AuthorAgeFilesLines
* ext4: track extent status tree shrinker delay staticticsZheng Liu2014-09-021-8/+3Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds some statictics in extent status tree shrinker. The purpose to add these is that we want to collect more details when we encounter a stall caused by extent status tree shrinker. Here we count the following statictics: stats: the number of all objects on all extent status trees the number of reclaimable objects on lru list cache hits/misses the last sorted interval the number of inodes on lru list average: scan time for shrinking some objects the number of shrunk objects maximum: the inode that has max nr. of objects on lru list the maximum scan time for shrinking some objects The output looks like below: $ cat /proc/fs/ext4/sda1/es_shrinker_info stats: 28228 objects 6341 reclaimable objects 5281/631 cache hits/misses 586 ms last sorted interval 250 inodes on lru list average: 153 us scan time 128 shrunk objects maximum: 255 inode (255 objects, 198 reclaimable) 125723 us max scan time If the lru list has never been sorted, the following line will not be printed: 586ms last sorted interval If there is an empty lru list, the following lines also will not be printed: 250 inodes on lru list ... maximum: 255 inode (255 objects, 198 reclaimable) 0 us max scan time Meanwhile in this commit a new trace point is defined to print some details in __ext4_es_shrink(). Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: enable block_validity by defaultDarrick J. Wong2014-09-021-2/+2
| | | | | | | | | Enable by default the block_validity feature, which checks for collisions between newly allocated blocks and critical system metadata. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: convert ext4_bread() to use the ERR_PTR conventionTheodore Ts'o2014-08-301-10/+8Star
| | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: fix descriptor block size handling errors with journal_csumDarrick J. Wong2014-08-291-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | It turns out that there are some serious problems with the on-disk format of journal checksum v2. The foremost is that the function to calculate descriptor tag size returns sizes that are too big. This causes alignment issues on some architectures and is compounded by the fact that some parts of jbd2 use the structure size (incorrectly) to determine the presence of a 64bit journal instead of checking the feature flags. Therefore, introduce journal checksum v3, which enlarges the descriptor block tag format to allow for full 32-bit checksums of journal blocks, fix the journal tag function to return the correct sizes, and fix the jbd2 recovery code to use feature flags to determine 64bitness. Add a few function helpers so we don't have to open-code quite so many pieces. Switching to a 16-byte block size was found to increase journal size overhead by a maximum of 0.1%, to convert a 32-bit journal with no checksumming to a 32-bit journal with checksum v3 enabled. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: TR Reardon <thomas_reardon@hotmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
* ext4: rearrange initialization to fix EXT4FS_DEBUGTheodore Ts'o2014-07-151-49/+39Star
| | | | | | | | | | | | | | | | | | | | The EXT4FS_DEBUG is a *very* developer specific #ifdef designed for ext4 developers only. (You have to modify fs/ext4/ext4.h to enable it.) Rearrange how we initialize data structures to avoid calling ext4_count_free_clusters() until the multiblock allocator has been initialized. This also allows us to only call ext4_count_free_clusters() once, and simplifies the code somewhat. (Thanks to Chen Gang <gang.chen.5i5j@gmail.com> for pointing out a !CONFIG_SMP compile breakage in the original patch.) Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
* ext4: revert commit which was causing fs corruption after journal replaysTheodore Ts'o2014-07-111-27/+24Star
| | | | | | | | | | | | | | | | | | | | | | | Commit 007649375f6af2 ("ext4: initialize multi-block allocator before checking block descriptors") causes the block group descriptor's count of the number of free blocks to become inconsistent with the number of free blocks in the allocation bitmap. This is a harmless form of fs corruption, but it causes the kernel to potentially remount the file system read-only, or to panic, depending on the file systems's error behavior. Thanks to Eric Whitney for his tireless work to reproduce and to find the guilty commit. Fixes: 007649375f6af2 ("ext4: initialize multi-block allocator before checking block descriptors" Cc: stable@vger.kernel.org # 3.15 Reported-by: David Jander <david@protonic.nl> Reported-by: Matteo Croce <technoboy85@gmail.com> Tested-by: Eric Whitney <enwlinux@gmail.com> Suggested-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: disable synchronous transaction batching if max_batch_time==0Eric Sandeen2014-07-061-2/+0Star
| | | | | | | | | | | | | | The mount manpage says of the max_batch_time option, This optimization can be turned off entirely by setting max_batch_time to 0. But the code doesn't do that. So fix the code to do that. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
* ext4: clarify error count warning messagesTheodore Ts'o2014-07-061-3/+4
| | | | | | | | | | | Make it clear that values printed are times, and that it is error since last fsck. Also add note about fsck version required. Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@vger.kernel.org
* ext4: add missing BUFFER_TRACE before ext4_journal_get_write_accessliang xie2014-05-131-0/+1
| | | | | | | Make them more consistently Signed-off-by: xieliang <xieliang@xiaomi.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: remove unnecessary double parenthesesLukas Czerner2014-05-121-1/+1
| | | | | Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: make local functions staticStephen Hemminger2014-05-121-2/+2
| | | | | | | | I have been running make namespacecheck to look for unneeded globals, and found these in ext4. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystemDarrick J. Wong2014-05-121-0/+10
| | | | | | | | | | | On a filesystem with a 1k block size, the group descriptors live in block 2, not block 1. If the filesystem has bigalloc,meta_bg set, however, the calculation of the group descriptor table location does not take this into account and returns the wrong block number. Fix the calculation to return the correct value for this case. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: add a new spinlock i_raw_lock to protect the ext4's raw inodeTheodore Ts'o2014-04-211-0/+1
| | | | | | | | To avoid potential data races, use a spinlock which protects the raw (on-disk) inode. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* ext4: rename uninitialized extents to unwrittenLukas Czerner2014-04-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Currently in ext4 there is quite a mess when it comes to naming unwritten extents. Sometimes we call it uninitialized and sometimes we refer to it as unwritten. The right name for the extent which has been allocated but does not contain any written data is _unwritten_. Other file systems are using this name consistently, even the buffer head state refers to it as unwritten. We need to fix this confusion in ext4. This commit changes every reference to an uninitialized extent (meaning allocated but unwritten) to unwritten extent. This includes comments, function names and variable names. It even covers abbreviation of the word uninitialized (such as uninit) and some misspellings. This commit does not change any of the code paths at all. This has been confirmed by comparing md5sums of the assembly code of each object file after all the function names were stripped from it. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: initialize multi-block allocator before checking block descriptorsAzat Khuzhin2014-04-071-24/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With EXT4FS_DEBUG ext4_count_free_clusters() will call ext4_read_block_bitmap() without s_group_info initialized, so we need to initialize multi-block allocator before. And dependencies that must be solved, to allow this: - multi-block allocator needs in group descriptors - need to install s_op before initializing multi-block allocator, because in ext4_mb_init_backend() new inode is created. - initialize number of group desc blocks (s_gdb_count) otherwise number of clusters returned by ext4_free_clusters_after_init() is not correct. (see ext4_bg_num_gdb_nometa()) Here is the stack backtrace: (gdb) bt #0 ext4_get_group_info (group=0, sb=0xffff880079a10000) at ext4.h:2430 #1 ext4_validate_block_bitmap (sb=sb@entry=0xffff880079a10000, desc=desc@entry=0xffff880056510000, block_group=block_group@entry=0, bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:358 #2 0xffffffff81232202 in ext4_wait_block_bitmap (sb=sb@entry=0xffff880079a10000, block_group=block_group@entry=0, bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:476 #3 0xffffffff81232eaf in ext4_read_block_bitmap (sb=sb@entry=0xffff880079a10000, block_group=block_group@entry=0) at balloc.c:489 #4 0xffffffff81232fc0 in ext4_count_free_clusters (sb=sb@entry=0xffff880079a10000) at balloc.c:665 #5 0xffffffff81259ffa in ext4_check_descriptors (first_not_zeroed=<synthetic pointer>, sb=0xffff880079a10000) at super.c:2143 #6 ext4_fill_super (sb=sb@entry=0xffff880079a10000, data=<optimized out>, data@entry=0x0 <irq_stack_union>, silent=silent@entry=0) at super.c:3851 ... Signed-off-by: Azat Khuzhin <a3at.mail@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: optimize Hurd tests when reading/writing inodesTheodore Ts'o2014-03-241-0/+10
| | | | | | | | | | | Set a in-memory superblock flag to indicate whether the file system is designed to support the Hurd. Also, add a sanity check to make sure the 64-bit feature is not set for Hurd file systems, since i_file_acl_high conflicts with a Hurd-specific field. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: each filesystem creates and uses its own mb_cacheT Makphaibulchoke2014-03-191-8/+17
| | | | | | | | | | | | | | | This patch adds new interfaces to create and destory cache, ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and remove the cache creation and destory calls from ex4_init_xattr() and ext4_exitxattr() in fs/ext4/xattr.c. fs/ext4/super.c has been changed so that when a filesystem is mounted a cache is allocated and attched to its ext4_sb_info structure. fs/mbcache.c has been changed so that only one slab allocator is allocated and used by all mbcache structures. Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
* ext4: only call sync_filesystm() when remounting read-onlyTheodore Ts'o2014-03-141-2/+3
| | | | | | This is the only time it is required for ext4. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* fs: push sync_filesystem() down to the file system's remount_fs()Theodore Ts'o2014-03-131-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the no-op "mount -o mount /dev/xxx" operation when the file system is already mounted read-write causes an implied, unconditional syncfs(). This seems pretty stupid, and it's certainly documented or guaraunteed to do this, nor is it particularly useful, except in the case where the file system was mounted rw and is getting remounted read-only. However, it's possible that there might be some file systems that are actually depending on this behavior. In most file systems, it's probably fine to only call sync_filesystem() when transitioning from read-write to read-only, and there are some file systems where this is not needed at all (for example, for a pseudo-filesystem or something like romfs). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Jan Kara <jack@suse.cz> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Anders Larsen <al@alarsen.net> Cc: Phillip Lougher <phillip@squashfs.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: xfs@oss.sgi.com Cc: linux-btrfs@vger.kernel.org Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: codalist@coda.cs.cmu.edu Cc: linux-ext4@vger.kernel.org Cc: linux-f2fs-devel@lists.sourceforge.net Cc: fuse-devel@lists.sourceforge.net Cc: cluster-devel@redhat.com Cc: linux-mtd@lists.infradead.org Cc: jfs-discussion@lists.sourceforge.net Cc: linux-nfs@vger.kernel.org Cc: linux-nilfs@vger.kernel.org Cc: linux-ntfs-dev@lists.sourceforge.net Cc: ocfs2-devel@oss.oracle.com Cc: reiserfs-devel@vger.kernel.org
* ext4: Add __init marking to init_inodecacheFabian Frederick2014-02-181-1/+1
| | | | | | | init_inodecache is only called by __init init_ext4_fs. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: don't try to modify s_flags if the the file system is read-onlyTheodore Ts'o2014-02-121-7/+13
| | | | | | | | | | | | If an ext4 file system is created by some tool other than mke2fs (perhaps by someone who has a pathalogical fear of the GPL) that doesn't set one or the other of the EXT2_FLAGS_{UN}SIGNED_HASH flags, and that file system is then mounted read-only, don't try to modify the s_flags field. Otherwise, if dm_verity is in use, the superblock will change, causing an dm_verity failure. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
* ext4: Do not reserve clusters when fs doesn't support extentsJan Kara2013-12-091-4/+13
| | | | | | | | | | | | | | | | | | When the filesystem doesn't support extents (like in ext2/3 compatibility modes), there is no need to reserve any clusters. Space estimates for writing are exact, hole punching doesn't need new metadata, and there are no unwritten extents to convert. This fixes a problem when filesystem still having some free space when accessed with a native ext2/3 driver suddently reports ENOSPC when accessed with ext4 driver. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
* ext4: fix del_timer() misuse for ->s_err_reportAl Viro2013-12-091-2/+2
| | | | | | | | | | | | | | That thing should be del_timer_sync(); consider what happens if ext4_put_super() call of del_timer() happens to come just as it's getting run on another CPU. Since that timer reschedules itself to run next day, you are pretty much guaranteed that you'll end up with kfree'd scheduled timer, with usual fun consequences. AFAICS, that's -stable fodder all way back to 2010... [the second del_timer_sync() is almost certainly not needed, but it doesn't hurt either] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
* ext4: use prandom_u32() instead of get_random_bytes()Theodore Ts'o2013-11-081-5/+2Star
| | | | | | | | | Many of the uses of get_random_bytes() do not actually need cryptographically secure random numbers. Replace those uses with a call to prandom_u32(), which is faster and which doesn't consume entropy from the /dev/random driver. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: add ratelimiting to ext4 messagesTheodore Ts'o2013-10-181-58/+94
| | | | | | | | | | | | | | In the case of a storage device that suddenly disappears, or in the case of significant file system corruption, this can result in a huge flood of messages being sent to the console. This can overflow the file system containing /var/log/messages, or if a serial console is configured, this can slow down the system so much that a hardware watchdog can end up triggering forcing a system reboot. Google-Bug-Id: 7258357 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* Merge branch 'for-linus' of ↵Linus Torvalds2013-09-061-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "The usual trivial updates all over the tree -- mostly typo fixes and documentation updates" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits) doc: Documentation/cputopology.txt fix typo treewide: Convert retrun typos to return Fix comment typo for init_cma_reserved_pageblock Documentation/trace: Correcting and extending tracepoint documentation mm/hotplug: fix a typo in Documentation/memory-hotplug.txt power: Documentation: Update s2ram link doc: fix a typo in Documentation/00-INDEX Documentation/printk-formats.txt: No casts needed for u64/s64 doc: Fix typo "is is" in Documentations treewide: Fix printks with 0x%# zram: doc fixes Documentation/kmemcheck: update kmemcheck documentation doc: documentation/hwspinlock.txt fix typo PM / Hibernate: add section for resume options doc: filesystems : Fix typo in Documentations/filesystems scsi/megaraid fixed several typos in comments ppc: init_32: Fix error typo "CONFIG_START_KERNEL" treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks page_isolation: Fix a comment typo in test_pages_isolated() doc: fix a typo about irq affinity ...
| * treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacksJoe Perches2013-08-201-2/+2
| | | | | | | | | | | | | | | | | | | | | | Don't emit OOM warnings when k.alloc calls fail when there there is a v.alloc immediately afterwards. Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | Merge branch 'for-linus' of ↵Linus Torvalds2013-09-051-16/+0Star
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "Unfortunately, this merge window it'll have a be a lot of small piles - my fault, actually, for not keeping #for-next in anything that would resemble a sane shape ;-/ This pile: assorted fixes (the first 3 are -stable fodder, IMO) and cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last components) + several long-standing patches from various folks. There definitely will be a lot more (starting with Miklos' check_submount_and_drop() series)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) direct-io: Handle O_(D)SYNC AIO direct-io: Implement generic deferred AIO completions add formats for dentry/file pathnames kvm eventfd: switch to fdget powerpc kvm: use fdget switch fchmod() to fdget switch epoll_ctl() to fdget switch copy_module_from_fd() to fdget git simplify nilfs check for busy subtree ibmasmfs: don't bother passing superblock when not needed don't pass superblock to hypfs_{mkdir,create*} don't pass superblock to hypfs_diag_create_files don't pass superblock to hypfs_vm_create_files() oprofile: get rid of pointless forward declarations of struct super_block oprofilefs_create_...() do not need superblock argument oprofilefs_mkdir() doesn't need superblock argument don't bother with passing superblock to oprofile_create_stats_files() oprofile: don't bother with passing superblock to ->create_files() don't bother passing sb to oprofile_create_files() coh901318: don't open-code simple_read_from_buffer() ...
| * | direct-io: Implement generic deferred AIO completionsChristoph Hellwig2013-09-041-16/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support to the core direct-io code to defer AIO completions to user context using a workqueue. This replaces opencoded and less efficient code in XFS and ext4 (we save a memory allocation for each direct IO) and will be needed to properly support O_(D)SYNC for AIO. The communication between the filesystem and the direct I/O code requires a new buffer head flag, which is a bit ugly but not avoidable until the direct I/O code stops abusing the buffer_head structure for communicating with the filesystems. Currently this creates a per-superblock unbound workqueue for these completions, which is taken from an earlier patch by Jan Kara. I'm not really convinced about this use and would prefer a "normal" global workqueue with a high concurrency limit, but this needs further discussion. JK: Fixed ext4 part, dynamic allocation of the workqueue. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | ext4: allow specifying external journal by pathname mount optionEric Sandeen2013-08-291-3/+44
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's always been a hassle that if an external journal's device number changes, the filesystem won't mount. And since boot-time enumeration can change, device number changes aren't unusual. The current mechanism to update the journal location is by passing in a mount option w/ a new devnum, but that's a hassle; it's a manual approach, fixing things after the fact. Adding a mount option, "-o journal_path=/dev/$DEVICE" would help, since then we can do i.e. # mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ... and it'll mount even if the devnum has changed, as shown here: # losetup /dev/loop0 journalfile # mke2fs -L mylabel-journal -O journal_dev /dev/loop0 # mkfs.ext4 -L mylabel -J device=/dev/loop0 /dev/sdb1 Change the journal device number: # losetup -d /dev/loop0 # losetup /dev/loop1 journalfile And today it will fail: # mount /dev/sdb1 /mnt/test mount: wrong fs type, bad option, bad superblock on /dev/sdb1, missing codepage or helper program, or other error In some cases useful info is found in syslog - try dmesg | tail or so # dmesg | tail -n 1 [17343.240702] EXT4-fs (sdb1): error: couldn't read superblock of external journal But with this new mount option, we can specify the new path: # mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test # (which does update the encoded device number, incidentally): # umount /dev/sdb1 # dumpe2fs -h /dev/sdb1 | grep "Journal device" dumpe2fs 1.41.12 (17-May-2010) Journal device: 0x0701 But best of all we can just always mount by journal-path, and it'll always work: # mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test # So the journal_path option can be specified in fstab, and as long as the disk is available somewhere, and findable by label (or by UUID), we can mount. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
* | ext4: fix mount/remount error messages for incompatible mount optionsPiotr Sarna2013-08-091-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 5688978 ("ext4: improve handling of conflicting mount options") introduced incorrect messages shown while choosing wrong mount options. First of all, both cases of incorrect mount options, "data=journal,delalloc" and "data=journal,dioread_nolock" result in the same error message. Secondly, the problem above isn't solved for remount option: the mismatched parameter is simply ignored. Moreover, ext4_msg states that remount with options "data=journal,delalloc" succeeded, which is not true. To fix it up, I added a simple check after parse_options() call to ensure that data=journal and delalloc/dioread_nolock parameters are not present at the same time. Signed-off-by: Piotr Sarna <p.sarna@partner.samsung.com> Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
* | ext4: allow the mount options nodelalloc and data=journalTheodore Ts'o2013-08-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Commit 26092bf ("ext4: use a table-driven handler for mount options") wrongly disallows the specifying the mount options nodelalloc and data=journal simultaneously. This is incorrect; it should have only disallowed the combination of delalloc and data=journal simultaneously. Reported-by: Piotr Sarna <p.sarna@partner.samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
* | ext4: destroy ext4_es_cachep on module unloadEric Sandeen2013-07-261-0/+1
|/ | | | | | | | | | Without this, module can't be reloaded. [ 500.521980] kmem_cache_sanity_check (ext4_extent_status): Cache name already exists. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org # v3.8+
* ext4: don't show usrquota/grpquota twice in /proc/mountsTheodore Ts'o2013-07-121-6/+0Star
| | | | | | | | | | | | | | We now print mount options in a generic fashion in ext4_show_options(), so we shouldn't be explicitly printing the {usr,grp}quota options in ext4_show_quota_options(). Without this patch, /proc/mounts can look like this: /dev/vdb /vdb ext4 rw,relatime,quota,usrquota,data=ordered,usrquota 0 0 ^^^^^^^^ ^^^^^^^^ Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
* ext4: fix ext4_get_group_number()Theodore Ts'o2013-07-061-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function ext4_get_group_number() was introduced as an optimization in commit bd86298e60b8. Unfortunately, this commit incorrectly calculate the group number for file systems with a 1k block size (when s_first_data_block is 1 instead of zero). This could cause the following kernel BUG: [ 568.877799] ------------[ cut here ]------------ [ 568.877833] kernel BUG at fs/ext4/mballoc.c:3728! [ 568.877840] Oops: Exception in kernel mode, sig: 5 [#1] [ 568.877845] SMP NR_CPUS=32 NUMA pSeries [ 568.877852] Modules linked in: binfmt_misc [ 568.877861] CPU: 1 PID: 3516 Comm: fs_mark Not tainted 3.10.0-03216-g7c6809f-dirty #1 [ 568.877867] task: c0000001fb0b8000 ti: c0000001fa954000 task.ti: c0000001fa954000 [ 568.877873] NIP: c0000000002f42a4 LR: c0000000002f4274 CTR: c000000000317ef8 [ 568.877879] REGS: c0000001fa956ed0 TRAP: 0700 Not tainted (3.10.0-03216-g7c6809f-dirty) [ 568.877884] MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 24000428 XER: 00000000 [ 568.877902] SOFTE: 1 [ 568.877905] CFAR: c0000000002b5464 [ 568.877908] GPR00: 0000000000000001 c0000001fa957150 c000000000c6a408 c0000001fb588000 GPR04: 0000000000003fff c0000001fa9571c0 c0000001fa9571c4 000138098c50625f GPR08: 1301200000000000 0000000000000002 0000000000000001 0000000000000000 GPR12: 0000000024000422 c00000000f33a300 0000000000008000 c0000001fa9577f0 GPR16: c0000001fb7d0100 c000000000c29190 c0000000007f46e8 c000000000a14672 GPR20: 0000000000000001 0000000000000008 ffffffffffffffff 0000000000000000 GPR24: 0000000000000100 c0000001fa957278 c0000001fdb2bc78 c0000001fa957288 GPR28: 0000000000100100 c0000001fa957288 c0000001fb588000 c0000001fdb2bd10 [ 568.877993] NIP [c0000000002f42a4] .ext4_mb_release_group_pa+0xec/0x1c0 [ 568.877999] LR [c0000000002f4274] .ext4_mb_release_group_pa+0xbc/0x1c0 [ 568.878004] Call Trace: [ 568.878008] [c0000001fa957150] [c0000000002f4274] .ext4_mb_release_group_pa+0xbc/0x1c0 (unreliable) [ 568.878017] [c0000001fa957200] [c0000000002fb070] .ext4_mb_discard_lg_preallocations+0x394/0x444 [ 568.878025] [c0000001fa957340] [c0000000002fb45c] .ext4_mb_release_context+0x33c/0x734 [ 568.878032] [c0000001fa957440] [c0000000002fbcf8] .ext4_mb_new_blocks+0x4a4/0x5f4 [ 568.878039] [c0000001fa957510] [c0000000002ef56c] .ext4_ext_map_blocks+0xc28/0x1178 [ 568.878047] [c0000001fa957640] [c0000000002c1a94] .ext4_map_blocks+0x2c8/0x490 [ 568.878054] [c0000001fa957730] [c0000000002c536c] .ext4_writepages+0x738/0xc60 [ 568.878062] [c0000001fa957950] [c000000000168a78] .do_writepages+0x5c/0x80 [ 568.878069] [c0000001fa9579d0] [c00000000015d1c4] .__filemap_fdatawrite_range+0x88/0xb0 [ 568.878078] [c0000001fa957aa0] [c00000000015d23c] .filemap_write_and_wait_range+0x50/0xfc [ 568.878085] [c0000001fa957b30] [c0000000002b8edc] .ext4_sync_file+0x220/0x3c4 [ 568.878092] [c0000001fa957be0] [c0000000001f849c] .vfs_fsync_range+0x64/0x80 [ 568.878098] [c0000001fa957c70] [c0000000001f84f0] .vfs_fsync+0x38/0x4c [ 568.878105] [c0000001fa957d00] [c0000000001f87f4] .do_fsync+0x54/0x90 [ 568.878111] [c0000001fa957db0] [c0000000001f8894] .SyS_fsync+0x28/0x3c [ 568.878120] [c0000001fa957e30] [c000000000009c88] syscall_exit+0x0/0x7c [ 568.878125] Instruction dump: [ 568.878130] 60000000 813d0034 81610070 38000000 7f8b4800 419e001c 813f007c 7d2bfe70 [ 568.878144] 7d604a78 7c005850 54000ffe 7c0007b4 <0b000000> e8a10076 e87f0090 7fa4eb78 [ 568.878160] ---[ end trace 594d911d9654770b ]--- In addition fix the STD_GROUP optimization so that it works for bigalloc file systems as well. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Li Zhong <lizhongfs@gmail.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Cc: stable@vger.kernel.org # 3.10
* ext4: reduce object size when !CONFIG_PRINTKJoe Perches2013-07-011-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | Reduce the object size ~10% could be useful for embedded systems. Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and arguments, passing " " to functions when !CONFIG_PRINTK and still verifying format and arguments with no_printk. $ size fs/ext4/built-in.o* text data bss dec hex filename 239375 610 888 240873 3ace9 fs/ext4/built-in.o.new 264167 738 888 265793 40e41 fs/ext4/built-in.o.old $ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config # CONFIG_PRINTK is not set CONFIG_EXT4_FS=y CONFIG_EXT4_USE_FOR_EXT23=y CONFIG_EXT4_FS_POSIX_ACL=y # CONFIG_EXT4_FS_SECURITY is not set # CONFIG_EXT4_DEBUG is not set Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: improve extent cache shrink mechanism to avoid to burn CPU timeZheng Liu2013-07-011-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now we maintain an proper in-order LRU list in ext4 to reclaim entries from extent status tree when we are under heavy memory pressure. For keeping this order, a spin lock is used to protect this list. But this lock burns a lot of CPU time. We can use the following steps to trigger it. % cd /dev/shm % dd if=/dev/zero of=ext4-img bs=1M count=2k % mkfs.ext4 ext4-img % mount -t ext4 -o loop ext4-img /mnt % cd /mnt % for ((i=0;i<160;i++)); do truncate -s 64g $i; done % for ((i=0;i<160;i++)); do cp $i /dev/null &; done % perf record -a -g % perf report This commit tries to fix this problem. Now a new member called i_touch_when is added into ext4_inode_info to record the last access time for an inode. Meanwhile we never need to keep a proper in-order LRU list. So this can avoid to burns some CPU time. When we try to reclaim some entries from extent status tree, we use list_sort() to get a proper in-order list. Then we traverse this list to discard some entries. In ext4_sb_info, we use s_es_last_sorted to record the last time of sorting this list. When we traverse the list, we skip the inode that is newer than this time, and move this inode to the tail of LRU list. When the head of the list is newer than s_es_last_sorted, we will sort the LRU list again. In this commit, we break the loop if s_extent_cache_cnt == 0 because that means that all extents in extent status tree have been reclaimed. Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is changed to save a local variable in these functions. Reported-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: delete unused variablesJon Ernst2013-06-171-2/+0Star
| | | | | | | This patch removed several unused variables. Signed-off-by: Jon Ernst <jonernst07@gmx.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: Fix fsync error handling after filesystem abortDmitry Monakhov2013-06-131-1/+11
| | | | | | | | | | | | | | If filesystem was aborted after inode's write back is complete but before its metadata was updated we may return success results in data loss. In order to handle fs abort correctly we have to check fs state once we discover that it is in MS_RDONLY state Test case: http://patchwork.ozlabs.org/patch/244297 Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: fix data integrity for ext4_sync_fsDmitry Monakhov2013-06-131-1/+35
| | | | | | | | | | | Inode's data or non journaled quota may be written w/o jounral so we _must_ send a barrier at the end of ext4_sync_fs. But it can be skipped if journal commit will do it for us. Also fix data integrity for nojournal mode. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: split extent conversion lists to reserved & unreserved partsJan Kara2013-06-041-11/+27
| | | | | | | | | | | | Now that we have extent conversions with reserved transaction, we have to prevent extent conversions without reserved transaction (from DIO code) to block these (as that would effectively void any transaction reservation we did). So split lists, work items, and work queues to reserved and unreserved parts. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: deprecate max_writeback_mb_bump sysfs attributeJan Kara2013-06-041-6/+24
| | | | | | | | | This attribute is now unused so deprecate it. We still show the old default value to keep some compatibility but we don't allow writing to that attribute anymore. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: suppress ext4 orphan messages on mountPaul Taysom2013-05-281-6/+8
| | | | | | | | | | | | | | | | Suppress the messages releating to processing the ext4 orphan list ("truncating inode" and "deleting unreferenced inode") unless the debug option is on, since otherwise they end up taking up space in the log that could be used for more useful information. Tested by opening several files, unlinking them, then crashing the system, rebooting the system and examining /var/log/messages. Addresses the problem described in http://crbug.com/220976 Signed-off-by: Paul Taysom <taysom@chromium.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: truncate_inode_pages() in orphan cleanup pathLukas Czerner2013-05-281-0/+1
| | | | | | | | | | | | | Currently we do not tell mm to zero out tail of the page before truncate in orphan_cleanup(). This is ok, because the page should not be uptodate, however this may eventually change and I might cause problems. Call truncate_inode_pages() as precautionary measure. Thanks Jan Kara for pointing this out. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* make blkdev_put() return voidAl Viro2013-05-071-7/+4Star
| | | | | | | | same story as with the previous patches - note that return value of blkdev_close() is lost, since there's nowhere the caller (__fput()) could return it to. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of ↵Linus Torvalds2013-05-021-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS updates from Al Viro, Misc cleanups all over the place, mainly wrt /proc interfaces (switch create_proc_entry to proc_create(), get rid of the deprecated create_proc_read_entry() in favor of using proc_create_data() and seq_file etc). 7kloc removed. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits) don't bother with deferred freeing of fdtables proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h proc: Make the PROC_I() and PDE() macros internal to procfs proc: Supply a function to remove a proc entry by PDE take cgroup_open() and cpuset_open() to fs/proc/base.c ppc: Clean up scanlog ppc: Clean up rtas_flash driver somewhat hostap: proc: Use remove_proc_subtree() drm: proc: Use remove_proc_subtree() drm: proc: Use minor->index to label things, not PDE->name drm: Constify drm_proc_list[] zoran: Don't print proc_dir_entry data in debug reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show() proc: Supply an accessor for getting the data from a PDE's parent airo: Use remove_proc_subtree() rtl8192u: Don't need to save device proc dir PDE rtl8187se: Use a dir under /proc/net/r8180/ proc: Add proc_mkdir_data() proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h} proc: Move PDE_NET() to fs/proc/proc_net.c ...
| * procfs: new helper - PDE_DATA(inode)Al Viro2013-04-091-1/+1
| | | | | | | | | | | | | | | | | | | | The only part of proc_dir_entry the code outside of fs/proc really cares about is PDE(inode)->data. Provide a helper for that; static inline for now, eventually will be moved to fs/proc, along with the knowledge of struct proc_dir_entry layout. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | ext4: mark all metadata I/O with REQ_METATheodore Ts'o2013-04-201-1/+1
| | | | | | | | | | | | | | | | As Dave Chinner pointed out at the 2013 LSF/MM workshop, it's important that metadata I/O requests are marked as such to avoid priority inversions caused by I/O bandwidth throttling. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* | ext4: Use kstrtoul() instead of parse_strtoul()Lukas Czerner2013-04-121-18/+9Star
| | | | | | | | | | | | | | | | In parse_strtoul() we're still using deprecated simple_strtoul(). Remove parse_strtoul() altogether and replace it with kstrtoul() Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* | ext4: fix miscellaneous big endian warningsTheodore Ts'o2013-04-101-3/+3
| | | | | | | | | | | | None of these result in any bug, but they makes sparse complain. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>