summaryrefslogtreecommitdiffstats
path: root/fs/ext4
Commit message (Collapse)AuthorAgeFilesLines
* ext4: fix copy paste error in ext4_swap_extents()Maninder Singh2017-08-061-1/+1
| | | | | | | | | This bug was found by a static code checker tool for copy paste problems. Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Vaneet Narang <v.narang@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix overflow caused by missing cast in ext4_resize_fs()Jerry Lee2017-08-061-1/+2
| | | | | | | | | | | On a 32-bit platform, the value of n_blcoks_count may be wrong during the file system is resized to size larger than 2^32 blocks. This may caused the superblock being corrupted with zero blocks count. Fixes: 1c6bd7173d66 Signed-off-by: Jerry Lee <jerrylee@qnap.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org # 3.7+
* ext4, project: expand inode extra size if possibleMiao Xie2017-08-063-24/+85
| | | | | | | | | | | | | When upgrading from old format, try to set project id to old file first time, it will return EOVERFLOW, but if that file is dirtied(touch etc), changing project id will be allowed, this might be confusing for users, we could try to expand @i_extra_isize here too. Reported-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: cleanup ext4_expand_extra_isize_ea()Miao Xie2017-08-061-9/+5Star
| | | | | | | | Clean up some goto statement, make ext4_expand_extra_isize_ea() clearer. Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Wang Shilong <wshilong@ddn.com>
* ext4: restructure ext4_expand_extra_isizeMiao Xie2017-08-062-40/+36Star
| | | | | | | | | | | | | Current ext4_expand_extra_isize just tries to expand extra isize, if someone is holding xattr lock or some check fails, it will give up. So rename its name to ext4_try_to_expand_extra_isize. Besides that, we clean up unnecessary check and move some relative checks into it. Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Wang Shilong <wshilong@ddn.com>
* ext4: fix forgetten xattr lock protection in ext4_expand_extra_isizeMiao Xie2017-08-062-12/+16
| | | | | | | | | We should avoid the contention between the i_extra_isize update and the inline data insertion, so move the xattr trylock in front of i_extra_isize update. Signed-off-by: Miao Xie <miaoxie@huawei.com> Reviewed-by: Wang Shilong <wshilong@ddn.com>
* ext4: make xattr inode reads fasterTahsin Erdogan2017-08-064-48/+92
| | | | | | | | | | | | | ext4_xattr_inode_read() currently reads each block sequentially while waiting for io operation to complete before moving on to the next block. This prevents request merging in block layer. Add a ext4_bread_batch() function that starts reads for all blocks then optionally waits for them to complete. A similar logic is used in ext4_find_entry(), so update that code to use the new function. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: inplace xattr block update fails to deduplicate blocksTahsin Erdogan2017-08-061-3/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an xattr block has a single reference, block is updated inplace and it is reinserted to the cache. Later, a cache lookup is performed to see whether an existing block has the same contents. This cache lookup will most of the time return the just inserted entry so deduplication is not achieved. Running the following test script will produce two xattr blocks which can be observed in "File ACL: " line of debugfs output: mke2fs -b 1024 -I 128 -F -O extent /dev/sdb 1G mount /dev/sdb /mnt/sdb touch /mnt/sdb/{x,y} setfattr -n user.1 -v aaa /mnt/sdb/x setfattr -n user.2 -v bbb /mnt/sdb/x setfattr -n user.1 -v aaa /mnt/sdb/y setfattr -n user.2 -v bbb /mnt/sdb/y debugfs -R 'stat x' /dev/sdb | cat debugfs -R 'stat y' /dev/sdb | cat This patch defers the reinsertion to the cache so that we can locate other blocks with the same contents. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
* ext4: remove unused mode parameterTahsin Erdogan2017-08-061-5/+4Star
| | | | | | | ext4_alloc_file_blocks() does not use its mode parameter. Remove it. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix warning about stack corruptionArnd Bergmann2017-08-061-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | After commit 62d1034f53e3 ("fortify: use WARN instead of BUG for now"), we get a warning about possible stack overflow from a memcpy that was not strictly bounded to the size of the local variable: inlined from 'ext4_mb_seq_groups_show' at fs/ext4/mballoc.c:2322:2: include/linux/string.h:309:9: error: '__builtin_memcpy': writing between 161 and 1116 bytes into a region of size 160 overflows the destination [-Werror=stringop-overflow=] We actually had a bug here that would have been found by the warning, but it was already fixed last year in commit 30a9d7afe70e ("ext4: fix stack memory corruption with 64k block size"). This replaces the fixed-length structure on the stack with a variable-length structure, using the correct upper bound that tells the compiler that everything is really fine here. I also change the loop count to check for the same upper bound for consistency, but the existing code is already correct here. Note that while clang won't allow certain kinds of variable-length arrays in structures, this particular instance is fine, as the array is at the end of the structure, and the size is strictly bounded. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix dir_nlink behaviourAndreas Dilger2017-08-062-10/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The dir_nlink feature has been enabled by default for new ext4 filesystems since e2fsprogs-1.41 in 2008, and was automatically enabled by the kernel for older ext4 filesystems since the dir_nlink feature was added with ext4 in kernel 2.6.28+ when the subdirectory count exceeded EXT4_LINK_MAX-1. Automatically adding the file system features such as dir_nlink is generally frowned upon, since it could cause the file system to not be mountable on older kernel, thus preventing the administrator from rolling back to an older kernel if necessary. In this case, the administrator might also want to disable the feature because glibc's fts_read() function does not correctly optimize directory traversal for directories that use st_nlinks field of 1 to indicate that the number of links in the directory are not tracked by the file system, and could fail to traverse the full directory hierarchy. Fortunately, in the past ten years very few users have complained about incomplete file system traversal by glibc's fts_read(). This commit also changes ext4_inc_count() to allow i_nlinks to reach the full EXT4_LINK_MAX links on the parent directory (including "." and "..") before changing i_links_count to be 1. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=196405 Signed-off-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: silence array overflow warningDan Carpenter2017-08-061-1/+1
| | | | | | | | | | | | | | I get a static checker warning: fs/ext4/ext4.h:3091 ext4_set_de_type() error: buffer overflow 'ext4_type_by_mode' 15 <= 15 It seems unlikely that we would hit this read overflow in real life, but it's also simple enough to make the array 16 bytes instead of 15. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesizeJan Kara2017-08-051-0/+3
| | | | | | | | | | | | | | | | | | | | | ext4_find_unwritten_pgoff() does not properly handle a situation when starting index is in the middle of a page and blocksize < pagesize. The following command shows the bug on filesystem with 1k blocksize: xfs_io -f -c "falloc 0 4k" \ -c "pwrite 1k 1k" \ -c "pwrite 3k 1k" \ -c "seek -a -r 0" foo In this example, neither lseek(fd, 1024, SEEK_HOLE) nor lseek(fd, 2048, SEEK_DATA) will return the correct result. Fix the problem by neglecting buffers in a page before starting offset. Reported-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> CC: stable@vger.kernel.org # 3.8+
* ext4: release discard bio after sending discard commandsDaeho Jeong2017-08-051-1/+3
| | | | | | | | | | | We've changed the discard command handling into parallel manner. But, in this change, I forgot decreasing the usage count of the bio which was used to send discard request. I'm sorry about that. Fixes: a015434480dc ("ext4: send parallel discards on commit completions") Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* ext4: convert swap_inode_data() over to use swap() on most of the fieldsJeff Layton2017-07-311-10/+8Star
| | | | | | | | | | For some odd reason, it forces a byte-by-byte copy of each field. A plain old swap() on most of these fields would be more efficient. We do need to retain the memswap of i_data however as that field is an array. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
* ext4: error should be cleared if ea_inode isn't added to the cacheEmoly Liu2017-07-311-0/+1
| | | | | | | | | | | | | | | | For Lustre, if ea_inode fails in hash validation but passes parent inode and generation checks, it won't be added to the cache as well as the error "-EFSCORRUPTED" should be cleared, otherwise it will cause "Structure needs cleaning" when running getfattr command. Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-9723 Cc: stable@vger.kernel.org Fixes: dec214d00e0d78a08b947d7dccdfdb84407a9f4d Signed-off-by: Emoly Liu <emoly.liu@intel.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: tahsin@google.com
* ext4: Don't clear SGID when inheriting ACLsJan Kara2017-07-311-13/+15
| | | | | | | | | | | | | | | | | | | | When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit set, DIR1 is expected to have SGID bit set (and owning group equal to the owning group of 'DIR0'). However when 'DIR0' also has some default ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on 'DIR1' to get cleared if user is not member of the owning group. Fix the problem by moving posix_acl_update_mode() out of __ext4_set_acl() into ext4_set_acl(). That way the function will not be called when inheriting ACLs which is what we want as it prevents SGID bit clearing and the mode has been properly set by posix_acl_create() anyway. Fixes: 073931017b49d9458aa351605b43a7e34598caef CC: stable@vger.kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
* ext4: preserve i_mode if __ext4_set_acl() failsErnesto A. Fernández2017-07-311-4/+11
| | | | | | | | | | | | | | | | | When changing a file's acl mask, __ext4_set_acl() will first set the group bits of i_mode to the value of the mask, and only then set the actual extended attribute representing the new acl. If the second part fails (due to lack of space, for example) and the file had no acl attribute to begin with, the system will from now on assume that the mask permission bits are actual group permission bits, potentially granting access to the wrong users. Prevent this by only changing the inode mode after the acl has been set. Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* ext4: remove unused metadata accounting variablesEric Whitney2017-07-312-6/+2Star
| | | | | | | | | | | | Two variables in ext4_inode_info, i_reserved_meta_blocks and i_allocated_meta_blocks, are unused. Removing them saves a little memory per in-memory inode and cleans up clutter in several tracepoints. Adjust tracepoint output from ext4_alloc_da_blocks() for consistency and fix a typo and whitespace near these changes. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* ext4: correct comment references to ext4_ext_direct_IO()Eric Whitney2017-07-311-2/+2
| | | | | | | | | | | Commit 914f82a32d0268847 "ext4: refactor direct IO code" deleted ext4_ext_direct_IO(), but references to that function remain in comments. Update them to refer to ext4_direct_IO_write(). Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Jan Kara <jack@suse.cz>
* Merge tag 'ext4_for_linus' of ↵Linus Torvalds2017-07-0920-441/+1994
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "The first major feature for ext4 this merge window is the largedir feature, which allows ext4 directories to support over 2 billion directory entries (assuming ~64 byte file names; in practice, users will run into practical performance limits first.) This feature was originally written by the Lustre team, and credit goes to Artem Blagodarenko from Seagate for getting this feature upstream. The second major major feature allows ext4 to support extended attribute values up to 64k. This feature was also originally from Lustre, and has been enhanced by Tahsin Erdogan from Google with a deduplication feature so that if multiple files have the same xattr value (for example, Windows ACL's stored by Samba), only one copy will be stored on disk for encoding and caching efficiency. We also have the usual set of bug fixes, cleanups, and optimizations" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits) ext4: fix spelling mistake: "prellocated" -> "preallocated" ext4: fix __ext4_new_inode() journal credits calculation ext4: skip ext4_init_security() and encryption on ea_inodes fs: generic_block_bmap(): initialize all of the fields in the temp bh ext4: change fast symlink test to not rely on i_blocks ext4: require key for truncate(2) of encrypted file ext4: don't bother checking for encryption key in ->mmap() ext4: check return value of kstrtoull correctly in reserved_clusters_store ext4: fix off-by-one fsmap error on 1k block filesystems ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry() ext4: return EIO on read error in ext4_find_entry ext4: forbid encrypting root directory ext4: send parallel discards on commit completions ext4: avoid unnecessary stalls in ext4_evict_inode() ext4: add nombcache mount option ext4: strong binding of xattr inode references ext4: eliminate xattr entry e_hash recalculation for removes ext4: reserve space for xattr entries/names quota: add get_inode_usage callback to transfer multi-inode charges ext4: xattr inode deduplication ...
| * ext4: fix spelling mistake: "prellocated" -> "preallocated"Colin Ian King2017-07-061-1/+1
| | | | | | | | | | | | | | Trivial fix to spelling mistake in mb_debug debug message Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix __ext4_new_inode() journal credits calculationTahsin Erdogan2017-07-065-34/+85
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ea_inode feature allows creating extended attributes that are up to 64k in size. Update __ext4_new_inode() to pick increased credit limits. To avoid overallocating too many journal credits, update __ext4_xattr_set_credits() to make a distinction between xattr create vs update. This helps __ext4_new_inode() because all attributes are known to be new, so we can save credits that are normally needed to delete old values. Also, have fscrypt specify its maximum context size so that we don't end up allocating credits for 64k size. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: skip ext4_init_security() and encryption on ea_inodesTahsin Erdogan2017-07-061-5/+6
| | | | | | | | | | | | | | | | | | | | Extended attribute inodes are internal to ext4. Adding encryption/security related attributes on them would mean dealing with nested calls into ea code. Since they have no direct exposure to user mode, just avoid creating ea entries for them. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: change fast symlink test to not rely on i_blocksTahsin Erdogan2017-07-041-7/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ext4_inode_info->i_data is the storage area for 4 types of data: a) Extents data b) Inline data c) Block map d) Fast symlink data (symlink length < 60) Extents data case is positively identified by EXT4_INODE_EXTENTS flag. Inline data case is also obvious because of EXT4_INODE_INLINE_DATA flag. Distinguishing c) and d) however requires additional logic. This currently relies on i_blocks count. After subtracting external xattr block from i_blocks, if it is greater than 0 then we know that some data blocks exist, so there must be a block map. This logic got broken after ea_inode feature was added. That feature charges the data blocks of external xattr inodes to the referencing inode and so adds them to the i_blocks. To fix this, we could subtract ea_inode blocks by iterating through all xattr entries and then check whether remaining i_blocks count is zero. Besides being complicated, this won't change the fact that the current way of distinguishing between c) and d) is fragile. The alternative solution is to test whether i_size is less than 60 to determine fast symlink case. ext4_symlink() uses the same test to decide whether to store the symlink in i_data. There is one caveat to address before this can work though. If an inode's i_nlink is zero during eviction, its i_size is set to zero and its data is truncated. If system crashes before inode is removed from the orphan list, next boot orphan cleanup may find the inode with zero i_size. So, a symlink that had its data stored in a block may now appear to be a fast symlink. The solution used in this patch is to treat i_size = 0 as a non-fast symlink case. A zero sized symlink is not legal so the only time this can happen is the mentioned scenario. This is also logically correct because a i_size = 0 symlink has no data stored in i_data. Suggested-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
| * ext4: require key for truncate(2) of encrypted fileEric Biggers2017-06-241-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, filesystems allow truncate(2) on an encrypted file without the encryption key. However, it's impossible to correctly handle the case where the size being truncated to is not a multiple of the filesystem block size, because that would require decrypting the final block, zeroing the part beyond i_size, then encrypting the block. As other modifications to encrypted file contents are prohibited without the key, just prohibit truncate(2) as well, making it fail with ENOKEY. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: don't bother checking for encryption key in ->mmap()Eric Biggers2017-06-241-7/+0Star
| | | | | | | | | | | | | | | | | | Since only an open file can be mmap'ed, and we only allow open()ing an encrypted file when its key is available, there is no need to check for the key again before permitting each mmap(). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: check return value of kstrtoull correctly in reserved_clusters_storeChao Yu2017-06-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | kstrtoull returns 0 on success, however, in reserved_clusters_store we will return -EINVAL if kstrtoull returns 0, it makes us fail to update reserved_clusters value through sysfs. Fixes: 76d33bca5581b1dd5c3157fa168db849a784ada4 Cc: stable@vger.kernel.org # 4.4 Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix off-by-one fsmap error on 1k block filesystemsDarrick J. Wong2017-06-231-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | For 1k-block filesystems, the filesystem starts at block 1, not block 0. This fact is recorded in s_first_data_block, so use that to bump up the start_fsb before we start querying the filesystem for its space map. Without this, ext4/026 fails on 1k block ext4 because various functions (notably ext4_get_group_no_and_offset) don't know what to do with an fsblock that is "before" the start of the filesystem and return garbage results (blockgroup 2^32-1, etc.) that confuse fsmap. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry()Theodore Ts'o2017-06-231-1/+2
| | | | | | | | | | | | | | Previously a bad directory block with a bad checksum is skipped; we should be returning EFSBADCRC (aka EBADMSG). Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: return EIO on read error in ext4_find_entryKhazhismel Kumykov2017-06-231-2/+2
| | | | | | | | | | | | | | | | Previously, a read error would be ignored and we would eventually return NULL from ext4_find_entry, which signals "no such file or directory". We should be returning EIO. Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
| * ext4: forbid encrypting root directoryEric Biggers2017-06-231-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently it's possible to encrypt all files and directories on an ext4 filesystem by deleting everything, including lost+found, then setting an encryption policy on the root directory. However, this is incompatible with e2fsck because e2fsck expects to find, create, and/or write to lost+found and does not have access to any encryption keys. Especially problematic is that if e2fsck can't find lost+found, it will create it without regard for whether the root directory is encrypted. This is wrong for obvious reasons, and it causes a later run of e2fsck to consider the lost+found directory entry to be corrupted. Encrypting the root directory may also be of limited use because it is the "all-or-nothing" use case, for which dm-crypt can be used instead. (By design, encryption policies are inherited and cannot be overridden; so the root directory having an encryption policy implies that all files and directories on the filesystem have that same encryption policy.) In any case, encrypting the root directory is broken currently and must not be allowed; so start returning an error if userspace requests it. For now only do this in ext4, because f2fs and ubifs do not appear to have the lost+found requirement. We could move it into fscrypt_ioctl_set_policy() later if desired, though. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
| * ext4: send parallel discards on commit completionsDaeho Jeong2017-06-234-53/+100
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now, when we mount ext4 filesystem with '-o discard' option, we have to issue all the discard commands for the blocks to be deallocated and wait for the completion of the commands on the commit complete phase. Because this procedure might involve a lot of sequential combinations of issuing discard commands and waiting for that, the delay of this procedure might be too much long, even to 17.0s in our test, and it results in long commit delay and fsync() performance degradation. To reduce this kind of delay, instead of adding callback for each extent and handling all of them in a sequential manner on commit phase, we instead add a separate list of extents to free to the superblock and then process this list at once after transaction commits so that we can issue all the discard commands in a parallel manner like XFS filesystem. Finally, we could enhance the discard command handling performance. The result was such that 17.0s delay of a single commit in the worst case has been enhanced to 4.8s. Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Tested-by: Hobin Woo <hobin.woo@samsung.com> Tested-by: Kitae Lee <kitae87.lee@samsung.com> Reviewed-by: Jan Kara <jack@suse.cz>
| * ext4: avoid unnecessary stalls in ext4_evict_inode()Jan Kara2017-06-231-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These days inode reclaim calls evict_inode() only when it has no pages in the mapping. In that case it is not necessary to wait for transaction commit in ext4_evict_inode() as there can be no pages waiting to be committed. So avoid unnecessary transaction waiting in that case. We still have to keep the check for the case where ext4_evict_inode() gets called from other paths (e.g. umount) where inode still can have some page cache pages. Reported-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: add nombcache mount optionTahsin Erdogan2017-06-223-28/+59
| | | | | | | | | | | | | | | | | | | | | | | | The main purpose of mb cache is to achieve deduplication in extended attributes. In use cases where opportunity for deduplication is unlikely, it only adds overhead. Add a mount option to explicitly turn off mb cache. Suggested-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: strong binding of xattr inode referencesTahsin Erdogan2017-06-221-39/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | To verify that a xattr entry is not pointing to the wrong xattr inode, we currently check that the target inode has EXT4_EA_INODE_FL flag set and also the entry size matches the target inode size. For stronger validation, also incorporate crc32c hash of the value into the e_hash field. This is done regardless of whether the entry lives in the inode body or external attribute block. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: eliminate xattr entry e_hash recalculation for removesTahsin Erdogan2017-06-221-24/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an extended attribute block is modified, ext4_xattr_hash_entry() recalculates e_hash for the entry that is pointed by s->here. This is unnecessary if the modification is to remove an entry. Currently, if the removed entry is the last one and there are other entries remaining, hash calculation targets the just erased entry which has been filled with zeroes and effectively does nothing. If the removed entry is not the last one and there are more entries, this time it will recalculate hash on the next entry which is totally unnecessary. Fix these by moving the decision on when to recalculate hash to ext4_xattr_set_entry(). Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: reserve space for xattr entries/namesTahsin Erdogan2017-06-221-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | New ea_inode feature allows putting large xattr values into external inodes. struct ext4_xattr_entry and the attribute name however have to remain in the inode extra space or external attribute block. Once that space is exhausted, no further entries can be added. Some of that space could also be used by values that fit in there at the time of addition. So, a single xattr entry whose value barely fits in the external block could prevent further entries being added. To mitigate the problem, this patch introduces a notion of reserved space in the external attribute block that cannot be used by value data. This reserve is enforced when ea_inode feature is enabled. The amount of reserve is arbitrarily chosen to be min(block_size/8, 1024). The table below shows how much space is reserved for each block size and the guaranteed mininum number of entries that can be placed in the external attribute block. block size reserved bytes entries (name length = 16) 1k 128 3 2k 256 7 4k 512 15 8k 1024 31 16k 1024 31 32k 1024 31 64k 1024 31 Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * quota: add get_inode_usage callback to transfer multi-inode chargesTahsin Erdogan2017-06-225-10/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ext4 ea_inode feature allows storing xattr values in external inodes to be able to store values that are bigger than a block in size. Ext4 also has deduplication support for these type of inodes. With deduplication, the actual storage waste is eliminated but the users of such inodes are still charged full quota for the inodes as if there was no sharing happening in the background. This design requires ext4 to manually charge the users because the inodes are shared. An implication of this is that, if someone calls chown on a file that has such references we need to transfer the quota for the file and xattr inodes. Current dquot_transfer() function implicitly transfers one inode charge. With ea_inode feature, we would like to transfer multiple inode charges. Add get_inode_usage callback which can interrogate the total number of inodes that were charged for a given inode. [ Applied fix from Colin King to make sure the 'ret' variable is initialized on the successful return path. Detected by CoverityScan, CID#1446616 ("Uninitialized scalar variable") --tytso] Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Jan Kara <jack@suse.cz>
| * ext4: xattr inode deduplicationTahsin Erdogan2017-06-226-290/+857
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ext4 now supports xattr values that are up to 64k in size (vfs limit). Large xattr values are stored in external inodes each one holding a single value. Once written the data blocks of these inodes are immutable. The real world use cases are expected to have a lot of value duplication such as inherited acls etc. To reduce data duplication on disk, this patch implements a deduplicator that allows sharing of xattr inodes. The deduplication is based on an in-memory hash lookup that is a best effort sharing scheme. When a xattr inode is read from disk (i.e. getxattr() call), its crc32c hash is added to a hash table. Before creating a new xattr inode for a value being set, the hash table is checked to see if an existing inode holds an identical value. If such an inode is found, the ref count on that inode is incremented. On value removal the ref count is decremented and if it reaches zero the inode is deleted. The quota charging for such inodes is manually managed. Every reference holder is charged the full size as if there was no sharing happening. This is consistent with how xattr blocks are also charged. [ Fixed up journal credits calculation to handle inline data and the rare case where an shared xattr block can get freed when two thread race on breaking the xattr block sharing. --tytso ] Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: cleanup transaction restarts during inode deletionTahsin Erdogan2017-06-223-143/+184
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During inode deletion, the number of journal credits that will be needed is hard to determine. For that reason we have journal extend/restart calls in several places. Whenever a transaction is restarted, filesystem must be in a consistent state because there is no atomicity guarantee beyond a restart call. Add ext4_xattr_ensure_credits() helper function which takes care of journal extend/restart logic. It also handles getting jbd2 write access and dirty metadata calls. This function is called at every iteration of handling an ea_inode reference. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: add ext4_is_quota_file()Tahsin Erdogan2017-06-225-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | IS_NOQUOTA() indicates whether quota is disabled for an inode. Ext4 also uses it to check whether an inode is for a quota file. The distinction currently doesn't matter because quota is disabled only for the quota files. When we start disabling quota for other inodes in the future, we will want to make the distinction clear. Replace IS_NOQUOTA() call with ext4_is_quota_file() at places where we are checking for quota files. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext2, ext4: make mb block cache names more explicitTahsin Erdogan2017-06-223-44/+47
| | | | | | | | | | | | | | | | | | There will be a second mb_cache instance that tracks ea_inodes. Make existing names more explicit so that it is clear that they refer to xattr block cache. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * mbcache: make mbcache naming more genericTahsin Erdogan2017-06-221-5/+5
| | | | | | | | | | | | | | | | | | | | | | Make names more generic so that mbcache usage is not limited to block sharing. In a subsequent patch in the series ("ext4: xattr inode deduplication"), we start using the mbcache code for sharing xattr inodes. With that patch, old mb_cache_entry.e_block field could be holding either a block number or an inode number. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: move struct ext4_xattr_inode_array to xattr.hTahsin Erdogan2017-06-222-4/+5
| | | | | | | | | | | | | | | | Since this is a xattr specific data structure it is cleaner to keep it in xattr header file. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: modify ext4_xattr_ino_array to hold struct inode *Tahsin Erdogan2017-06-224-59/+53Star
| | | | | | | | | | | | | | | | | | | | Tracking struct inode * rather than the inode number eliminates the repeated ext4_xattr_inode_iget() call later. The second call cannot fail in practice but still requires explanation when it wants to ignore the return value. Avoid the trouble and make things simple. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: improve journal credit handling in set xattr pathsTahsin Erdogan2017-06-225-32/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | Both ext4_set_acl() and ext4_set_context() need to be made aware of ea_inode feature when it comes to credits calculation. Also add a sufficient credits check in ext4_xattr_set_handle() right after xattr write lock is grabbed. Original credits calculation is done outside the lock so there is a possiblity that the initially calculated credits are not sufficient anymore. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: ext4_xattr_delete_inode() should return accurate errorsTahsin Erdogan2017-06-221-6/+10
| | | | | | | | | | | | | | | | In a few places the function returns without trying to pass the actual error code to the caller. Fix those. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: retry storing value in external inode with xattr block tooTahsin Erdogan2017-06-221-17/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | When value size is <= EXT4_XATTR_MIN_LARGE_EA_SIZE(), and it doesn't fit in either inline or xattr block, a second try is made to store it in an external inode while storing the entry itself in inline area. There should also be an attempt to store the entry in xattr block. This patch adds a retry loop to do that. It also makes the caller the sole decider on whether to store a value in an external inode. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix credits calculation for xattr inodeTahsin Erdogan2017-06-221-2/+1Star
| | | | | | | | | | | | | | | | | | | | When there is no space for a value in xattr block, it may be stored in an xattr inode even if the value length is less than EXT4_XATTR_MIN_LARGE_EA_SIZE(). So the current assumption in credits calculation is wrong. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>