summaryrefslogtreecommitdiffstats
path: root/fs/ext4/extents.c
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus' of ↵Linus Torvalds2013-02-271-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle *and* ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...
| * new helper: file_inode(file)Al Viro2013-02-231-2/+2
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | ext4: remove single extent cacheZheng Liu2013-02-181-141/+38Star
| | | | | | | | | | | | | | | | | | Single extent cache could be removed because we have extent status tree as a extent cache, and it would be better. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan kara <jack@suse.cz>
* | ext4: lookup block mapping in extent status treeZheng Liu2013-02-181-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After tracking all extent status, we already have a extent cache in memory. Every time we want to lookup a block mapping, we can first try to lookup it in extent status tree to avoid a potential disk I/O. A new function called ext4_es_lookup_extent is defined to finish this work. When we try to lookup a block mapping, we always call ext4_map_blocks and/or ext4_da_map_blocks. So in these functions we first try to lookup a block mapping in extent status tree. A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks in order not to put a hole into extent status tree because this hole will be converted to delayed extent in the tree immediately. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan kara <jack@suse.cz>
* | ext4: track all extent status in extent status treeZheng Liu2013-02-181-5/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By recording the phycisal block and status, extent status tree is able to track the status of every extents. When we call _map_blocks functions to lookup an extent or create a new written/unwritten/delayed extent, this extent will be inserted into extent status tree. We don't load all extents from disk in alloc_inode() because it costs too much memory, and if a file is opened and closed frequently it will takes too much time to load all extent information. So currently when we create/lookup an extent, this extent will be inserted into extent status tree. Hence, the extent status tree may not comprehensively contain all of the extents found in the file. Here a condition we need to take care is that an extent might contains unwritten and delayed status simultaneously because an extent is delayed allocated and could be allocated by fallocate. At this time we need to keep delayed status because later we need to update delayed reservation space using it. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan kara <jack@suse.cz>
* | ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flagZheng Liu2013-02-181-1/+5
| | | | | | | | | | | | | | | | | | | | This commit lets ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag because in later commit ext4_map_blocks needs to use this flag to determine the extent status. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* | ext4: rename and improbe ext4_es_find_extent()Zheng Liu2013-02-181-5/+10
| | | | | | | | | | | | | | | | | | | | | | This commit renames ext4_es_find_extent with ext4_es_find_delayed_extent and improve this function. First, we split input and output parameter. Second, this function never return the first block of the next delayed extent after 'es'. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Jan kara <jack@suse.cz>
* | ext4: refine extent status treeZheng Liu2013-02-181-10/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit refines the extent status tree code. 1) A prefix 'es_' is added to to the extent status tree structure members. 2) Refactored es_remove_extent() so that __es_remove_extent() can be used by es_insert_extent() to remove the old extent entry(-ies) before inserting a new one. 3) Rename extent_status_end() to ext4_es_end() 4) ext4_es_can_be_merged() is define to check whether two extents can be merged or not. 5) Update and clarified comments. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* | ext4: pass context information to jbd2__journal_start()Theodore Ts'o2013-02-091-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So we can better understand what bits of ext4 are responsible for long-running jbd2 handles, use jbd2__journal_start() so we can pass context information for logging purposes. The recommended way for finding the longer-running handles is: T=/sys/kernel/debug/tracing EVENT=$T/events/jbd2/jbd2_handle_stats echo "interval > 5" > $EVENT/filter echo 1 > $EVENT/enable ./run-my-fs-benchmark cat $T/trace > /tmp/problem-handles This will list handles that were active for longer than 20ms. Having longer-running handles is bad, because a commit started at the wrong time could stall for those 20+ milliseconds, which could delay an fsync() or an O_SYNC operation. Here is an example line from the trace file describing a handle which lived on for 311 jiffies, or over 1.2 seconds: postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32 tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1 dirtied_blocks 0 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* | ext4: remove explicit WARN_ON when ext4_map_blocks() failsLukas Czerner2013-01-291-13/+11Star
| | | | | | | | | | | | | | | | | | | | | | In two places we call WARN_ON() before we print out the debug message, however we agreed that the WARN_ON() is unnecessary at those places so remove them. Also use ext4_warning() instead of ext4_msg() and printk(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* | ext4: add punching hole support for non-extent-mapped filesZheng Liu2013-01-281-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch add supports for indirect file support punching hole. It is almost the same as ext4_ext_punch_hole. First, we invalidate all pages between this hole, and then we try to deallocate all blocks of this hole. A recursive function is used to handle deallocation of blocks. In this function, it iterates over the entries in inode's i_blocks or indirect blocks, and try to free the block for each one of them. After applying this patch, xfstest #255 will not pass w/o extent because indirect-based file doesn't support unwritten extents. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* | ext4: use unlikely to improve the efficiency of the kernelWang Shilong2013-01-121-3/+3
| | | | | | | | | | | | | | | | Because the function 'sb_getblk' seldomly fails to return NULL value,it will be better to use 'unlikely' to optimize it. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* | ext4: return ENOMEM if sb_getblk() failsTheodore Ts'o2013-01-121-11/+14
|/ | | | | | | | | | The only reason for sb_getblk() failing is if it can't allocate the buffer_head. So ENOMEM is more appropriate than EIO. In addition, make sure that the file system is marked as being inconsistent if sb_getblk() fails. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
* ext4: fix extent tree corruption caused by hole punchForrest Liu2012-12-171-4/+18
| | | | | | | | | | When depth of extent tree is greater than 1, logical start value of interior node is not correctly updated in ext4_ext_rm_idx. Signed-off-by: Forrest Liu <forrestl@synology.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Ashish Sangwan <ashishsangwan2@gmail.com> Cc: stable@vger.kernel.org
* ext4: remove unused variable from ext4_ext_in_cache()Zhi Yong Wu2012-12-101-2/+0Star
| | | | | | Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Reviewed-by: Zheng Liu <gnehzuil.liu@gmail.com>
* ext4: let fallocate handle inline data correctlyTao Ma2012-12-101-0/+4
| | | | | | | | | If we are punching hole in a file, we will return ENOTSUPP. As for the fallocation of some extents, we will convert the inline data to a normal extent based file first. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: let fiemap work with inline dataTao Ma2012-12-101-0/+9
| | | | | | | | fiemap is used to find the disk layout of a file, as for inline data, let us just pretend like a file with just one extent. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: add normal write support for inline dataTao Ma2012-12-101-1/+8
| | | | | | | | | | For a normal write case (not journalled write, not delayed allocation), we write to the inline if the file is small and convert it to an extent based file when the write is larger than the max inline size. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: rationalize ext4_extents.h inclusionTheodore Ts'o2012-11-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | Previously, ext4_extents.h was being included at the end of ext4.h, which was bad for a number of reasons: (a) it was not being included in the expected place, and (b) it caused the header to be included multiple times. There were #ifdef's to prevent this from causing any problems, but it still was unnecessary. By moving the function declarations that were in ext4_extents.h to ext4.h, which is standard practice for where the function declarations for the rest of ext4.h can be found, we can remove ext4_extents.h from being included in ext4.h at all, and then we can only include ext4_extents.h where it is needed in ext4's source files. It should be possible to move a few more things into ext4.h, and further reduce the number of source files that need to #include ext4_extents.h, but that's a cleanup for another day. Reported-by: Sachin Kamat <sachin.kamat@linaro.org> Reported-by: Wei Yongjun <weiyj.lk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: simple cleanup in fiemap codepathLukas Czerner2012-11-281-16/+16
| | | | | | | | | | This commit is simple cleanup of fiemap codepath which has not been included in previous commit to make the changes clearer. In this commit we rename cbex variable to newex in ext4_fill_fiemap_extents() because callback is no longer present Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: prevent race while walking extent tree for fiemapLukas Czerner2012-11-281-60/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently ext4_ext_walk_space() only takes i_data_sem for read when searching for the extent at given block with ext4_ext_find_extent(). Then it drops the lock and the extent tree can be changed at will. However later on we're searching for the 'next' extent, but the extent tree might already have changed, so the information might not be accurate. In fact we can hit BUG_ON(end <= start) if the extent got inserted into the tree after the one we found and before the block we were searching for. This has been reproduced by running xfstests 225 in loop on s390x architecture, but theoretically we could hit this on any other architecture as well, but probably not as often. Moreover the extent currently in delayed allocation might be allocated after we search the extent tree and before we search extent status tree delayed buffers resulting in those delayed buffers being completely missed, even though completely written and allocated. We fix all those problems in several steps: 1. remove unnecessary callback indirection 2. rename functions ext4_ext_walk_space -> ext4_fill_fiemap_extents ext4_ext_fiemap_cb -> ext4_find_delayed_extent 3. move fiemap_fill_next_extent() into ext4_fill_fiemap_extents() 4. hold the i_data_sem for: ext4_ext_find_extent() ext4_ext_next_allocated_block() ext4_find_delayed_extent() 5. call fiemap_fill_next_extent after releasing the i_data_sem 6. move path reinitialization into the critical section. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: reimplement fiemap using extent status treeZheng Liu2012-11-091-163/+21Star
| | | | | | | Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: reimplement ext4_find_delay_alloc_range on extent status treeZheng Liu2012-11-091-99/+18Star
| | | | | | | Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: let ext4 maintain extent status treeZheng Liu2012-11-091-0/+4
| | | | | | | | | | | | | | This patch lets ext4 maintain extent status tree. Currently it only tracks delay extent status in extent status tree. When a delay allocation is issued, the related delay extent will be inserted into extent status tree. When a delay extent is written out or invalidated, it will be removed from this tree. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: fix missing call to trace_ext4_ext_map_blocks_exitZheng Liu2012-11-081-3/+4
| | | | | | | | | | | | | When ext4_ext_handle_uninitialized_extents(), we will directly return from ext4_ext_map_blocks(). The trace point of trace_ext4_ext_map_blocks_exit isn't called, and the user doesn't see any result. This patch tries to fix this problem. Meanwhile in ext4_ext_handle_uninitialized_extents it returns errors or the number of allocated blocks. So 'ret' variable can be removed due to previously modifications. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
* ext4: print map->m_flags in trace_ext4_ext/ind_map_blocks_exitZheng Liu2012-11-081-2/+1Star
| | | | | | | | | | When we use trace_ext4_ext/ind_map_blocks_exit, print the value of map->m_flags in order that we can understand the extent's current status. Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: print 'flags' in ext4_ext_handle_uninitialized_extentsZheng Liu2012-11-081-2/+2
| | | | | | | | | | | In trace_ext4_ext_handle_uninitialized_extents we don't care about the value of map->m_flags because this value is probably 0, and we prefer to get the value of flags because we can know how to handle this extent in this function. Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: race-condition protection for ext4_convert_unwritten_extents_endioDmitry Monakhov2012-10-101-11/+46
| | | | | | | | | | | | | | | We assumed that at the time we call ext4_convert_unwritten_extents_endio() extent in question is fully inside [map.m_lblk, map->m_len] because it was already split during submission. But this may not be true due to a race between writeback vs fallocate. If extent in question is larger than requested we will split it again. Special precautions should being done if zeroout required because [map.m_lblk, map->m_len] already contains valid data. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
* ext4: serialize fallocate with ext4_convert_unwritten_extentsDmitry Monakhov2012-10-051-0/+3
| | | | | | | | | | | | | | | | | | | | | Fallocate should wait for pended ext4_convert_unwritten_extents() otherwise following race may happen: ftruncate( ,12288); fallocate( ,0, 4096) io_sibmit( ,0, 4096); /* Write to fallocated area, split extent if needed */ fallocate( ,0, 8192); /* Grow extent and broke assumption about extent */ Later kwork completion will do: ->ext4_convert_unwritten_extents (0, 4096) ->ext4_map_blocks(handle, inode, &map, EXT4_GET_BLOCKS_IO_CONVERT_EXT); ->ext4_ext_map_blocks() /* Will find new extent: ex = [0,2] !!!!!! */ ->ext4_ext_handle_uninitialized_extents() ->ext4_convert_unwritten_extents_endio() /* convert [0,2] extent to initialized, but only[0,1] was written */ Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: fix ext4_flush_completed_IO wait semanticsDmitry Monakhov2012-10-051-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BUG #1) All places where we call ext4_flush_completed_IO are broken because buffered io and DIO/AIO goes through three stages 1) submitted io, 2) completed io (in i_completed_io_list) conversion pended 3) finished io (conversion done) And by calling ext4_flush_completed_IO we will flush only requests which were in (2) stage, which is wrong because: 1) punch_hole and truncate _must_ wait for all outstanding unwritten io regardless to it's state. 2) fsync and nolock_dio_read should also wait because there is a time window between end_page_writeback() and ext4_add_complete_io() As result integrity fsync is broken in case of buffered write to fallocated region: fsync blkdev_completion ->filemap_write_and_wait_range ->ext4_end_bio ->end_page_writeback <-- filemap_write_and_wait_range return ->ext4_flush_completed_IO sees empty i_completed_io_list but pended conversion still exist ->ext4_add_complete_io BUG #2) Race window becomes wider due to the 'ext4: completed_io locking cleanup V4' patch series This patch make following changes: 1) ext4_flush_completed_io() now first try to flush completed io and when wait for any outstanding unwritten io via ext4_unwritten_wait() 2) Rename function to more appropriate name. 3) Assert that all callers of ext4_flush_unwritten_io should hold i_mutex to prevent endless wait Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* ext4: fix ext_remove_space for punch_hole caseDmitry Monakhov2012-10-011-7/+9
| | | | | | | Inode is allowed to have empty leaf only if it this is blockless inode. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: punch_hole should wait for DIO writersDmitry Monakhov2012-10-011-17/+36
| | | | | | | | | | | | | | | | | | | | | | | | punch_hole is the place where we have to wait for all existing writers (writeback, aio, dio), but currently we simply flush pended end_io request which is not sufficient. Other issue is that punch_hole performed w/o i_mutex held which obviously result in dangerous data corruption due to write-after-free. This patch performs following changes: - Guard punch_hole with i_mutex - Recheck inode flags under i_mutex - Block all new dio readers in order to prevent information leak caused by read-after-free pattern. - punch_hole now wait for all writers in flight NOTE: XXX write-after-free race is still possible because new dirty pages may appear due to mmap(), and currently there is no easy way to stop writeback while punch_hole is in progress. [ Fixed error return from ext4_ext_punch_hole() to make sure that we release i_mutex before returning EPERM or ETXTBUSY -- Ted ] Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: completed_io locking cleanupDmitry Monakhov2012-09-291-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current unwritten extent conversion state-machine is very fuzzy. - For unknown reason it performs conversion under i_mutex. What for? My diagnosis: We already protect extent tree with i_data_sem, truncate and punch_hole should wait for DIO, so the only data we have to protect is end_io->flags modification, but only flush_completed_IO and end_io_work modified this flags and we can serialize them via i_completed_io_lock. Currently all these games with mutex_trylock result in the following deadlock truncate: kworker: ext4_setattr ext4_end_io_work mutex_lock(i_mutex) inode_dio_wait(inode) ->BLOCK DEADLOCK<- mutex_trylock() inode_dio_done() #TEST_CASE1_BEGIN MNT=/mnt_scrach unlink $MNT/file fallocate -l $((1024*1024*1024)) $MNT/file aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file sleep 2 truncate -s 0 $MNT/file #TEST_CASE1_END Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286 This patch makes state machine simple and clean: (1) xxx_end_io schedule final extent conversion simply by calling ext4_add_complete_io(), which append it to ei->i_completed_io_list NOTE1: because of (2A) work should be queued only if ->i_completed_io_list was empty, otherwise the work is scheduled already. (2) ext4_flush_completed_IO is responsible for handling all pending end_io from ei->i_completed_io_list Flushing sequence consists of following stages: A) LOCKED: Atomically drain completed_io_list to local_list B) Perform extents conversion C) LOCKED: move converted io's to to_free list for final deletion This logic depends on context which we was called from. D) Final end_io context destruction NOTE1: i_mutex is no longer required because end_io->flags modification is protected by ei->ext4_complete_io_lock Full list of changes: - Move all completion end_io related routines to page-io.c in order to improve logic locality - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io() - remove EXT4_IO_END_FSYNC - Improve SMP scalability by removing useless i_mutex which does not protect io->flags anymore. - Reduce lock contention on i_completed_io_lock by optimizing list walk. - Rename ext4_end_io_nolock to end4_end_io and make it static - Check flush completion status to ext4_ext_punch_hole(). Because it is not good idea to punch blocks from corrupted inode. Changes since V3 (in request to Jan's comments): Fall back to active flush_completed_IO() approach in order to prevent performance issues with nolocked DIO reads. Changes since V2: Fix use-after-free caused by race truncate vs end_io_work Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: fix unwritten counter leakageDmitry Monakhov2012-09-291-7/+14
| | | | | | | | | | | | | | | | | ext4_set_io_unwritten_flag() will increment i_unwritten counter, so once we mark end_io with EXT4_END_IO_UNWRITTEN we have to revert it back on error path. - add missed error checks to prevent counter leakage - ext4_end_io_nolock() will clear EXT4_END_IO_UNWRITTEN flag to signal that conversion finished. - add BUG_ON to ext4_free_end_io() to prevent similar leakage in future. Visible effect of this bug is that unaligned aio_stress may deadlock Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: ext4_inode_info dietDmitry Monakhov2012-09-291-2/+2
| | | | | | | | | | | | | Generic inode has unused i_private pointer which may be used as cur_aio_dio storage. TODO: If cur_aio_dio will be passed as an argument to get_block_t this allow to have concurent AIO_DIO requests. Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: convert to use leXX_add_cpu()Wei Yongjun2012-09-271-1/+1
| | | | | | | | | | Convert cpu_to_leXX(leXX_to_cpu(E1) + E2) to use leXX_add_cpu(). dpatch engine is used to auto generate this patch. (https://github.com/weiyj/dpatch) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: remove unused function ext4_ext_check_cacheLukas Czerner2012-09-271-39/+9Star
| | | | | | | | | Remove unused function ext4_ext_check_cache() and merge the code back to the ext4_ext_in_cache(). Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: speed up truncate/unlink by not using bforget() unless neededAndrey Sidorov2012-09-191-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Do not iterate over data blocks scanning for bh's to forget as they're never exist. This improves time taken by unlink / truncate syscall. Tested by continuously truncating file that is being written by dd. Another test is rm -rf of linux tree while tar unpacks it. With ordered data mode condition unlikely(!tbh) was always met in ext4_free_blocks. With journal data mode tbh was found only few times, so optimisation is also possible. Unlinking fallocated 60G file after doing sync && echo 3 > /proc/sys/vm/drop_caches && time rm --help X86 before (linux 3.6-rc4): # time rm -f test1 real 0m2.710s user 0m0.000s sys 0m1.530s X86 after: # time rm -f test1 real 0m0.644s user 0m0.003s sys 0m0.060s MIPS before (linux 2.6.37): # time rm -f test1 real 0m 4.93s user 0m 0.00s sys 0m 4.61s MIPS after: # time rm -f test1 real 0m 0.16s user 0m 0.00s sys 0m 0.06s Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrey Sidorov <qrxd43@motorola.com>
* ext4: fix trivial typo in commentWang Sheng-Hui2012-08-191-1/+1
| | | | | Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: no need to add inode to orphan list during hole punchAshish Sangwan2012-08-191-4/+0Star
| | | | | | | | | | While performing punch hole for an inode, i_disksize is not changed. So, there is no need to add the inode to orphan list. Signed-off-by: Ashish Sangwan <ashish.sangwan2@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Acked-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: make the zero-out chunk size tunableZheng Liu2012-08-171-12/+13
| | | | | | | | | | | | | | | | | Currently in ext4 the length of zero-out chunk is set to 7 file system blocks. But if an inode has uninitailized extents from using fallocate to preallocate space, and the workload issues many random writes, this can cause a fragmented extent tree that will unnecessarily grow the extent tree. So create a new sysfs tunable, extent_max_zeroout_kb, which controls the maximum size where blocks will be zeroed out instead of creating a new uninitialized extent. The default of this has been sent to 32kb. CC: Zach Brown <zab@zabbo.net> CC: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: collapse a single extent tree block into the inode if possibleTheodore Ts'o2012-08-171-14/+58
| | | | | | | | | | | | | | | If an inode has more than 4 extents, but then later some of the extents are merged together, we can optimize the file system by moving the extents up into the inode, and discarding the extent tree block. This is important, because if there are a large number of inodes with an external extent tree blocks where the contents could fit in the inode, this can significantly increase the fsck time of the file system. Google-Bug-Id: 6801242 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: fix kernel BUG on large-scale rm -rf commandsTheodore Ts'o2012-08-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 968dee7722: "ext4: fix hole punch failure when depth is greater than 0" introduced a regression in v3.5.1/v3.6-rc1 which caused kernel crashes when users ran run "rm -rf" on large directory hierarchy on ext4 filesystems on RAID devices: BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 Process rm (pid: 18229, threadinfo ffff8801276bc000, task ffff880123631710) Call Trace: [<ffffffff81236483>] ? __ext4_handle_dirty_metadata+0x83/0x110 [<ffffffff812353d3>] ext4_ext_truncate+0x193/0x1d0 [<ffffffff8120a8cf>] ? ext4_mark_inode_dirty+0x7f/0x1f0 [<ffffffff81207e05>] ext4_truncate+0xf5/0x100 [<ffffffff8120cd51>] ext4_evict_inode+0x461/0x490 [<ffffffff811a1312>] evict+0xa2/0x1a0 [<ffffffff811a1513>] iput+0x103/0x1f0 [<ffffffff81196d84>] do_unlinkat+0x154/0x1c0 [<ffffffff8118cc3a>] ? sys_newfstatat+0x2a/0x40 [<ffffffff81197b0b>] sys_unlinkat+0x1b/0x50 [<ffffffff816135e9>] system_call_fastpath+0x16/0x1b Code: 8b 4d 20 0f b7 41 02 48 8d 04 40 48 8d 04 81 49 89 45 18 0f b7 49 02 48 83 c1 01 49 89 4d 00 e9 ae f8 ff ff 0f 1f 00 49 8b 45 28 <48> 8b 40 28 49 89 45 20 e9 85 f8 ff ff 0f 1f 80 00 00 00 RIP [<ffffffff81233164>] ext4_ext_remove_space+0xa34/0xdf0 This could be reproduced as follows: The problem in commit 968dee7722 was that caused the variable 'i' to be left uninitialized if the truncate required more space than was available in the journal. This resulted in the function ext4_ext_truncate_extend_restart() returning -EAGAIN, which caused ext4_ext_remove_space() to restart the truncate operation after starting a new jbd2 handle. Reported-by: Maciej Żenczykowski <maze@google.com> Reported-by: Marti Raudsepp <marti@juffo.org> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
* ext4: fix hole punch failure when depth is greater than 0Ashish Sangwan2012-07-231-17/+29
| | | | | | | | | | | | | | | | | | | | | | | Whether to continue removing extents or not is decided by the return value of function ext4_ext_more_to_rm() which checks 2 conditions: a) if there are no more indexes to process. b) if the number of entries are decreased in the header of "depth -1". In case of hole punch, if the last block to be removed is not part of the last extent index than this index will not be deleted, hence the number of valid entries in the extent header of "depth - 1" will remain as it is and ext4_ext_more_to_rm will return 0 although the required blocks are not yet removed. This patch fixes the above mentioned problem as instead of removing the extents from the end of file, it starts removing the blocks from the particular extent from which removing blocks is actually required and continue backward until done. Signed-off-by: Ashish Sangwan <ashish.sangwan2@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Cc: stable@vger.kernel.org
* ext4: fix out-of-date comments in extents.cHaiboLiu2012-07-091-2/+1Star
| | | | | | | | | | In this patch, ext4_ext_try_to_merge has been change to merge an extent both left and right. So we need to update the comment in here. Signed-off-by: HaiboLiu <HaiboLiu6@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: honor O_(D)SYNC semantic in ext4_fallocate()Zheng Liu2012-07-011-0/+2
| | | | | | | | | | | | Ext4 must make sure the transaction to be commited to the disk when user opens a file with O_(D)SYNC flag and do a fallocate(2) call. This problem had been reported by Christoph Hellwig in this thread: http://www.spinics.net/lists/linux-btrfs/msg13621.html Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: hole-punch use truncate_pagecache_rangeHugh Dickins2012-06-011-2/+2
| | | | | | | | | | | | | When truncating a file, we unmap pages from userspace first, as that's usually more efficient than relying, page by page, on the fallback in truncate_inode_page() - particularly if the file is mapped many times. Do the same when punching a hole: 3.4 added truncate_pagecache_range() to do the unmap and trunc, so use it in ext4_ext_punch_hole(), instead of calling truncate_inode_pages_range() directly. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: fix format flag in ext4_ext_binsearch_idx()Zheng Liu2012-05-281-1/+1
| | | | | | | fix ext_debug format flag in ext4_ext_binsearch_idx(). Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: verify and calculate checksums for extent tree blocksDarrick J. Wong2012-04-301-0/+50
| | | | | | | | | | Calculate and verify the checksum for each extent tree block. The checksum is located in the space immediately after the last possible ext4_extent in the block. The space is is typically the last 4-8 bytes in the block. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* ext4: create a new BH_Verified flag to avoid unnecessary metadata validationDarrick J. Wong2012-04-301-9/+26
| | | | | | | | | | Create a new BH_Verified flag to indicate that we've verified all the data in a buffer_head for correctness. This allows us to bypass expensive verification steps when they are not necessary without missing them when they are. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>