summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/extent_io.c
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus' of ↵Linus Torvalds2013-05-091-144/+166
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs update from Chris Mason: "These are mostly fixes. The biggest exceptions are Josef's skinny extents and Jan Schmidt's code to rebuild our quota indexes if they get out of sync (or you enable quotas on an existing filesystem). The skinny extents are off by default because they are a new variation on the extent allocation tree format. btrfstune -x enables them, and the new format makes the extent allocation tree about 30% smaller. I rebased this a few days ago to rework Dave Sterba's crc checks on the super block, but almost all of these go back to rc6, since I though 3.9 was due any minute. The biggest missing fix is the tracepoint bug that was hit late in 3.9. I ran into problems with that in overnight testing and I'm still tracking it down. I'll definitely have that fixed for rc2." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (101 commits) Btrfs: allow superblock mismatch from older mkfs btrfs: enhance superblock checks btrfs: fix misleading variable name for flags btrfs: use unsigned long type for extent state bits Btrfs: improve the loop of scrub_stripe btrfs: read entire device info under lock btrfs: remove unused gfp mask parameter from release_extent_buffer callchain btrfs: handle errors returned from get_tree_block_key btrfs: make static code static & remove dead code Btrfs: deal with errors in write_dev_supers Btrfs: remove almost all of the BUG()'s from tree-log.c Btrfs: deal with free space cache errors while replaying log Btrfs: automatic rescan after "quota enable" command Btrfs: rescan for qgroups Btrfs: split btrfs_qgroup_account_ref into four functions Btrfs: allocate new chunks if the space is not enough for global rsv Btrfs: separate sequence numbers for delayed ref tracking and tree mod log btrfs: move leak debug code to functions Btrfs: return free space in cow error path Btrfs: set UUID in root_item for created trees ...
| * btrfs: use unsigned long type for extent state bitsDavid Sterba2013-05-061-24/+25
| | | | | | | | | | Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * btrfs: remove unused gfp mask parameter from release_extent_buffer callchainDavid Sterba2013-05-061-8/+5Star
| | | | | | | | | | | | | | It's unused since 0b32f4bbb423f02ac. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * btrfs: make static code static & remove dead codeEric Sandeen2013-05-061-50/+11Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Big patch, but all it does is add statics to functions which are in fact static, then remove the associated dead-code fallout. removed functions: btrfs_iref_to_path() __btrfs_lookup_delayed_deletion_item() __btrfs_search_delayed_insertion_item() __btrfs_search_delayed_deletion_item() find_eb_for_page() btrfs_find_block_group() range_straddles_pages() extent_range_uptodate() btrfs_file_extent_length() btrfs_scrub_cancel_devid() btrfs_start_transaction_lflush() btrfs_print_tree() is left because it is used for debugging. btrfs_start_transaction_lflush() and btrfs_reada_detach() are left for symmetry. ulist.c functions are left, another patch will take care of those. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * btrfs: move leak debug code to functionsEric Sandeen2013-05-061-55/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up the leak debugging in extent_io.c by moving the debug code into functions. This also removes the list_heads used for debugging from the extent_buffer and extent_state structures when debug is not enabled. Since we need a global debug config to do that last part, implement CONFIG_BTRFS_DEBUG to accommodate. Thanks to Dave Sterba for the Kconfig bit. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * Btrfs: cleanup destroy_marked_extentsJosef Bacik2013-05-061-1/+1
| | | | | | | | | | | | | | | | We can just look up the extent_buffers for the range and free stuff that way. This makes the cleanup a bit cleaner and we can make sure to evict the extent_buffers pretty quickly by marking them as stale. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * Btrfs: use REQ_META for all metadata IOJosef Bacik2013-05-061-8/+10
| | | | | | | | | | | | | | We need to tag metadata io with REQ_META to avoid priority inversion when using io throttling cqroups. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * Btrfs: improve the performance of the csums lookupMiao Xie2013-05-061-0/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | It is very likely that there are several blocks in bio, it is very inefficient if we get their csums one by one. This patch improves this problem by getting the csums in batch. According to the result of the following test, the execute time of __btrfs_lookup_bio_sums() is down by ~28%(300us -> 217us). # dd if=<mnt>/file of=/dev/null bs=1M count=1024 Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * Btrfs: pass NULL instead of 0Liu Bo2013-05-061-1/+1
| | | | | | | | | | | | | | set_extent_bit()'s (u64 *failed_start) expects NULL not 0. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
* | Merge branch 'writeback-workqueue' of ↵Jens Axboe2013-04-021-0/+33
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.10/core Tejun writes: ----- This is the pull request for the earlier patchset[1] with the same name. It's only three patches (the first one was committed to workqueue tree) but the merge strategy is a bit involved due to the dependencies. * Because the conversion needs features from wq/for-3.10, block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those workqueue conflicts from flaring up in block tree. * Resolving the issue that Jan and Dave raised about debugging requires arch-wide changes. The patchset is being worked on[2] but it'll have to go through -mm after these changes show up in -next, and not included in this pull request. The three commits are located in the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue Pulling it into block/for-3.10/core produces a conflict in drivers/md/raid5.c between the following two commits. e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available") 2f6db2a707 ("raid5: use bio_reset()") The conflict is trivial - one removes an "if ()" conditional while the other removes "rbi->bi_next = NULL" right above it. We just need to remove both. The merged branch is available at git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge so that you can use it for verification. The test merge commit has proper merge description. While these changes are a bit of pain to route, they make code simpler and even have, while minute, measureable performance gain[3] even on a workload which isn't particularly favorable to showing the benefits of this conversion. ---- Fixed up the conflict. Conflicts: drivers/md/raid5.c Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * Btrfs: fix race between mmap writes and compressionChris Mason2013-03-261-0/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Btrfs uses page_mkwrite to ensure stable pages during crc calculations and mmap workloads. We call clear_page_dirty_for_io before we do any crcs, and this forces any application with the file mapped to wait for the crc to finish before it is allowed to change the file. With compression on, the clear_page_dirty_for_io step is happening after we've compressed the pages. This means the applications might be changing the pages while we are compressing them, and some of those modifications might not hit the disk. This commit adds the clear_page_dirty_for_io before compression starts and makes sure to redirty the page if we have to fallback to uncompressed IO as well. Signed-off-by: Chris Mason <chris.mason@fusionio.com> Reported-by: Alexandre Oliva <oliva@gnu.org> cc: stable@vger.kernel.org
* | block: Add bio_end_sector()Kent Overstreet2013-03-231-2/+1Star
|/ | | | | | | | | | | | | | | | | | | | Just a little convenience macro - main reason to add it now is preparing for immutable bio vecs, it'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Lars Ellenberg <drbd-dev@lists.linbit.com> CC: Jiri Kosina <jkosina@suse.cz> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Neil Brown <neilb@suse.de> CC: Martin Schwidefsky <schwidefsky@de.ibm.com> CC: Heiko Carstens <heiko.carstens@de.ibm.com> CC: linux-s390@vger.kernel.org CC: Chris Mason <chris.mason@fusionio.com> CC: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
* btrfs: fixup/remove module.h usage as requiredPaul Gortmaker2013-03-011-1/+0Star
| | | | | | | | | | We want to avoid module.h where posible, since it in turn includes nearly all of header space. This means removing it where it is not required, and using export.h where we are only exporting symbols via EXPORT_SYMBOL and friends. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* btrfs: use only inline_pages from extent bufferDavid Sterba2013-02-281-15/+6Star
| | | | | | | | | | The nodesize is capped at 64k and there are enough pages preallocated in extent_buffer::inline_pages. The fallback to kmalloc never happened because even on the smallest page size considered (4k) inline_pages covered the needs. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
* btrfs: cleanup for open-coded alignmentQu Wenruo2013-02-261-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Though most of the btrfs codes are using ALIGN macro for page alignment, there are still some codes using open-coded alignment like the following: ------ u64 mask = ((u64)root->stripesize - 1); u64 ret = (val + mask) & ~mask; ------ Or even hidden one: ------ num_bytes = (end - start + blocksize) & ~(blocksize - 1); ------ Sometimes these open-coded alignment is not so easy to understand for newbie like me. This commit changes the open-coded alignment to the ALIGN macro for a better readability. Also there is a previous patch from David Sterba with similar changes, but the patch is for 3.2 kernel and seems not merged. http://www.spinics.net/lists/linux-btrfs/msg12747.html Cc: David Sterba <dave@jikos.cz> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
* Merge branch 'raid56-experimental' into for-linus-3.9Chris Mason2013-02-201-10/+30
|\ | | | | | | | | | | | | | | | | | | Signed-off-by: Chris Mason <chris.mason@fusionio.com> Conflicts: fs/btrfs/ctree.h fs/btrfs/extent-tree.c fs/btrfs/inode.c fs/btrfs/volumes.c
| * Btrfs: reduce lock contention on extent buffer locksChris Mason2013-02-011-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | The extent buffers have a refs_lock which we use to make coordinate freeing the extent buffer with operations on the radix tree. On tree roots and other extent buffers that very cache hot, this can be highly contended. These are also the extent buffers that are basically pinned in memory. This commit adds code to cmpxchg our way through the ref modifications, and as long as the result of the reference change is still pinned in ram, we skip the expensive spinlock. Signed-off-by: Chris Mason <chris.mason@fusionio.com>
| * Btrfs: RAID5 and RAID6David Woodhouse2013-02-011-7/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This builds on David Woodhouse's original Btrfs raid5/6 implementation. The code has changed quite a bit, blame Chris Mason for any bugs. Read/modify/write is done after the higher levels of the filesystem have prepared a given bio. This means the higher layers are not responsible for building full stripes, and they don't need to query for the topology of the extents that may get allocated during delayed allocation runs. It also means different files can easily share the same stripe. But, it does expose us to incorrect parity if we crash or lose power while doing a read/modify/write cycle. This will be addressed in a later commit. Scrub is unable to repair crc errors on raid5/6 chunks. Discard does not work on raid5/6 (yet) The stripe size is fixed at 64KiB per disk. This will be tunable in a later commit. Signed-off-by: Chris Mason <chris.mason@fusionio.com>
| * Btrfs: add rw argument to merge_bio_hook()David Woodhouse2013-02-011-3/+3
| | | | | | | | | | | | | | | | We'll want to merge writes so they can fill a full RAID[56] stripe, but not necessarily reads. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* | Btrfs: remove unused extent io tree ops V2Josef Bacik2013-02-201-24/+11Star
| | | | | | | | | | | | | | Nobody uses these io tree ops anymore so just remove them and clean up the code a bit. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
* | Btrfs: use percpu counter for dirty metadata countMiao Xie2013-02-201-6/+3Star
| | | | | | | | | | | | | | | | | | | | | | | | | | ->dirty_metadata_bytes is accessed very frequently, so use percpu counter instead of the u64 variant to reduce the contention of the lock. This patch also fixed the problem that we access it without lock protection in __btrfs_btree_balance_dirty(), which may cause we skip the dirty pages flush. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
* | Btrfs: use wrapper page_offsetMiao Xie2013-02-201-13/+11Star
|/ | | | | | | | Use wrapper page_offset to get byte-offset into filesystem object for page. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
* Btrfs: handle errors from btrfs_map_bio() everywhereStefan Behrens2012-12-121-4/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the addition of the device replace procedure, it is possible for btrfs_map_bio(READ) to report an error. This happens when the specific mirror is requested which is located on the target disk, and the copy operation has not yet copied this block. Hence the block cannot be read and this error state is indicated by returning EIO. Some background information follows now. A new mirror is added while the device replace procedure is running. btrfs_get_num_copies() returns one more, and btrfs_map_bio(GET_READ_MIRROR) adds one more mirror if a disk location is involved that was already handled by the device replace copy operation. The assigned mirror num is the highest mirror number, e.g. the value 3 in case of RAID1. If btrfs_map_bio() is invoked with mirror_num == 0 (i.e., select any mirror), the copy on the target drive is never selected because that disk shall be able to perform the write requests as quickly as possible. The parallel execution of read requests would only slow down the disk copy procedure. Second case is that btrfs_map_bio() is called with mirror_num > 0. This is done from the repair code only. In this case, the highest mirror num is assigned to the target disk, since it is used last. And when this mirror is not available because the copy procedure has not yet handled this area, an error is returned. Everywhere in the code the handling of such errors is added now. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: pass fs_info to btrfs_map_block() instead of mapping_treeStefan Behrens2012-12-121-10/+9Star
| | | | | | | | | This is required for the device replace procedure in a later step. Two calling functions also had to be changed to have the fs_info pointer: repair_io_failure() and scrub_setup_recheck_block(). Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: Pass fs_info to btrfs_num_copies() instead of mapping_treeStefan Behrens2012-12-121-6/+5Star
| | | | | | | This is required for the device replace procedure in a later step. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* fs/btrfs: use WARNJulia Lawall2012-12-121-6/+3Star
| | | | | | | | | | | | | | | | | | | | | | Use WARN rather than printk followed by WARN_ON(1), for conciseness. A simplified version of the semantic patch that makes this transformation is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression list es; @@ -printk( +WARN(1, es); -WARN_ON(1); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2012-10-261-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This has our series of fixes for the next rc. The biggest batch is from Jan Schmidt, fixing up some problems in our subvolume quota code and fixing btrfs send/receive to work with the new extended inode refs." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: do not bug when we fail to commit the transaction Btrfs: fix memory leak when cloning root's node Btrfs: Use btrfs_update_inode_fallback when creating a snapshot Btrfs: Send: preserve ownership (uid and gid) also for symlinks. Btrfs: fix deadlock caused by the nested chunk allocation btrfs: Return EINVAL when length to trim is less than FSB Btrfs: fix memory leak in btrfs_quota_enable() Btrfs: send correct rdev and mode in btrfs-send Btrfs: extended inode refs support for send mechanism Btrfs: Fix wrong error handling code Fix a sign bug causing invalid memory access in the ino_paths ioctl. Btrfs: comment for loop in tree_mod_log_insert_move Btrfs: fix extent buffer reference for tree mod log roots Btrfs: determine level of old roots Btrfs: tree mod log's old roots could still be part of the tree Btrfs: fix a tree mod logging issue for root replacement operations Btrfs: don't put removals from push_node_left into tree mod log twice
| * Btrfs: Fix wrong error handling codeStefan Behrens2012-10-251-2/+2
| | | | | | | | | | | | | | gcc says "warning: comparison of unsigned expression >= 0 is always true" because i is an unsigned long. And gcc is right this time. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
* | Merge branch 'for-linus' of ↵Linus Torvalds2012-10-101-34/+94
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs update from Chris Mason: "This is a large pull, with the bulk of the updates coming from: - Hole punching - send/receive fixes - fsync performance - Disk format extension allowing more hardlinks inside a single directory (btrfs-progs patch required to enable the compat bit for this one) I'm cooking more unrelated RAID code, but I wanted to make sure this original batch makes it in. The largest updates here are relatively old and have been in testing for some time." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (121 commits) btrfs: init ref_index to zero in add_inode_ref Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer Btrfs: fix page leakage Btrfs: do not warn_on when we cannot alloc a page for an extent buffer Btrfs: don't bug on enomem in readpage Btrfs: cleanup pages properly when ENOMEM in compression Btrfs: make filesystem read-only when submitting barrier fails Btrfs: detect corrupted filesystem after write I/O errors Btrfs: make compress and nodatacow mount options mutually exclusive btrfs: fix message printing Btrfs: don't bother committing delayed inode updates when fsyncing btrfs: move inline function code to header file Btrfs: remove unnecessary IS_ERR in bio_readpage_error() btrfs: remove unused function btrfs_insert_some_items() Btrfs: don't commit instead of overcommitting Btrfs: confirmation of value is added before trace_btrfs_get_extent() is called Btrfs: be smarter about dropping things from the tree log Btrfs: don't lookup csums for prealloc extents Btrfs: cache extent state when writing out dirty metadata pages Btrfs: do not hold the file extent leaf locked when adding extent item ...
| * Btrfs: fix page leakageJosef Bacik2012-10-091-1/+1
| | | | | | | | | | | | | | | | Alloc_dummy_extent_buffer will not free the first page in the eb array if we fail to allocate a page, fix this. Thanks, Reported-by: David Sterba <dave@jikos.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * Btrfs: do not warn_on when we cannot alloc a page for an extent bufferJosef Bacik2012-10-091-3/+1Star
| | | | | | | | | | | | | | | | It's just annoying and the user will have gotten a nice OOM killer message so they are already fully aware they are screwed :). Thanks, Reported-by: Jérôme Poulin <jeromepoulin@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * Btrfs: don't bug on enomem in readpageJosef Bacik2012-10-091-4/+7
| | | | | | | | | | | | | | Get rid of the BUG_ON(ret == -ENOMEM) in __extent_read_full_page. Thanks, Reported-by: Jérôme Poulin <jeromepoulin@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * btrfs: move inline function code to header fileRobin Dong2012-10-091-12/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | When building btrfs from kernel code, it will report: fs/btrfs/extent_io.h:281: warning: 'extent_buffer_page' declared inline after being called fs/btrfs/extent_io.h:281: warning: previous declaration of 'extent_buffer_page' was here fs/btrfs/extent_io.h:280: warning: 'num_extent_pages' declared inline after being called fs/btrfs/extent_io.h:280: warning: previous declaration of 'num_extent_pages' was here because of the wrong declaration of inline functions. Signed-off-by: Robin Dong <sanbai@taobao.com>
| * Btrfs: remove unnecessary IS_ERR in bio_readpage_error()Tsutomu Itoh2012-10-091-1/+1
| | | | | | | | | | | | | | Because the value of extent_map is only a correct value or NULL, so IS_ERR is unnecessary. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
| * Btrfs: cache extent state when writing out dirty metadata pagesJosef Bacik2012-10-091-2/+41
| | | | | | | | | | | | | | | | | | | | Everytime we write out dirty pages we search for an offset in the tree, convert the bits in the state, and then when we wait we search for the offset again and clear the bits. So for every dirty range in the io tree we are doing 4 rb searches, which is suboptimal. With this patch we are only doing 2 searches for every cycle (modulo weird things happening). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * Btrfs: do not async metadata csumming in certain situationsJosef Bacik2012-10-091-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | There are a coule scenarios where farming metadata csumming off to an async thread doesn't help. The first is if our processor supports crc32c, in which case the csumming will be fast and so the overhead of the async model is not worth the cost. The other case is for our tree log. We will be making that stuff dirty and writing it out and waiting for it immediately. Even with software crc32c this gives me a ~15% increase in speed with O_SYNC workloads. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * Btrfs: fix race when getting the eb out of page->privateJosef Bacik2012-10-041-4/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | We can race when checking wether PagePrivate is set on a page and we actually have an eb saved in the pages private pointer. We could have easily written out this page and released it in the time that we did the pagevec lookup and actually got around to looking at this page. So use mapping->private_lock to ensure we get a consistent view of the page->private pointer. This is inline with the alloc and releasepage paths which use private_lock when manipulating page->private. Thanks, Reported-by: David Sterba <dave@jikos.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * btrfs: Kill some bi_idx referencesKent Overstreet2012-10-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For immutable bio vecs, I've been auditing and removing bi_idx references. These were harmless, but removing them will make auditing easier. scrub_bio_end_io_worker() was open coding a bio_reset() - but this doesn't appear to have been needed for anything as right after it does a bio_put(), and perusing the code it doesn't appear anything else was holding a reference to the bio. The other use end_bio_extent_readpage() was just for a pr_debug() - changed it to something that might be a bit more useful. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Chris Mason <chris.mason@oracle.com> CC: Stefan Behrens <sbehrens@giantdisaster.de>
| * btrfs: polish names of kmem cachesDavid Sterba2012-10-011-2/+2
| | | | | | | | | | | | | | | | | | | | Usecase: watch 'grep btrfs < /proc/slabinfo' easy to watch all caches in one go. Signed-off-by: David Sterba <dsterba@suse.cz>
| * Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defragLiu Bo2012-10-011-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | We're going to use this flag EXTENT_DEFRAG to indicate which range belongs to defragment so that we can implement snapshow-aware defrag: We set the EXTENT_DEFRAG flag when dirtying the extents that need defragmented, so later on writeback thread can differentiate between normal writeback and writeback started by defragmentation. Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
| * Btrfs: fix btrfs send for inline items and compressionChris Mason2012-10-011-1/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The btrfs send code was assuming the offset of the file item into the extent translated to bytes on disk. If we're compressed, this isn't true, and so it was off into extents owned by other files. It was also improperly handling inline extents. This solves a crash where we may have gone past the end of the file extent item by not testing early enough for an inline extent. It also solves problems where we have a whole between the end of the inline item and the start of the full extent. Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* | fs: push rcu_barrier() from deactivate_locked_super() to filesystemsKirill A. Shutemov2012-10-031-0/+6
|/ | | | | | | | | | | | | | | There's no reason to call rcu_barrier() on every deactivate_locked_super(). We only need to make sure that all delayed rcu free inodes are flushed before we destroy related cache. Removing rcu_barrier() from deactivate_locked_super() affects some fast paths. E.g. on my machine exit_group() of a last process in IPC namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of ↵Linus Torvalds2012-08-291-15/+2Star
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I've split out the big send/receive update from my last pull request and now have just the fixes in my for-linus branch. The send/recv branch will wander over to linux-next shortly though. The largest patches in this pull are Josef's patches to fix DIO locking problems and his patch to fix a crash during balance. They are both well tested. The rest are smaller fixes that we've had queued. The last rc came out while I was hacking new and exciting ways to recover from a misplaced rm -rf on my dev box, so these missed rc3." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits) Btrfs: fix that repair code is spuriously executed for transid failures Btrfs: fix ordered extent leak when failing to start a transaction Btrfs: fix a dio write regression Btrfs: fix deadlock with freeze and sync V2 Btrfs: revert checksum error statistic which can cause a BUG() Btrfs: remove superblock writing after fatal error Btrfs: allow delayed refs to be merged Btrfs: fix enospc problems when deleting a subvol Btrfs: fix wrong mtime and ctime when creating snapshots Btrfs: fix race in run_clustered_refs Btrfs: don't run __tree_mod_log_free_eb on leaves Btrfs: increase the size of the free space cache Btrfs: barrier before waitqueue_active Btrfs: fix deadlock in wait_for_more_refs btrfs: fix second lock in btrfs_delete_delayed_items() Btrfs: don't allocate a seperate csums array for direct reads Btrfs: do not strdup non existent strings Btrfs: do not use missing devices when showing devname Btrfs: fix that error value is changed by mistake Btrfs: lock extents as we map them in DIO ...
| * Btrfs: revert checksum error statistic which can cause a BUG()Stefan Behrens2012-08-281-15/+2Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 442a4f6308e694e0fa6025708bd5e4e424bbf51c added btrfs device statistic counters for detected IO and checksum errors to Linux 3.5. The statistic part that counts checksum errors in end_bio_extent_readpage() can cause a BUG() in a subfunction: "kernel BUG at fs/btrfs/volumes.c:3762!" That part is reverted with the current patch. However, the counting of checksum errors in the scrub context remains active, and the counting of detected IO errors (read, write or flush errors) in all contexts remains active. Cc: stable <stable@vger.kernel.org> # 3.5 Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2012-07-261-16/+42
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull large btrfs update from Chris Mason: "This pull request is very large, and the two main features in here have been under testing/devel for quite a while. We have subvolume quotas from the strato developers. This enables full tracking of how many blocks are allocated to each subvolume (and all snapshots) and you can set limits on a per-subvolume basis. You can also create quota groups and toss multiple subvolumes into a big group. It's everything you need to be a web hosting company and give each user their own subvolume. The userland side of the quotas is being refreshed, they'll send out details on where to grab it soon. Next is the kernel side of btrfs send/receive from Alexander Block. This leverages the same infrastructure as the quota code to figure out relationships between blocks and their owners. It can then compute the difference between two snapshots and sends the diffs in a neutral format into userland. The basic model: create a snapshot send that snapshot as the initial backup make changes create a second snapshot send the incremental as a backup delete the first snapshot (use the second snapshot for the next incremental) The receive portion is all in userland, and in the 'next' branch of my btrfs-progs repo. There's still some work to do in terms of optimizing the send side from kernel to userland. The really important part is figuring out how two snapshots are different, and this is where we are concentrating right now. The initial send of a dataset is a little slower than tar, but the incremental sends are dramatically faster than what rsync can do. On top of all of that, we have a nice queue of fixes, cleanups and optimizations." Fix up trivial modify/del conflict in fs/btrfs/ioctl.c Also fix up semantic conflict in fs/btrfs/send.c: the interface to dentry_open() changed in commit 765927b2d508 ("switch dentry_open() to struct path, make it grab references itself"), and since it now grabs whatever references it needs, we should no longer do the mntget() on the mnt (and we need to dput() the dentry reference we took). * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (65 commits) Btrfs: uninit variable fixes in send/receive Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive Btrfs: add btrfs_compare_trees function Btrfs: introduce subvol uuids and times Btrfs: make iref_to_path non static Btrfs: add a barrier before a waitqueue_active check Btrfs: call the ordered free operation without any locks held Btrfs: Check INCOMPAT flags on remount and add helper function Btrfs: add helper for tree enumeration btrfs: allow cross-subvolume file clone Btrfs: improve multi-thread buffer read Btrfs: make btrfs's allocation smoothly with preallocation Btrfs: lock the transition from dirty to writeback for an eb Btrfs: fix potential race in extent buffer freeing Btrfs: don't return true in releasepage unless we actually freed the eb Btrfs: suppress printk() if all device I/O stats are zero Btrfs: remove unwanted printk() for btrfs device I/O stats Btrfs: rewrite BTRFS_SETGET_FUNCS Btrfs: zero unused bytes in inode item Btrfs: kill free_space pointer from inode structure ... Conflicts: fs/btrfs/ioctl.c
| * Btrfs: improve multi-thread buffer readLiu Bo2012-07-231-5/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While testing with my buffer read fio jobs[1], I find that btrfs does not perform well enough. Here is a scenario in fio jobs: We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file, and all of them will race on add_to_page_cache_lru(), and if one thread successfully puts its page into the page cache, it takes the responsibility to read the page's data. And what's more, reading a page needs a period of time to finish, in which other threads can slide in and process rest pages: t1 t2 t3 t4 add Page1 read Page1 add Page2 | read Page2 add Page3 | | read Page3 add Page4 | | | read Page4 -----|------------|-----------|-----------|-------- v v v v bio bio bio bio Now we have four bios, each of which holds only one page since we need to maintain consecutive pages in bio. Thus, we can end up with far more bios than we need. Here we're going to a) delay the real read-page section and b) try to put more pages into page cache. With that said, we can make each bio hold more pages and reduce the number of bios we need. Here is some numbers taken from fio results: w/o patch w patch ------------- -------- --------------- READ: 745MB/s +25% 934MB/s [1]: [global] group_reporting thread numjobs=4 bs=32k rw=read ioengine=sync directory=/mnt/btrfs/ [READ] filename=foobar size=2000M invalidate=1 Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * Btrfs: lock the transition from dirty to writeback for an ebJosef Bacik2012-07-231-0/+9
| | | | | | | | | | | | | | | | | | | | | | There is a small window where an eb can have no IO bits set on it, which could potentially result in extent_buffer_under_io() returning false when we want it to return true, which could result in not fun things happening. So in order to protect this case we need to hold the refs_lock when we make this transition to make sure we get reliable results out of extent_buffer_udner_io(). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * Btrfs: fix potential race in extent buffer freeingJosef Bacik2012-07-231-6/+3Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This sounds sort of impossible but it is the only thing I can think of and at the very least it is theoretically possible so here it goes. If we are in try_release_extent_buffer we will check that the ref count on the extent buffer is 1 and not under IO, and then go down and clear the tree ref. If between this check and clearing the tree ref somebody else comes in and grabs a ref on the eb and the marks it dirty before try_release_extent_buffer() does it's tree ref clear we can end up with a dirty eb that will be freed while it is still dirty which will result in a panic. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * Btrfs: don't return true in releasepage unless we actually freed the ebJosef Bacik2012-07-231-4/+5
| | | | | | | | | | | | | | | | | | | | | | I noticed while looking at an extent_buffer race that we will unconditionally return 1 if we get down to release_extent_buffer after clearing the tree ref. However we can easily race in here and get a ref on the eb and not actually free the eb. So make release_extent_buffer return 1 if it free'd the eb and 0 if not so we can be a little kinder to the vm. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * btrfs read error corrected message floods the console during recoveryAnand Jain2012-07-231-1/+1
| | | | | | | | | | | | Changing printk_in_rcu to printk_ratelimited_in_rcu will suffice Signed-off-by: Josef Bacik <jbacik@fusionio.com>