summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus-4.10' of ↵Linus Torvalds2017-01-145-15/+38
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "These are all over the place. The tracepoint part of the pull fixes a crash and adds a little more information to two tracepoints, while the rest are good old fashioned fixes" * 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: make tracepoint format strings more compact Btrfs: add truncated_len for ordered extent tracepoints Btrfs: add 'inode' for extent map tracepoint btrfs: fix crash when tracepoint arguments are freed by wq callbacks Btrfs: adjust outstanding_extents counter properly when dio write is split Btrfs: fix lockdep warning about log_mutex Btrfs: use down_read_nested to make lockdep silent btrfs: fix locking when we put back a delayed ref that's too new btrfs: fix error handling when run_delayed_extent_op fails btrfs: return the actual error value from from btrfs_uuid_tree_iterate
| * Merge branch 'tracepoint-updates-4.10' of ↵Chris Mason2017-01-112-5/+12
| |\ | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.10
| | * Btrfs: add 'inode' for extent map tracepointLiu Bo2017-01-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | 'inode' is an important field for btrfs_get_extent, lets trace it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * btrfs: fix crash when tracepoint arguments are freed by wq callbacksDavid Sterba2017-01-091-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enabling btrfs tracepoints leads to instant crash, as reported. The wq callbacks could free the memory and the tracepoints started to dereference the members to get to fs_info. The proposed fix https://marc.info/?l=linux-btrfs&m=148172436722606&w=2 removed the tracepoints but we could preserve them by passing only the required data in a safe way. Fixes: bc074524e123 ("btrfs: prefix fsid to all trace events") CC: stable@vger.kernel.org # 4.8+ Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: adjust outstanding_extents counter properly when dio write is splitLiu Bo2017-01-031-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently how btrfs dio deals with split dio write is not good enough if dio write is split into several segments due to the lack of contiguous space, a large dio write like 'dd bs=1G count=1' can end up with incorrect outstanding_extents counter and endio would complain loudly with an assertion. This fixes the problem by compensating the outstanding_extents counter in inode if a large dio write gets split. Reported-by: Anand Jain <anand.jain@oracle.com> Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: fix lockdep warning about log_mutexLiu Bo2017-01-031-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While checking INODE_REF/INODE_EXTREF for a corner case, we may acquire a different inode's log_mutex with holding the current inode's log_mutex, and lockdep has complained this with a possilble deadlock warning. Fix this by using mutex_lock_nested() when processing the other inode's log_mutex. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: use down_read_nested to make lockdep silentLiu Bo2017-01-031-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If @block_group is not @used_bg, it'll try to get @used_bg's lock without droping @block_group 's lock and lockdep has throwed a scary deadlock warning about it. Fix it by using down_read_nested. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: fix locking when we put back a delayed ref that's too newJeff Mahoney2017-01-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In __btrfs_run_delayed_refs, when we put back a delayed ref that's too new, we have already dropped the lock on locked_ref when we set ->processing = 0. This patch keeps the lock to cover that assignment. Fixes: d7df2c796d7 (Btrfs: attach delayed ref updates to delayed ref heads) Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: fix error handling when run_delayed_extent_op failsJeff Mahoney2017-01-031-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In __btrfs_run_delayed_refs, the error path when run_delayed_extent_op fails sets locked_ref->processing = 0 but doesn't re-increment delayed_refs->num_heads_ready. As a result, we end up triggering the WARN_ON in btrfs_select_ref_head. Fixes: d7df2c796d7 (Btrfs: attach delayed ref updates to delayed ref heads) Reported-by: Jon Nelson <jnelson-suse@jamponi.net> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: return the actual error value from from btrfs_uuid_tree_iteratePan Bian2016-12-191-3/+1Star
| |/ | | | | | | | | | | | | | | | | | | | | | | | | In function btrfs_uuid_tree_iterate(), errno is assigned to variable ret on errors. However, it directly returns 0. It may be better to return ret. This patch also removes the warning, because the caller already prints a warning. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188731 Signed-off-by: Pan Bian <bianpan2016@163.com> Reviewed-by: Omar Sandoval <osandov@fb.com> [ edited subject ] Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge uncontroversial parts of branch 'readlink' of ↵Linus Torvalds2016-12-181-1/+0Star
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull partial readlink cleanups from Miklos Szeredi. This is the uncontroversial part of the readlink cleanup patch-set that simplifies the default readlink handling. Miklos and Al are still discussing the rest of the series. * git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: vfs: make generic_readlink() static vfs: remove ".readlink = generic_readlink" assignments vfs: default to generic_readlink() vfs: replace calling i_op->readlink with vfs_readlink() proc/self: use generic_readlink ecryptfs: use vfs_get_link() bad_inode: add missing i_op initializers
| * | vfs: remove ".readlink = generic_readlink" assignmentsMiklos Szeredi2016-12-091-1/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If .readlink == NULL implies generic_readlink(). Generated by: to_del="\.readlink.*=.*generic_readlink" for i in `git grep -l $to_del`; do sed -i "/$to_del"/d $i; done Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2016-12-183-17/+1Star
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "In this pile: - autofs-namespace series - dedupe stuff - more struct path constification" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits) ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features ocfs2: charge quota for reflinked blocks ocfs2: fix bad pointer cast ocfs2: always unlock when completing dio writes ocfs2: don't eat io errors during _dio_end_io_write ocfs2: budget for extent tree splits when adding refcount flag ocfs2: prohibit refcounted swapfiles ocfs2: add newlines to some error messages ocfs2: convert inode refcount test to a helper simple_write_end(): don't zero in short copy into uptodate exofs: don't mess with simple_write_{begin,end} 9p: saner ->write_end() on failing copy into non-uptodate page fix gfs2_stuffed_write_end() on short copies fix ceph_write_end() nfs_write_end(): fix handling of short copies vfs: refactor clone/dedupe_file_range common functions fs: try to clone files first in vfs_copy_file_range vfs: misc struct path constification namespace.c: constify struct path passed to a bunch of primitives quota: constify struct path in quota_on ...
| * \ \ Merge remote-tracking branch 'djwong/ocfs2-vfs-reflink-6' into for-linusAl Viro2016-12-163-16/+0Star
| |\ \ \
| | * | | fs: try to clone files first in vfs_copy_file_rangeChristoph Hellwig2016-12-103-16/+0Star
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A clone is a perfectly fine implementation of a file copy, so most file systems just implement the copy that way. Instead of duplicating this logic move it to the VFS. Currently btrfs and XFS implement copies the same way as clones and there is no behavior change for them, cifs only implements clones and grow support for copy_file_range with this patch. NFS implements both, so this will allow copy_file_range to work on servers that only implement CLONE and be lot more efficient on servers that implements CLONE and COPY. Signed-off-by: Christoph Hellwig <hch@lst.de>
| * / / constify btrfs_mksubvol()Al Viro2016-12-061-1/+1
| |/ / | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge branch 'for-linus-4.10' of ↵Linus Torvalds2016-12-1663-4592/+4682
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "Jeff Mahoney and Dave Sterba have a really nice set of cleanups in here, and Christoph pitched in corrections/improvements to make btrfs use proper helpers for bio walking instead of doing it by hand. There are some key fixes as well, including some long standing bugs that took forever to track down in btrfs_drop_extents and during balance" * 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (77 commits) btrfs: limit async_work allocation and worker func duration Revert "Btrfs: adjust len of writes if following a preallocated extent" Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors btrfs: opencode chunk locking, remove helpers btrfs: remove root parameter from transaction commit/end routines btrfs: split btrfs_wait_marked_extents into normal and tree log functions btrfs: take an fs_info directly when the root is not used otherwise btrfs: simplify btrfs_wait_cache_io prototype btrfs: convert extent-tree tracepoints to use fs_info btrfs: root->fs_info cleanup, access fs_info->delayed_root directly btrfs: root->fs_info cleanup, add fs_info convenience variables btrfs: root->fs_info cleanup, update_block_group{,flags} btrfs: root->fs_info cleanup, lock/unlock_chunks btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size btrfs: pull node/sector/stripe sizes out of root and into fs_info btrfs: root->fs_info cleanup, io_ctl_init btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere btrfs: struct reada_control.root -> reada_control.fs_info btrfs: struct btrfsic_state->root should be an fs_info btrfs: alloc_reserved_file_extent trace point should use extent_root ...
| * | btrfs: limit async_work allocation and worker func durationMaxim Patlasov2016-12-133-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem statement: unprivileged user who has read-write access to more than one btrfs subvolume may easily consume all kernel memory (eventually triggering oom-killer). Reproducer (./mkrmdir below essentially loops over mkdir/rmdir): [root@kteam1 ~]# cat prep.sh DEV=/dev/sdb mkfs.btrfs -f $DEV mount $DEV /mnt for i in `seq 1 16` do mkdir /mnt/$i btrfs subvolume create /mnt/SV_$i ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2` mount -t btrfs -o subvolid=$ID $DEV /mnt/$i chmod a+rwx /mnt/$i done [root@kteam1 ~]# sh prep.sh [maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done [root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done kmalloc-128 10144 10144 128 32 1 : tunables 0 0 0 : slabdata 317 317 0 kmalloc-128 9992352 9992352 128 32 1 : tunables 0 0 0 : slabdata 312261 312261 0 kmalloc-128 24226752 24226752 128 32 1 : tunables 0 0 0 : slabdata 757086 757086 0 kmalloc-128 42754240 42754240 128 32 1 : tunables 0 0 0 : slabdata 1336070 1336070 0 The huge numbers above come from insane number of async_work-s allocated and queued by btrfs_wq_run_delayed_node. The problem is caused by btrfs_wq_run_delayed_node() queuing more and more works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The worker func (btrfs_async_run_delayed_root) processes at least BTRFS_DELAYED_BATCH items (if they are present in the list). So, the machinery works as expected while the list is almost empty. As soon as it is getting bigger, worker func starts to process more than one item at a time, it takes longer, and the chances to have async_works queued more than needed is getting higher. The problem above is worsened by another flaw of delayed-inode implementation: if async_work was queued in a throttling branch (number of items >= BTRFS_DELAYED_WRITEBACK), corresponding worker func won't quit until the number of items < BTRFS_DELAYED_BACKGROUND / 2. So, it is possible that the func occupies CPU infinitely (up to 30sec in my experiments): while the func is trying to drain the list, the user activity may add more and more items to the list. The patch fixes both problems in straightforward way: refuse queuing too many works in btrfs_wq_run_delayed_node and bail out of worker func if at least BTRFS_DELAYED_WRITEBACK items are processed. Changed in v2: remove support of thresh == NO_THRESHOLD. Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com> Signed-off-by: Chris Mason <clm@fb.com> Cc: stable@vger.kernel.org # v3.15+
| * | Merge branch 'for-chris-4.10' of ↵Chris Mason2016-12-136-42/+37Star
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.10 Patches queued up by Filipe: The most important change is still the fix for the extent tree corruption that happens due to balance when qgroups are enabled (a regression introduced in 4.7 by a fix for a regression from the last qgroups rework). This has been hitting SLE and openSUSE users and QA very badly, where transactions keep getting aborted when running delayed references leaving the root filesystem in RO mode and nearly unusable. There are fixes here that allow us to run xfstests again with the integrity checker enabled, which has been impossible since 4.8 (apparently I'm the only one running xfstests with the integrity checker enabled, which is useful to validate dirtied leafs, like checking if there are keys out of order, etc). The rest are just some trivial fixes, most of them tagged for stable, and two cleanups. Signed-off-by: Chris Mason <clm@fb.com>
| | * | Btrfs: fix tree search logic when replaying directory entry deletesRobbie Ko2016-11-301-2/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a log tree has a layout like the following: leaf N: ... item 240 key (282 DIR_LOG_ITEM 0) itemoff 8189 itemsize 8 dir log end 1275809046 leaf N + 1: item 0 key (282 DIR_LOG_ITEM 3936149215) itemoff 16275 itemsize 8 dir log end 18446744073709551615 ... When we pass the value 1275809046 + 1 as the parameter start_ret to the function tree-log.c:find_dir_range() (done by replay_dir_deletes()), we end up with path->slots[0] having the value 239 (points to the last item of leaf N, item 240). Because the dir log item in that position has an offset value smaller than *start_ret (1275809046 + 1) we need to move on to the next leaf, however the logic for that is wrong since it compares the current slot to the number of items in the leaf, which is smaller and therefore we don't lookup for the next leaf but instead we set the slot to point to an item that does not exist, at slot 240, and we later operate on that slot which has unexpected content or in the worst case can result in an invalid memory access (accessing beyond the last page of leaf N's extent buffer). So fix the logic that checks when we need to lookup at the next leaf by first incrementing the slot and only after to check if that slot is beyond the last item of the current leaf. Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Fixes: e02119d5a7b4 (Btrfs: Add a write ahead tree log to optimize synchronous operations) Cc: stable@vger.kernel.org # 2.6.29+ Signed-off-by: Filipe Manana <fdmanana@suse.com> [Modified changelog for clarity and correctness]
| | * | Btrfs: fix deadlock caused by fsync when logging directory entriesRobbie Ko2016-11-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While logging new directory entries, at tree-log.c:log_new_dir_dentries(), after we call btrfs_search_forward() we get a leaf with a read lock on it, and without unlocking that leaf we can end up calling btrfs_iget() to get an inode pointer. The later (btrfs_iget()) can end up doing a read-only search on the same tree again, if the inode is not in memory already, which ends up causing a deadlock if some other task in the meanwhile started a write search on the tree and is attempting to write lock the same leaf that btrfs_search_forward() locked while holding write locks on upper levels of the tree blocking the read search from btrfs_iget(). In this scenario we get a deadlock. So fix this by releasing the search path before calling btrfs_iget() at tree-log.c:log_new_dir_dentries(). Example trace of such deadlock: [ 4077.478852] kworker/u24:10 D ffff88107fc90640 0 14431 2 0x00000000 [ 4077.486752] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] [ 4077.494346] ffff880ffa56bad0 0000000000000046 0000000000009000 ffff880ffa56bfd8 [ 4077.502629] ffff880ffa56bfd8 ffff881016ce21c0 ffffffffa06ecb26 ffff88101a5d6138 [ 4077.510915] ffff880ebb5173b0 ffff880ffa56baf8 ffff880ebb517410 ffff881016ce21c0 [ 4077.519202] Call Trace: [ 4077.528752] [<ffffffffa06ed5ed>] ? btrfs_tree_lock+0xdd/0x2f0 [btrfs] [ 4077.536049] [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30 [ 4077.542574] [<ffffffffa068cc1f>] ? btrfs_search_slot+0x79f/0xb10 [btrfs] [ 4077.550171] [<ffffffffa06a5073>] ? btrfs_lookup_file_extent+0x33/0x40 [btrfs] [ 4077.558252] [<ffffffffa06c600b>] ? __btrfs_drop_extents+0x13b/0xdf0 [btrfs] [ 4077.566140] [<ffffffffa06fc9e2>] ? add_delayed_data_ref+0xe2/0x150 [btrfs] [ 4077.573928] [<ffffffffa06fd629>] ? btrfs_add_delayed_data_ref+0x149/0x1d0 [btrfs] [ 4077.582399] [<ffffffffa06cf3c0>] ? __set_extent_bit+0x4c0/0x5c0 [btrfs] [ 4077.589896] [<ffffffffa06b4a64>] ? insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs] [ 4077.599632] [<ffffffffa06b206d>] ? start_transaction+0x8d/0x470 [btrfs] [ 4077.607134] [<ffffffffa06bab57>] ? btrfs_finish_ordered_io+0x2e7/0x600 [btrfs] [ 4077.615329] [<ffffffff8104cbc2>] ? process_one_work+0x142/0x3d0 [ 4077.622043] [<ffffffff8104d729>] ? worker_thread+0x109/0x3b0 [ 4077.628459] [<ffffffff8104d620>] ? manage_workers.isra.26+0x270/0x270 [ 4077.635759] [<ffffffff81052b0f>] ? kthread+0xaf/0xc0 [ 4077.641404] [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110 [ 4077.648696] [<ffffffff814a9ac8>] ? ret_from_fork+0x58/0x90 [ 4077.654926] [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110 [ 4078.358087] kworker/u24:15 D ffff88107fcd0640 0 14436 2 0x00000000 [ 4078.365981] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] [ 4078.373574] ffff880ffa57fad0 0000000000000046 0000000000009000 ffff880ffa57ffd8 [ 4078.381864] ffff880ffa57ffd8 ffff88103004d0a0 ffffffffa06ecb26 ffff88101a5d6138 [ 4078.390163] ffff880fbeffc298 ffff880ffa57faf8 ffff880fbeffc2f8 ffff88103004d0a0 [ 4078.398466] Call Trace: [ 4078.408019] [<ffffffffa06ed5ed>] ? btrfs_tree_lock+0xdd/0x2f0 [btrfs] [ 4078.415322] [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30 [ 4078.421844] [<ffffffffa068cc1f>] ? btrfs_search_slot+0x79f/0xb10 [btrfs] [ 4078.429438] [<ffffffffa06a5073>] ? btrfs_lookup_file_extent+0x33/0x40 [btrfs] [ 4078.437518] [<ffffffffa06c600b>] ? __btrfs_drop_extents+0x13b/0xdf0 [btrfs] [ 4078.445404] [<ffffffffa06fc9e2>] ? add_delayed_data_ref+0xe2/0x150 [btrfs] [ 4078.453194] [<ffffffffa06fd629>] ? btrfs_add_delayed_data_ref+0x149/0x1d0 [btrfs] [ 4078.461663] [<ffffffffa06cf3c0>] ? __set_extent_bit+0x4c0/0x5c0 [btrfs] [ 4078.469161] [<ffffffffa06b4a64>] ? insert_reserved_file_extent.constprop.75+0xa4/0x320 [btrfs] [ 4078.478893] [<ffffffffa06b206d>] ? start_transaction+0x8d/0x470 [btrfs] [ 4078.486388] [<ffffffffa06bab57>] ? btrfs_finish_ordered_io+0x2e7/0x600 [btrfs] [ 4078.494561] [<ffffffff8104cbc2>] ? process_one_work+0x142/0x3d0 [ 4078.501278] [<ffffffff8104a507>] ? pwq_activate_delayed_work+0x27/0x40 [ 4078.508673] [<ffffffff8104d729>] ? worker_thread+0x109/0x3b0 [ 4078.515098] [<ffffffff8104d620>] ? manage_workers.isra.26+0x270/0x270 [ 4078.522396] [<ffffffff81052b0f>] ? kthread+0xaf/0xc0 [ 4078.528032] [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110 [ 4078.535325] [<ffffffff814a9ac8>] ? ret_from_fork+0x58/0x90 [ 4078.541552] [<ffffffff81052a60>] ? kthread_create_on_node+0x110/0x110 [ 4079.355824] user-space-program D ffff88107fd30640 0 32020 1 0x00000000 [ 4079.363716] ffff880eae8eba10 0000000000000086 0000000000009000 ffff880eae8ebfd8 [ 4079.372003] ffff880eae8ebfd8 ffff881016c162c0 ffffffffa06ecb26 ffff88101a5d6138 [ 4079.380294] ffff880fbed4b4c8 ffff880eae8eba38 ffff880fbed4b528 ffff881016c162c0 [ 4079.388586] Call Trace: [ 4079.398134] [<ffffffffa06ed595>] ? btrfs_tree_lock+0x85/0x2f0 [btrfs] [ 4079.405431] [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30 [ 4079.411955] [<ffffffffa06876fb>] ? btrfs_lock_root_node+0x2b/0x40 [btrfs] [ 4079.419644] [<ffffffffa068ce83>] ? btrfs_search_slot+0xa03/0xb10 [btrfs] [ 4079.427237] [<ffffffffa06aba52>] ? btrfs_buffer_uptodate+0x52/0x70 [btrfs] [ 4079.435041] [<ffffffffa0689b60>] ? generic_bin_search.constprop.38+0x80/0x190 [btrfs] [ 4079.443897] [<ffffffffa068ea44>] ? btrfs_insert_empty_items+0x74/0xd0 [btrfs] [ 4079.451975] [<ffffffffa072c443>] ? copy_items+0x128/0x850 [btrfs] [ 4079.458890] [<ffffffffa072da10>] ? btrfs_log_inode+0x629/0xbf3 [btrfs] [ 4079.466292] [<ffffffffa06f34a1>] ? btrfs_log_inode_parent+0xc61/0xf30 [btrfs] [ 4079.474373] [<ffffffffa06f45a9>] ? btrfs_log_dentry_safe+0x59/0x80 [btrfs] [ 4079.482161] [<ffffffffa06c298d>] ? btrfs_sync_file+0x20d/0x330 [btrfs] [ 4079.489558] [<ffffffff8112777c>] ? do_fsync+0x4c/0x80 [ 4079.495300] [<ffffffff81127a0a>] ? SyS_fdatasync+0xa/0x10 [ 4079.501422] [<ffffffff814a9b72>] ? system_call_fastpath+0x16/0x1b [ 4079.508334] user-space-program D ffff88107fc30640 0 32021 1 0x00000004 [ 4079.516226] ffff880eae8efbf8 0000000000000086 0000000000009000 ffff880eae8effd8 [ 4079.524513] ffff880eae8effd8 ffff881030279610 ffffffffa06ecb26 ffff88101a5d6138 [ 4079.532802] ffff880ebb671d88 ffff880eae8efc20 ffff880ebb671de8 ffff881030279610 [ 4079.541092] Call Trace: [ 4079.550642] [<ffffffffa06ed595>] ? btrfs_tree_lock+0x85/0x2f0 [btrfs] [ 4079.557941] [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30 [ 4079.564463] [<ffffffffa068cc1f>] ? btrfs_search_slot+0x79f/0xb10 [btrfs] [ 4079.572058] [<ffffffffa06bb7d8>] ? btrfs_truncate_inode_items+0x168/0xb90 [btrfs] [ 4079.580526] [<ffffffffa06b04be>] ? join_transaction.isra.15+0x1e/0x3a0 [btrfs] [ 4079.588701] [<ffffffffa06b206d>] ? start_transaction+0x8d/0x470 [btrfs] [ 4079.596196] [<ffffffffa0690ac6>] ? block_rsv_add_bytes+0x16/0x50 [btrfs] [ 4079.603789] [<ffffffffa06bc2e9>] ? btrfs_truncate+0xe9/0x2e0 [btrfs] [ 4079.610994] [<ffffffffa06bd00b>] ? btrfs_setattr+0x30b/0x410 [btrfs] [ 4079.618197] [<ffffffff81117c1c>] ? notify_change+0x1dc/0x680 [ 4079.624625] [<ffffffff8123c8a4>] ? aa_path_perm+0xd4/0x160 [ 4079.630854] [<ffffffff810f4fcb>] ? do_truncate+0x5b/0x90 [ 4079.636889] [<ffffffff810f59fa>] ? do_sys_ftruncate.constprop.15+0x10a/0x160 [ 4079.644869] [<ffffffff8110d87b>] ? SyS_fcntl+0x5b/0x570 [ 4079.650805] [<ffffffff814a9b72>] ? system_call_fastpath+0x16/0x1b [ 4080.410607] user-space-program D ffff88107fc70640 0 32028 12639 0x00000004 [ 4080.418489] ffff880eaeccbbe0 0000000000000086 0000000000009000 ffff880eaeccbfd8 [ 4080.426778] ffff880eaeccbfd8 ffff880f317ef1e0 ffffffffa06ecb26 ffff88101a5d6138 [ 4080.435067] ffff880ef7e93928 ffff880f317ef1e0 ffff880eaeccbc08 ffff880f317ef1e0 [ 4080.443353] Call Trace: [ 4080.452920] [<ffffffffa06ed15d>] ? btrfs_tree_read_lock+0xdd/0x190 [btrfs] [ 4080.460703] [<ffffffff81053680>] ? wake_up_atomic_t+0x30/0x30 [ 4080.467225] [<ffffffffa06876bb>] ? btrfs_read_lock_root_node+0x2b/0x40 [btrfs] [ 4080.475400] [<ffffffffa068cc81>] ? btrfs_search_slot+0x801/0xb10 [btrfs] [ 4080.482994] [<ffffffffa06b2df0>] ? btrfs_clean_one_deleted_snapshot+0xe0/0xe0 [btrfs] [ 4080.491857] [<ffffffffa06a70a6>] ? btrfs_lookup_inode+0x26/0x90 [btrfs] [ 4080.499353] [<ffffffff810ec42f>] ? kmem_cache_alloc+0xaf/0xc0 [ 4080.505879] [<ffffffffa06bd905>] ? btrfs_iget+0xd5/0x5d0 [btrfs] [ 4080.512696] [<ffffffffa06caf04>] ? btrfs_get_token_64+0x104/0x120 [btrfs] [ 4080.520387] [<ffffffffa06f341f>] ? btrfs_log_inode_parent+0xbdf/0xf30 [btrfs] [ 4080.528469] [<ffffffffa06f45a9>] ? btrfs_log_dentry_safe+0x59/0x80 [btrfs] [ 4080.536258] [<ffffffffa06c298d>] ? btrfs_sync_file+0x20d/0x330 [btrfs] [ 4080.543657] [<ffffffff8112777c>] ? do_fsync+0x4c/0x80 [ 4080.549399] [<ffffffff81127a0a>] ? SyS_fdatasync+0xa/0x10 [ 4080.555534] [<ffffffff814a9b72>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Fixes: 2f2ff0ee5e43 (Btrfs: fix metadata inconsistencies after directory fsync) Cc: stable@vger.kernel.org # 4.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> [Modified changelog for clarity and correctness]
| | * | Btrfs: fix enospc in hole punchingRobbie Ko2016-11-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The hole punching can result in adding new leafs (and as a consequence new nodes) to the tree because when we find file extent items that span beyond the hole range we may end up not deleting them (just adjusting them, reducing their range by reducing their length or increasing their offset field) and add new file extent items representing holes. So after splitting a leaf (therefore creating a new one) to insert a new file extent item representing a hole, a new node might be added to each level of the tree in the worst case scenario (since there's a new key and every parent node was full). For example if a file has an extent item representing the range 0 to 64Mb and we punch a hole in the range 1Mb to 20Mb, the existing extent item is duplicated and one of the copies is adjusted to represent the range 0 to 1Mb, the other copy adjusted to represent the range 20Mb to 64Mb, and a new file extent item representing a hole in the range 1Mb to 20Mb is inserted. Fix this by using btrfs_calc_trans_metadata_size() instead of btrfs_calc_trunc_metadata_size(), so that enough metadata space is reserved for the worst possible case. Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> [Modified changelog for clarity and correctness]
| | * | Btrfs: fix qgroup rescan worker initializationFilipe Manana2016-11-251-4/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We were setting the qgroup_rescan_running flag to true only after the rescan worker started (which is a task run by a queue). So if a user space task starts a rescan and immediately after asks to wait for the rescan worker to finish, this second call might happen before the rescan worker task starts running, in which case the rescan wait ioctl returns immediatley, not waiting for the rescan worker to finish. This was making the fstest btrfs/022 fail very often. Fixes: d2c609b834d6 (btrfs: properly track when rescan worker is running) Cc: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>
| | * | Btrfs: fix emptiness check for dirtied extent buffers at check_leaf()Filipe Manana2016-11-231-1/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can not simply use the owner field from an extent buffer's header to get the id of the respective tree when the extent buffer is from a relocation tree. When we create the root for a relocation tree we leave (on purpose) the owner field with the same value as the subvolume's tree root (we do this at ctree.c:btrfs_copy_root()). So we must ignore extent buffers from relocation trees, which have the BTRFS_HEADER_FLAG_RELOC flag set, because otherwise we will always consider the extent buffer as not being the root of the tree (the root of original subvolume tree is always different from the root of the respective relocation tree). This lead to assertion failures when running with the integrity checker enabled (CONFIG_BTRFS_FS_CHECK_INTEGRITY=y) such as the following: [ 643.393409] BTRFS critical (device sdg): corrupt leaf, non-root leaf's nritems is 0: block=38506496, root=260, slot=0 [ 643.397609] BTRFS info (device sdg): leaf 38506496 total ptrs 0 free space 3995 [ 643.407075] assertion failed: 0, file: fs/btrfs/disk-io.c, line: 4078 [ 643.408425] ------------[ cut here ]------------ [ 643.409112] kernel BUG at fs/btrfs/ctree.h:3419! [ 643.409773] invalid opcode: 0000 [#1] PREEMPT SMP [ 643.410447] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq ppdev psmouse acpi_cpufreq parport_pc evdev parport tpm_tis tpm_tis_core pcspkr serio_raw i2c_piix4 sg tpm i2c_core button processor loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring scsi_mod virtio e1000 floppy [ 643.414356] CPU: 11 PID: 32726 Comm: btrfs Not tainted 4.8.0-rc8-btrfs-next-35+ #1 [ 643.414356] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 643.414356] task: ffff880145e95b00 task.stack: ffff88014826c000 [ 643.414356] RIP: 0010:[<ffffffffa0352759>] [<ffffffffa0352759>] assfail.constprop.41+0x1c/0x1e [btrfs] [ 643.414356] RSP: 0018:ffff88014826fa28 EFLAGS: 00010292 [ 643.414356] RAX: 0000000000000039 RBX: ffff88014e2d7c38 RCX: 0000000000000001 [ 643.414356] RDX: ffff88023f4d2f58 RSI: ffffffff81806c63 RDI: 00000000ffffffff [ 643.414356] RBP: ffff88014826fa28 R08: 0000000000000001 R09: 0000000000000000 [ 643.414356] R10: ffff88014826f918 R11: ffffffff82f3c5ed R12: ffff880172910000 [ 643.414356] R13: ffff880233992230 R14: ffff8801a68a3310 R15: fffffffffffffff8 [ 643.414356] FS: 00007f9ca305e8c0(0000) GS:ffff88023f4c0000(0000) knlGS:0000000000000000 [ 643.414356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 643.414356] CR2: 00007f9ca3071000 CR3: 000000015d01b000 CR4: 00000000000006e0 [ 643.414356] Stack: [ 643.414356] ffff88014826fa50 ffffffffa02d655a 000000000000000a ffff88014e2d7c38 [ 643.414356] 0000000000000000 ffff88014826faa8 ffffffffa02b72f3 ffff88014826fab8 [ 643.414356] 00ffffffa03228e4 0000000000000000 0000000000000000 ffff8801bbd4e000 [ 643.414356] Call Trace: [ 643.414356] [<ffffffffa02d655a>] btrfs_mark_buffer_dirty+0xdf/0xe5 [btrfs] [ 643.414356] [<ffffffffa02b72f3>] btrfs_copy_root+0x18a/0x1d1 [btrfs] [ 643.414356] [<ffffffffa0322921>] create_reloc_root+0x72/0x1ba [btrfs] [ 643.414356] [<ffffffffa03267c2>] btrfs_init_reloc_root+0x7b/0xa7 [btrfs] [ 643.414356] [<ffffffffa02d9e44>] record_root_in_trans+0xdf/0xed [btrfs] [ 643.414356] [<ffffffffa02db04e>] btrfs_record_root_in_trans+0x50/0x6a [btrfs] [ 643.414356] [<ffffffffa030ad2b>] create_subvol+0x472/0x773 [btrfs] [ 643.414356] [<ffffffffa030b406>] btrfs_mksubvol+0x3da/0x463 [btrfs] [ 643.414356] [<ffffffffa030b406>] ? btrfs_mksubvol+0x3da/0x463 [btrfs] [ 643.414356] [<ffffffff810781ac>] ? preempt_count_add+0x65/0x68 [ 643.414356] [<ffffffff811a6e97>] ? __mnt_want_write+0x62/0x77 [ 643.414356] [<ffffffffa030b55d>] btrfs_ioctl_snap_create_transid+0xce/0x187 [btrfs] [ 643.414356] [<ffffffffa030b67d>] btrfs_ioctl_snap_create+0x67/0x81 [btrfs] [ 643.414356] [<ffffffffa030ecfd>] btrfs_ioctl+0x508/0x20dd [btrfs] [ 643.414356] [<ffffffff81293e39>] ? __this_cpu_preempt_check+0x13/0x15 [ 643.414356] [<ffffffff81155eca>] ? handle_mm_fault+0x976/0x9ab [ 643.414356] [<ffffffff81091300>] ? arch_local_irq_save+0x9/0xc [ 643.414356] [<ffffffff8119a2b0>] vfs_ioctl+0x18/0x34 [ 643.414356] [<ffffffff8119a8e8>] do_vfs_ioctl+0x581/0x600 [ 643.414356] [<ffffffff814b9552>] ? entry_SYSCALL_64_fastpath+0x5/0xa8 [ 643.414356] [<ffffffff81093fe9>] ? trace_hardirqs_on_caller+0x17b/0x197 [ 643.414356] [<ffffffff8119a9be>] SyS_ioctl+0x57/0x79 [ 643.414356] [<ffffffff814b9565>] entry_SYSCALL_64_fastpath+0x18/0xa8 [ 643.414356] [<ffffffff81091b08>] ? trace_hardirqs_off_caller+0x3f/0xaa [ 643.414356] Code: 89 83 88 00 00 00 31 c0 5b 41 5c 41 5d 5d c3 55 89 f1 48 c7 c2 98 bc 35 a0 48 89 fe 48 c7 c7 05 be 35 a0 48 89 e5 e8 13 46 dd e0 <0f> 0b 55 89 f1 48 c7 c2 9f d3 35 a0 48 89 fe 48 c7 c7 7a d5 35 [ 643.414356] RIP [<ffffffffa0352759>] assfail.constprop.41+0x1c/0x1e [btrfs] [ 643.414356] RSP <ffff88014826fa28> [ 643.468267] ---[ end trace 6a1b3fb1a9d7d6e3 ]--- This can be easily reproduced by running xfstests with the integrity checker enabled. Fixes: 1ba98d086fe3 (Btrfs: detect corruption when non-root leaf has zero item) Cc: stable@vger.kernel.org # 4.8+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
| | * | Btrfs: fix BUG_ON in btrfs_mark_buffer_dirtyLiu Bo2016-11-231-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This can only happen with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y. Commit 1ba98d0 ("Btrfs: detect corruption when non-root leaf has zero item") assumes that a leaf is its root when leaf->bytenr == btrfs_root_bytenr(root), however, we should not use btrfs_root_bytenr(root) since it's mainly got updated during committing transaction. So the check can fail when doing COW on this leaf while it is a root. This changes to use "if (leaf == btrfs_root_node(root))" instead, just like how we check whether leaf is a root in __btrfs_cow_block(). Fixes: 1ba98d086fe3 (Btrfs: detect corruption when non-root leaf has zero item) Cc: stable@vger.kernel.org # 4.8+ Reported-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com>
| | * | Btrfs: remove rb_node field from the delayed ref node structureFilipe Manana2016-11-191-6/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the last big change in the delayed references code that was needed for the last qgroups rework, the red black tree node field of struct btrfs_delayed_ref_node is no longer used, so just remove it, this helps us save some memory (since struct rb_node is 24 bytes on x86_64) for these structures. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | * | Btrfs: remove unused code when creating and merging reloc treesFilipe Manana2016-11-191-19/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 5bc7247ac47c (Btrfs: fix broken nocow after balance) we started abusing the rtransid and otransid fields of root items from relocation trees to fix some issues with nodatacow mode. However later in commit ba8b0289333a (Btrfs: do not reset last_snapshot after relocation) we dropped the code that made use of those fields but did not remove the code that sets those fields. So just remove them to avoid confusion. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
| | * | Btrfs: fix relocation incorrectly dropping data referencesFilipe Manana2016-11-191-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During relocation of a data block group we create a relocation tree for each fs/subvol tree by making a snapshot of each tree using btrfs_copy_root() and the tree's commit root, and then setting the last snapshot field for the fs/subvol tree's root to the value of the current transaction id minus 1. However this can lead to relocation later dropping references that it did not create if we have qgroups enabled, leaving the filesystem in an inconsistent state that keeps aborting transactions. Lets consider the following example to explain the problem, which requires qgroups to be enabled. We are relocating data block group Y, we have a subvolume with id 258 that has a root at level 1, that subvolume is used to store directory entries for snapshots and we are currently at transaction 3404. When committing transaction 3404, we have a pending snapshot and therefore we call btrfs_run_delayed_items() at transaction.c:create_pending_snapshot() in order to create its dentry at subvolume 258. This results in COWing leaf A from root 258 in order to add the dentry. Note that leaf A also contains file extent items referring to extents from some other block group X (we are currently relocating block group Y). Later on, still at create_pending_snapshot() we call qgroup_account_snapshot(), which switches the commit root for root 258 when it calls switch_commit_roots(), so now the COWed version of leaf A, lets call it leaf A', is accessible from the commit root of tree 258. At the end of qgroup_account_snapshot(), we call record_root_in_trans() with 258 as its argument, which results in btrfs_init_reloc_root() being called, which in turn calls relocation.c:create_reloc_root() in order to create a relocation tree associated to root 258, which results in assigning the value of 3403 (which is the current transaction id minus 1 = 3404 - 1) to the last_snapshot field of root 258. When creating the relocation tree root at ctree.c:btrfs_copy_root() we add a shared reference for leaf A', corresponding to the relocation tree's root, when we call btrfs_inc_ref() against the COWed root (a copy of the commit root from tree 258), which is at level 1. So at this point leaf A' has 2 references, one normal reference corresponding to root 258 and one shared reference corresponding to the root of the relocation tree. Transaction 3404 finishes its commit and transaction 3405 is started by relocation when calling merge_reloc_root() for the relocation tree associated to root 258. In the meanwhile leaf A' is COWed again, in response to some filesystem operation, when we are still at transaction 3405. However when we COW leaf A', at ctree.c:update_ref_for_cow(), we call btrfs_block_can_be_shared() in order to figure out if other trees refer to the leaf and if any such trees exists, add a full back reference to leaf A' - but btrfs_block_can_be_shared() incorrectly returns false because the following condition is false: btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item) which evaluates to 3404 <= 3403. So after leaf A' is COWed, it stays with only one reference, corresponding to the shared reference we created when we called btrfs_copy_root() to create the relocation tree's root and btrfs_inc_ref() ends up not being called for leaf A' nor we end up setting the flag BTRFS_BLOCK_FLAG_FULL_BACKREF in leaf A'. This results in not adding shared references for the extents from block group X that leaf A' refers to with its file extent items. Later, after merging the relocation root we do a call to to btrfs_drop_snapshot() in order to delete the relocation tree. This ends up calling do_walk_down() when path->slots[1] points to leaf A', which results in calling btrfs_lookup_extent_info() to get the number of references for leaf A', which is 1 at this time (only the shared reference exists) and this value is stored at wc->refs[0]. After this walk_up_proc() is called when wc->level is 0 and path->nodes[0] corresponds to leaf A'. Because the current level is 0 and wc->refs[0] is 1, it does call btrfs_dec_ref() against leaf A', which results in removing the single references that the extents from block group X have which are associated to root 258 - the expectation was to have each of these extents with 2 references - one reference for root 258 and one shared reference related to the root of the relocation tree, and so we would drop only the shared reference (because leaf A' was supposed to have the flag BTRFS_BLOCK_FLAG_FULL_BACKREF set). This leaves the filesystem in an inconsistent state as we now have file extent items in a subvolume tree that point to extents from block group X without references in the extent tree. So later on when we try to decrement the references for these extents, for example due to a file unlink operation, truncate operation or overwriting ranges of a file, we fail because the expected references do not exist in the extent tree. This leads to warnings and transaction aborts like the following: [ 588.965795] ------------[ cut here ]------------ [ 588.965815] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:1625 lookup_inline_extent_backref+0x432/0x5b0 [btrfs] [ 588.965816] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg [ 588.965831] CPU: 2 PID: 2479 Comm: kworker/u8:7 Not tainted 4.7.3-3-default-fdm+ #1 [ 588.965832] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 588.965844] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] [ 588.965845] 0000000000000000 ffff8802263bfa28 ffffffff813af542 0000000000000000 [ 588.965847] 0000000000000000 ffff8802263bfa68 ffffffff81081e8b 0000065900000000 [ 588.965848] ffff8801db2af000 000000012bbe2000 0000000000000000 ffff880215703b48 [ 588.965849] Call Trace: [ 588.965852] [<ffffffff813af542>] dump_stack+0x63/0x81 [ 588.965854] [<ffffffff81081e8b>] __warn+0xcb/0xf0 [ 588.965855] [<ffffffff81081f7d>] warn_slowpath_null+0x1d/0x20 [ 588.965863] [<ffffffffa0175042>] lookup_inline_extent_backref+0x432/0x5b0 [btrfs] [ 588.965865] [<ffffffff81143220>] ? trace_clock_local+0x10/0x30 [ 588.965867] [<ffffffff8114c5df>] ? rb_reserve_next_event+0x6f/0x460 [ 588.965875] [<ffffffffa0175215>] insert_inline_extent_backref+0x55/0xd0 [btrfs] [ 588.965882] [<ffffffffa017531f>] __btrfs_inc_extent_ref.isra.55+0x8f/0x240 [btrfs] [ 588.965890] [<ffffffffa017acea>] __btrfs_run_delayed_refs+0x74a/0x1260 [btrfs] [ 588.965892] [<ffffffff810cb046>] ? cpuacct_charge+0x86/0xa0 [ 588.965900] [<ffffffffa017e74f>] btrfs_run_delayed_refs+0x9f/0x2c0 [btrfs] [ 588.965908] [<ffffffffa017ea04>] delayed_ref_async_start+0x94/0xb0 [btrfs] [ 588.965918] [<ffffffffa01c799a>] btrfs_scrubparity_helper+0xca/0x350 [btrfs] [ 588.965928] [<ffffffffa01c7c5e>] btrfs_extent_refs_helper+0xe/0x10 [btrfs] [ 588.965930] [<ffffffff8109b323>] process_one_work+0x1f3/0x4e0 [ 588.965931] [<ffffffff8109b658>] worker_thread+0x48/0x4e0 [ 588.965932] [<ffffffff8109b610>] ? process_one_work+0x4e0/0x4e0 [ 588.965934] [<ffffffff810a1659>] kthread+0xc9/0xe0 [ 588.965936] [<ffffffff816f2f1f>] ret_from_fork+0x1f/0x40 [ 588.965937] [<ffffffff810a1590>] ? kthread_worker_fn+0x170/0x170 [ 588.965938] ---[ end trace 34e5232c933a1749 ]--- [ 588.966187] ------------[ cut here ]------------ [ 588.966196] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:2966 btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs] [ 588.966196] BTRFS: Transaction aborted (error -5) [ 588.966197] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg [ 588.966206] CPU: 2 PID: 2479 Comm: kworker/u8:7 Tainted: G W 4.7.3-3-default-fdm+ #1 [ 588.966207] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 588.966217] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] [ 588.966217] 0000000000000000 ffff8802263bfc98 ffffffff813af542 ffff8802263bfce8 [ 588.966219] 0000000000000000 ffff8802263bfcd8 ffffffff81081e8b 00000b96345ee000 [ 588.966220] ffffffffa021ae1c ffff880215703b48 00000000000005fe ffff8802345ee000 [ 588.966221] Call Trace: [ 588.966223] [<ffffffff813af542>] dump_stack+0x63/0x81 [ 588.966224] [<ffffffff81081e8b>] __warn+0xcb/0xf0 [ 588.966225] [<ffffffff81081eff>] warn_slowpath_fmt+0x4f/0x60 [ 588.966233] [<ffffffffa017e93c>] btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs] [ 588.966241] [<ffffffffa017ea04>] delayed_ref_async_start+0x94/0xb0 [btrfs] [ 588.966250] [<ffffffffa01c799a>] btrfs_scrubparity_helper+0xca/0x350 [btrfs] [ 588.966259] [<ffffffffa01c7c5e>] btrfs_extent_refs_helper+0xe/0x10 [btrfs] [ 588.966260] [<ffffffff8109b323>] process_one_work+0x1f3/0x4e0 [ 588.966261] [<ffffffff8109b658>] worker_thread+0x48/0x4e0 [ 588.966263] [<ffffffff8109b610>] ? process_one_work+0x4e0/0x4e0 [ 588.966264] [<ffffffff810a1659>] kthread+0xc9/0xe0 [ 588.966265] [<ffffffff816f2f1f>] ret_from_fork+0x1f/0x40 [ 588.966267] [<ffffffff810a1590>] ? kthread_worker_fn+0x170/0x170 [ 588.966268] ---[ end trace 34e5232c933a174a ]--- [ 588.966269] BTRFS: error (device sda2) in btrfs_run_delayed_refs:2966: errno=-5 IO failure [ 588.966270] BTRFS info (device sda2): forced readonly This was happening often on openSUSE and SLE systems using btrfs as the root filesystem (with its default layout where multiple subvolumes are used) where balance happens in the background triggered by a cron job and snapshots are automatically created before/after package installations, upgrades and removals. The issue could be triggered simply by running the following loop on the first system boot post installation: while true; do zypper -n in nfs-kernel-server zypper -n rm nfs-kernel-server done (If we were fast enough and made that loop before the cron job triggered a balance operation and the balance finished) So fix by setting the last_snapshot field of the root to the value of the generation of its commit root. Like this btrfs_block_can_be_shared() behaves correctly for the case where the relocation root is created during a transaction commit and for the case where it's created before a transaction commit. Fixes: 6426c7ad697d (btrfs: qgroup: Fix qgroup accounting when creating snapshot) Cc: stable@vger.kernel.org # 4.7+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
| | * | Merge branch 'for-4.9-rc3' of ↵Chris Mason2016-11-015-9/+29
| | |\ \ | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.9
| * | | | Revert "Btrfs: adjust len of writes if following a preallocated extent"Chris Mason2016-12-121-5/+3Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is exposing an existing deadlock between fsync and AIO. Until we have the deadlock fixed, I'm pulling this one out. This reverts commit a23eaa875f0f1d89eb866b8c9860e78273ff5daf. Signed-off-by: Chris Mason <clm@fb.com>
| * | | | Btrfs: don't WARN() in btrfs_transaction_abort() for IO errorsChris Mason2016-12-091-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_transaction_abort() has a WARN() to help us nail down whatever problem lead to the abort. But most of the time, we're aborting for EIO, and the warning just adds noise. Signed-off-by: Chris Mason <clm@fb.com>
| * | | | btrfs: opencode chunk locking, remove helpersDavid Sterba2016-12-065-53/+43Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The helpers are trivial and we don't use them consistently. Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: remove root parameter from transaction commit/end routinesJeff Mahoney2016-12-0622-211/+188Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now we only use the root parameter to print the root objectid in a tracepoint. We can use the root parameter from the transaction handle for that. It's also used to join the transaction with async commits, so we remove the comment that it's just for checking. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: split btrfs_wait_marked_extents into normal and tree log functionsJeff Mahoney2016-12-063-40/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_write_and_wait_marked_extents and btrfs_sync_log both call btrfs_wait_marked_extents, which provides a core loop and then handles errors differently based on whether it's it's a log root or not. This means that btrfs_write_and_wait_marked_extents needs to take a root because btrfs_wait_marked_extents requires one, even though it's only used to determine whether the root is a log root. The log root code won't ever call into the transaction commit code using a log root, so we can factor out the core loop and provide the error handling appropriate to each waiter in new routines. This allows us to eventually remove the root argument from btrfs_commit_transaction, and as a result, btrfs_end_transaction. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: take an fs_info directly when the root is not used otherwiseJeff Mahoney2016-12-0644-1243/+1119Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are loads of functions in btrfs that accept a root parameter but only use it to obtain an fs_info pointer. Let's convert those to just accept an fs_info pointer directly. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: simplify btrfs_wait_cache_io prototypeJeff Mahoney2016-12-063-27/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the exception of the one case where btrfs_wait_cache_io is called without a block group, it's called with the same arguments. The root argument is only used in the special case, so let's factor out the core and simplify the call in the normal case to require a trans, block group, and path. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: convert extent-tree tracepoints to use fs_infoJeff Mahoney2016-12-061-9/+8Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The extent-tree tracepoints all operate on the extent root, regardless of which root is passed in. Let's just use the extent root objectid instead. If it turns out that nobody is depending on the format of this tracepoint, we can drop the root printing entirely. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: root->fs_info cleanup, access fs_info->delayed_root directlyJeff Mahoney2016-12-064-24/+13Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This results in btrfs_assert_delayed_root_empty and btrfs_destroy_delayed_inode taking an fs_info instead of a root. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: root->fs_info cleanup, add fs_info convenience variablesJeff Mahoney2016-12-0633-2016/+2251
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In routines where someptr->fs_info is referenced multiple times, we introduce a convenience variable. This makes the code considerably more readable. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: root->fs_info cleanup, update_block_group{,flags}Jeff Mahoney2016-12-061-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: root->fs_info cleanup, lock/unlock_chunksJeff Mahoney2016-12-065-47/+47
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_sizeJeff Mahoney2016-12-069-32/+37
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: pull node/sector/stripe sizes out of root and into fs_infoJeff Mahoney2016-12-0639-414/+432
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We track the node sizes per-root, but they never vary from the values in the superblock. This patch messes with the 80-column style a bit, but subsequent patches to factor out root->fs_info into a convenience variable fix it up again. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: root->fs_info cleanup, io_ctl_initJeff Mahoney2016-12-062-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The io_ctl->root member was only being used to access root->fs_info. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: root->fs_info cleanup, use fs_info->dev_root everywhereJeff Mahoney2016-12-066-70/+68Star
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: struct reada_control.root -> reada_control.fs_infoJeff Mahoney2016-12-062-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The root is never used. We substitute extent_root in for the reada_find_extent call, since it's only ever used to obtain the node size. This call site will be changed to use fs_info in a later patch. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: struct btrfsic_state->root should be an fs_infoJeff Mahoney2016-12-061-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The root member is never used except for obtaining an fs_info pointer. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: alloc_reserved_file_extent trace point should use extent_rootJeff Mahoney2016-12-061-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even though a separate root is passed in, we're still operating on the extent root. Let's use that for the trace point. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: btrfs_init_new_device should use fs_info->dev_rootJeff Mahoney2016-12-063-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_init_new_device only uses the root passed in via the ioctl to start the transaction. Nothing else that happens is related to whatever root the user used to initiate the ioctl. We can drop the root requirement and just use fs_info->dev_root instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: call functions that always use the same root with fs_info insteadJeff Mahoney2016-12-0613-94/+107
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are many functions that are always called with the same root argument. Rather than passing the same root every time, we can pass an fs_info pointer instead and have the function get the root pointer itself. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>