summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/disk-io.c
Commit message (Collapse)AuthorAgeFilesLines
* Btrfs: add parent_transid parameter to veirfy_level_keyLiu Bo2018-05-301-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | As verify_level_key() is checked after verify_parent_transid(), i.e. if (verify_parent_transid()) ret = -EIO; else if (verify_level_key()) ret = -EUCLEAN; if parent_transid is 0, verify_parent_transid() skips verifying parent_transid and considers eb as valid, and if verify_level_key() reports something wrong, we're not going to know if it's caused by corrupted metadata or non-checkecd eb (e.g. stale eb). The stale eb can be from an outdated raid1 mirror after a degraded mount, see eg "btrfs: fix reading stale metadata blocks after degraded raid1 mounts" (02a3307aa9c20b4f66262) for more details. @parent_transid is able to tell whether the eb's generation has been verified by the caller. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: get rid of unused orphan infrastructureOmar Sandoval2018-05-281-9/+0Star
| | | | | | | | | | | | | | | | Now that we don't keep long-standing reservations for orphan items, root->orphan_block_rsv isn't used. We can git rid of it, along with: - root->orphan_lock, which was used to protect root->orphan_block_rsv - root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv - BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations in root->orphan_block_rsv - btrfs_orphan_commit_root(), which was the last user of any of these and does nothing else Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Do super block verification before writing it to diskQu Wenruo2018-05-281-0/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are already 2 reports about strangely corrupted super blocks, where csum still matches but extra garbage gets slipped into super block. The corruption would looks like: ------ superblock: bytenr=65536, device=/dev/sdc1 --------------------------------------------------------- csum_type 41700 (INVALID) csum 0x3b252d3a [match] bytenr 65536 flags 0x1 ( WRITTEN ) magic _BHRfS_M [match] ... incompat_flags 0x5b22400000000169 ( MIXED_BACKREF | COMPRESS_LZO | BIG_METADATA | EXTENDED_IREF | SKINNY_METADATA | unknown flag: 0x5b22400000000000 ) ... ------ Or ------ superblock: bytenr=65536, device=/dev/mapper/x --------------------------------------------------------- csum_type 35355 (INVALID) csum_size 32 csum 0xf0dbeddd [match] bytenr 65536 flags 0x1 ( WRITTEN ) magic _BHRfS_M [match] ... incompat_flags 0x176d200000000169 ( MIXED_BACKREF | COMPRESS_LZO | BIG_METADATA | EXTENDED_IREF | SKINNY_METADATA | unknown flag: 0x176d200000000000 ) ------ Obviously, csum_type and incompat_flags get some garbage, but its csum still matches, which means kernel calculates the csum based on corrupted super block memory. And after manually fixing these values, the filesystem is completely healthy without any problem exposed by btrfs check. Although the cause is still unknown, at least detect it and prevent further corruption. Both reports have same symptoms, there's an overwrite on offset 192 of the superblock, by 4 bytes. The superblock structure is not allocated or freed and stays in the memory for the whole filesystem lifetime, so it's not a use-after-free kind of error on someone else's leaked page. As a vague point for the problable cause is mentioning of other system freezing related to graphic card drivers. Reported-by: Ken Swenson <flat@imo.uto.moe> Reported-by: Ben Parsons <9parsonsb@gmail.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add brief analysis of the reports ] Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Refactor btrfs_check_super_validQu Wenruo2018-05-281-4/+25
| | | | | | | | | | | | | | | | | | | Refactor btrfs_check_super_valid: 1) Rename it to btrfs_validate_mount_super() Now it's more obvious when the function should be called. 2) Extract core check routine into validate_super() Later write time check can reuse it, and if needed, we could also use validate_super() to check each super block. 3) Add more comments about btrfs_validate_mount_super() Mostly about what it doesn't check and when it should be called. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename to validate_super ] Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Move btrfs_check_super_valid() to avoid forward declarationQu Wenruo2018-05-281-150/+149Star
| | | | | | | | | | | | | Move btrfs_check_super_valid() before its single caller to avoid forward declaration. Though such code motion is not recommended as it pollutes git history, in this case the following patches would need to add new forward declarations for static functions that we want to avoid. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: move btrfs_raid_group values to btrfs_raid_attr tableAnand Jain2018-05-281-1/+1
| | | | | | | | | | Add a new member struct btrfs_raid_attr::bg_flag so that btrfs_raid_array can maintain the bit map flag of the raid type, and so we can drop btrfs_raid_group. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: track running balance in a simpler wayDavid Sterba2018-05-281-1/+0Star
| | | | | | | | | | | | | Currently fs_info::balance_running is 0 or 1 and does not use the semantics of atomics. The pause and cancel check for 0, that can happen only after __btrfs_balance exits for whatever reason. Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple times but will block on the balance_mutex that protects the fs_info::flags bit. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: kill btrfs_fs_info::volume_mutexDavid Sterba2018-05-281-1/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mutual exclusion of device add/rm and balance was done by the volume mutex up to version 3.7. The commit 5ac00addc7ac091109 ("Btrfs: disallow mutually exclusive admin operations from user mode") added a bit that essentially tracked the same information. The status bit has an advantage over a mutex that it can be set without restrictions of function context, so it started to be used in the mount-time resuming of balance or device replace. But we don't really need to track the same information in two ways. 1) After the previous cleanups, the main ioctl handlers for add/del/resize copy the EXCL_OP bit next to the volume mutex, here it's clearly safe. 2) Resuming balance during mount or after rw remount will set only the EXCL_OP bit and the volume_mutex is held in the kernel thread that calls btrfs_balance. 3) Resuming device replace during mount or after rw remount is done after balance and is excluded by the EXCL_OP bit. It does not take the volume_mutex at all and completely relies on the EXCL_OP bit. 4) The resuming of balance and dev-replace cannot hapen at the same time as the ioctls cannot be started in parallel. Nevertheless, a crafted image could trigger that and a warning is printed. 5) Balance is normally excluded by EXCL_OP and also uses own mutex to protect against concurrent access to its status data. There's some trickery to maintain the right lock nesting in case we need to reexamine the status in btrfs_ioctl_balance. The volume_mutex is removed and the unlock/lock sequence is left in place as we might expect other waiters to proceed. 6) Similar to 5, the unlock/lock sequence is kept in btrfs_cancel_balance to allow waiters to continue. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Fix delalloc inodes invalidation during transaction abortNikolay Borisov2018-05-171-11/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a transaction is aborted btrfs_cleanup_transaction is called to cleanup all the various in-flight bits and pieces which migth be active. One of those is delalloc inodes - inodes which have dirty pages which haven't been persisted yet. Currently the process of freeing such delalloc inodes in exceptional circumstances such as transaction abort boiled down to calling btrfs_invalidate_inodes whose sole job is to invalidate the dentries for all inodes related to a root. This is in fact wrong and insufficient since such delalloc inodes will likely have pending pages or ordered-extents and will be linked to the sb->s_inode_list. This means that unmounting a btrfs instance with an aborted transaction could potentially lead inodes/their pages visible to the system long after their superblock has been freed. This in turn leads to a "use-after-free" situation once page shrink is triggered. This situation could be simulated by running generic/019 which would cause such inodes to be left hanging, followed by generic/176 which causes memory pressure and page eviction which lead to touching the freed super block instance. This situation is additionally detected by the unmount code of VFS with the following message: "VFS: Busy inodes after unmount of Self-destruct in 5 seconds. Have a nice day..." Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree)); in free_fs_root for the same reason. This patch aims to rectify the sitaution by doing the following: 1. Change btrfs_destroy_delalloc_inodes so that it calls invalidate_inode_pages2 for every inode on the delalloc list, this ensures that all the pages of the inode are released. This function boils down to calling btrfs_releasepage. During test I observed cases where inodes on the delalloc list were having an i_count of 0, so this necessitates using igrab to be sure we are working on a non-freed inode. 2. Since calling btrfs_releasepage might queue delayed iputs move the call out to btrfs_cleanup_transaction in btrfs_error_commit_super before calling run_delayed_iputs for the last time. This is necessary to ensure that delayed iputs are run. Note: this patch is tagged for 4.14 stable but the fix applies to older versions too but needs to be backported manually due to conflicts. CC: stable@vger.kernel.org # 4.14.x: 2b8773313494: btrfs: Split btrfs_del_delalloc_inode into 2 functions CC: stable@vger.kernel.org # 4.14.x Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment to igrab ] Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: qgroup: Commit transaction in advance to reduce early EDQUOTQu Wenruo2018-04-181-0/+1
| | | | | | | | | | | | | | | | | | | Unlike previous method that tries to commit transaction inside qgroup_reserve(), this time we will try to commit transaction using fs_info->transaction_kthread to avoid nested transaction and no need to worry about locking context. Since it's an asynchronous function call and we won't wait for transaction commit, unlike previous method, we must call it before we hit the qgroup limit. So this patch will use the ratio and size of qgroup meta_pertrans reservation as indicator to check if we should trigger a transaction commit. (meta_prealloc won't be cleaned in transaction committ, it's useless anyway) Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Only check first key for committed tree blocksQu Wenruo2018-04-131-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When looping btrfs/074 with many cpus (>= 8), it's possible to trigger kernel warning due to first key verification: [ 4239.523446] WARNING: CPU: 5 PID: 2381 at fs/btrfs/disk-io.c:460 btree_read_extent_buffer_pages+0x1ad/0x210 [ 4239.523830] Modules linked in: [ 4239.524630] RIP: 0010:btree_read_extent_buffer_pages+0x1ad/0x210 [ 4239.527101] Call Trace: [ 4239.527251] read_tree_block+0x42/0x70 [ 4239.527434] read_node_slot+0xd2/0x110 [ 4239.527632] push_leaf_right+0xad/0x1b0 [ 4239.527809] split_leaf+0x4ea/0x700 [ 4239.527988] ? leaf_space_used+0xbc/0xe0 [ 4239.528192] ? btrfs_set_lock_blocking_rw+0x99/0xb0 [ 4239.528416] btrfs_search_slot+0x8cc/0xa40 [ 4239.528605] btrfs_insert_empty_items+0x71/0xc0 [ 4239.528798] __btrfs_run_delayed_refs+0xa98/0x1680 [ 4239.529013] btrfs_run_delayed_refs+0x10b/0x1b0 [ 4239.529205] btrfs_commit_transaction+0x33/0xaf0 [ 4239.529445] ? start_transaction+0xa8/0x4f0 [ 4239.529630] btrfs_alloc_data_chunk_ondemand+0x1b0/0x4e0 [ 4239.529833] btrfs_check_data_free_space+0x54/0xa0 [ 4239.530045] btrfs_delalloc_reserve_space+0x25/0x70 [ 4239.531907] btrfs_direct_IO+0x233/0x3d0 [ 4239.532098] generic_file_direct_write+0xcb/0x170 [ 4239.532296] btrfs_file_write_iter+0x2bb/0x5f4 [ 4239.532491] aio_write+0xe2/0x180 [ 4239.532669] ? lock_acquire+0xac/0x1e0 [ 4239.532839] ? __might_fault+0x3e/0x90 [ 4239.533032] do_io_submit+0x594/0x860 [ 4239.533223] ? do_io_submit+0x594/0x860 [ 4239.533398] SyS_io_submit+0x10/0x20 [ 4239.533560] ? SyS_io_submit+0x10/0x20 [ 4239.533729] do_syscall_64+0x75/0x1d0 [ 4239.533979] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 4239.534182] RIP: 0033:0x7f8519741697 The problem here is, at btree_read_extent_buffer_pages() we don't have acquired read/write lock on that extent buffer, only basic info like level/bytenr is reliable. So race condition leads to such false alert. However in current call site, it's impossible to acquire proper lock without race window. To fix the problem, we only verify first key for committed tree blocks (whose generation is no larger than fs_info->last_trans_committed), so the content of such tree blocks will not change and there is no need to get read/write lock. Reported-by: Nikolay Borisov <nborisov@suse.com> Fixes: 581c1760415c ("btrfs: Validate child tree block's level and first key") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: replace GPL boilerplate by SPDX -- sourcesDavid Sterba2018-04-121-14/+1Star
| | | | | | | | Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: clean up resources during umount after trans is abortedLiu Bo2018-04-121-1/+2
| | | | | | | | | | | | | | | | | | | | | | | Currently if some fatal errors occur, like all IO get -EIO, resources would be cleaned up when a) transaction is being committed or b) BTRFS_FS_STATE_ERROR is set However, in some rare cases, resources may be left alone after transaction gets aborted and umount may run into some ASSERT(), e.g. ASSERT(list_empty(&block_group->dirty_list)); For case a), in btrfs_commit_transaciton(), there're several places at the beginning where we just call btrfs_end_transaction() without cleaning up resources. For case b), it is possible that the trans handle doesn't have any dirty stuff, then only trans hanlde is marked as aborted while BTRFS_FS_STATE_ERROR is not set, so resources remain in memory. This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that all resources won't stay in memory after umount. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: print error messages when failing to read treesLiu Bo2018-03-311-9/+22
| | | | | | | | | | | | | | | When mount fails to read trees like fs tree, checksum tree, extent tree, etc, there is not enough information about where went wrong. With this, messages like "BTRFS warning (device sdf): failed to read root (objectid=7): -5" would help us a bit. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Validate child tree block's level and first keyQu Wenruo2018-03-311-13/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have several reports about node pointer points to incorrect child tree blocks, which could have even wrong owner and level but still with valid generation and checksum. Although btrfs check could handle it and print error message like: leaf parent key incorrect 60670574592 Kernel doesn't have enough check on this type of corruption correctly. At least add such check to read_tree_block() and btrfs_read_buffer(), where we need two new parameters @level and @first_key to verify the child tree block. The new @level check is mandatory and all call sites are already modified to extract expected level from its call chain. While @first_key is optional, the following call sites are skipping such check: 1) Root node/leaf As ROOT_ITEM doesn't contain the first key, skip @first_key check. 2) Direct backref Only parent bytenr and level is known and we need to resolve the key all by ourselves, skip @first_key check. Another note of this verification is, it needs extra info from nodeptr or ROOT_ITEM, so it can't fit into current tree-checker framework, which is limited to node/leaf boundary. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved spaceQu Wenruo2018-03-311-0/+1
| | | | | | | | | | | | | | | | | | | | | For quota disabled->enable case, it's possible that at reservation time quota was not enabled so no bytes were really reserved, while at release time, quota was enabled so we will try to release some bytes we didn't really own. Such situation can cause metadata reserveation underflow, for both types, also less possible for per-trans type since quota enable will commit transaction. To address this, record qgroup meta reserved bytes into root::qgroup_meta_rsv_pertrans and ::prealloc. So at releasing time we won't free any bytes we didn't reserve. For DATA, it's already handled by io_tree, so nothing needs to be done there. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: qgroup: Don't use root->qgroup_meta_rsv for qgroupQu Wenruo2018-03-311-1/+0Star
| | | | | | | | | Since qgroup has seperate metadata reservation types now, we can completely get rid of the old root->qgroup_meta_rsv, which mostly acts as current META_PERTRANS reservation type. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: defer adding raid type kobject until after chunk relocationJeff Mahoney2018-03-311-0/+2
| | | | | | | | | | | | | | | | | | | | | | Any time the first block group of a new type is created, we add a new kobject to sysfs to hold the attributes for that type. Kobject-internal allocations always use GFP_KERNEL, making them prone to fs-reclaim races. While it appears as if this can occur any time a block group is created, the only times the first block group of a new type can be created in memory is at mount and when we create the first new block group during raid conversion. This patch adds a new list to track pending kobject additions and then handles them after we do chunk relocation. Between relocating the target chunk (or forcing allocation of a new chunk in the case of data) and removing the old chunk, we're in a safe place for fs-reclaim to occur. We're holding the volume mutex, which is already held across page faults, and the delete_unused_bgs_mutex, which will only stall the cleaner thread. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: fix lockdep splat in btrfs_alloc_subvolume_writersJeff Mahoney2018-03-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While running btrfs/011, I hit the following lockdep splat. This is the important bit: pcpu_alloc+0x1ac/0x5e0 __percpu_counter_init+0x4e/0xb0 btrfs_init_fs_root+0x99/0x1c0 [btrfs] btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs] resolve_indirect_refs+0x130/0x830 [btrfs] find_parent_nodes+0x69e/0xff0 [btrfs] btrfs_find_all_roots_safe+0xa0/0x110 [btrfs] btrfs_find_all_roots+0x50/0x70 [btrfs] btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs] btrfs_commit_transaction+0x3ce/0x9b0 [btrfs] The percpu_counter_init call in btrfs_alloc_subvolume_writers uses GFP_KERNEL, which we can't do during transaction commit. This switches it to GFP_NOFS. ======================================================== WARNING: possible irq lock inversion dependency detected 4.12.14-kvmsmall #8 Tainted: G W -------------------------------------------------------- kswapd0/50 just changed the state of lock: (&delayed_node->mutex){+.+.-.}, at: [<ffffffffc06994fa>] __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] but this lock took another, RECLAIM_FS-unsafe lock in the past: (pcpu_alloc_mutex){+.+.+.} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Chain exists of: &delayed_node->mutex --> &found->groups_sem --> pcpu_alloc_mutex Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(pcpu_alloc_mutex); local_irq_disable(); lock(&delayed_node->mutex); lock(&found->groups_sem); <Interrupt> lock(&delayed_node->mutex); *** DEADLOCK *** 2 locks held by kswapd0/50: #0: (shrinker_rwsem){++++..}, at: [<ffffffff811dc11f>] shrink_slab+0x7f/0x5b0 #1: (&type->s_umount_key#30){+++++.}, at: [<ffffffff8126dec6>] trylock_super+0x16/0x50 the shortest dependencies between 2nd lock and 1st lock: -> (pcpu_alloc_mutex){+.+.+.} ops: 4904 { HARDIRQ-ON-W at: __mutex_lock+0x4e/0x8c0 pcpu_alloc+0x1ac/0x5e0 alloc_kmem_cache_cpus.isra.70+0x25/0xa0 __do_tune_cpucache+0x2c/0x220 do_tune_cpucache+0x26/0xc0 enable_cpucache+0x6d/0xf0 kmem_cache_init_late+0x42/0x75 start_kernel+0x343/0x4cb x86_64_start_kernel+0x127/0x134 secondary_startup_64+0xa5/0xb0 SOFTIRQ-ON-W at: __mutex_lock+0x4e/0x8c0 pcpu_alloc+0x1ac/0x5e0 alloc_kmem_cache_cpus.isra.70+0x25/0xa0 __do_tune_cpucache+0x2c/0x220 do_tune_cpucache+0x26/0xc0 enable_cpucache+0x6d/0xf0 kmem_cache_init_late+0x42/0x75 start_kernel+0x343/0x4cb x86_64_start_kernel+0x127/0x134 secondary_startup_64+0xa5/0xb0 RECLAIM_FS-ON-W at: __kmalloc+0x47/0x310 pcpu_extend_area_map+0x2b/0xc0 pcpu_alloc+0x3ec/0x5e0 alloc_kmem_cache_cpus.isra.70+0x25/0xa0 __do_tune_cpucache+0x2c/0x220 do_tune_cpucache+0x26/0xc0 enable_cpucache+0x6d/0xf0 __kmem_cache_create+0x1bf/0x390 create_cache+0xba/0x1b0 kmem_cache_create+0x1f8/0x2b0 ksm_init+0x6f/0x19d do_one_initcall+0x50/0x1b0 kernel_init_freeable+0x201/0x289 kernel_init+0xa/0x100 ret_from_fork+0x3a/0x50 INITIAL USE at: __mutex_lock+0x4e/0x8c0 pcpu_alloc+0x1ac/0x5e0 alloc_kmem_cache_cpus.isra.70+0x25/0xa0 setup_cpu_cache+0x2f/0x1f0 __kmem_cache_create+0x1bf/0x390 create_boot_cache+0x8b/0xb1 kmem_cache_init+0xa1/0x19e start_kernel+0x270/0x4cb x86_64_start_kernel+0x127/0x134 secondary_startup_64+0xa5/0xb0 } ... key at: [<ffffffff821d8e70>] pcpu_alloc_mutex+0x70/0xa0 ... acquired at: pcpu_alloc+0x1ac/0x5e0 __percpu_counter_init+0x4e/0xb0 btrfs_init_fs_root+0x99/0x1c0 [btrfs] btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs] resolve_indirect_refs+0x130/0x830 [btrfs] find_parent_nodes+0x69e/0xff0 [btrfs] btrfs_find_all_roots_safe+0xa0/0x110 [btrfs] btrfs_find_all_roots+0x50/0x70 [btrfs] btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs] btrfs_commit_transaction+0x3ce/0x9b0 [btrfs] transaction_kthread+0x176/0x1b0 [btrfs] kthread+0x102/0x140 ret_from_fork+0x3a/0x50 -> (&fs_info->commit_root_sem){++++..} ops: 1566382 { HARDIRQ-ON-W at: down_write+0x3e/0xa0 cache_block_group+0x287/0x420 [btrfs] find_free_extent+0x106c/0x12d0 [btrfs] btrfs_reserve_extent+0xd8/0x170 [btrfs] cow_file_range.isra.66+0x133/0x470 [btrfs] run_delalloc_range+0x121/0x410 [btrfs] writepage_delalloc.isra.50+0xfe/0x180 [btrfs] __extent_writepage+0x19a/0x360 [btrfs] extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs] extent_writepages+0x4d/0x60 [btrfs] do_writepages+0x1a/0x70 __filemap_fdatawrite_range+0xa7/0xe0 btrfs_rename+0x5ee/0xdb0 [btrfs] vfs_rename+0x52a/0x7e0 SyS_rename+0x351/0x3b0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 HARDIRQ-ON-R at: down_read+0x35/0x90 caching_thread+0x57/0x560 [btrfs] normal_work_helper+0x1c0/0x5e0 [btrfs] process_one_work+0x1e0/0x5c0 worker_thread+0x44/0x390 kthread+0x102/0x140 ret_from_fork+0x3a/0x50 SOFTIRQ-ON-W at: down_write+0x3e/0xa0 cache_block_group+0x287/0x420 [btrfs] find_free_extent+0x106c/0x12d0 [btrfs] btrfs_reserve_extent+0xd8/0x170 [btrfs] cow_file_range.isra.66+0x133/0x470 [btrfs] run_delalloc_range+0x121/0x410 [btrfs] writepage_delalloc.isra.50+0xfe/0x180 [btrfs] __extent_writepage+0x19a/0x360 [btrfs] extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs] extent_writepages+0x4d/0x60 [btrfs] do_writepages+0x1a/0x70 __filemap_fdatawrite_range+0xa7/0xe0 btrfs_rename+0x5ee/0xdb0 [btrfs] vfs_rename+0x52a/0x7e0 SyS_rename+0x351/0x3b0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 SOFTIRQ-ON-R at: down_read+0x35/0x90 caching_thread+0x57/0x560 [btrfs] normal_work_helper+0x1c0/0x5e0 [btrfs] process_one_work+0x1e0/0x5c0 worker_thread+0x44/0x390 kthread+0x102/0x140 ret_from_fork+0x3a/0x50 INITIAL USE at: down_write+0x3e/0xa0 cache_block_group+0x287/0x420 [btrfs] find_free_extent+0x106c/0x12d0 [btrfs] btrfs_reserve_extent+0xd8/0x170 [btrfs] cow_file_range.isra.66+0x133/0x470 [btrfs] run_delalloc_range+0x121/0x410 [btrfs] writepage_delalloc.isra.50+0xfe/0x180 [btrfs] __extent_writepage+0x19a/0x360 [btrfs] extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs] extent_writepages+0x4d/0x60 [btrfs] do_writepages+0x1a/0x70 __filemap_fdatawrite_range+0xa7/0xe0 btrfs_rename+0x5ee/0xdb0 [btrfs] vfs_rename+0x52a/0x7e0 SyS_rename+0x351/0x3b0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 } ... key at: [<ffffffffc0729578>] __key.61970+0x0/0xfffffffffff9aa88 [btrfs] ... acquired at: cache_block_group+0x287/0x420 [btrfs] find_free_extent+0x106c/0x12d0 [btrfs] btrfs_reserve_extent+0xd8/0x170 [btrfs] btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs] btrfs_create_tree+0xbb/0x2a0 [btrfs] btrfs_create_uuid_tree+0x37/0x140 [btrfs] open_ctree+0x23c0/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 -> (&found->groups_sem){++++..} ops: 2134587 { HARDIRQ-ON-W at: down_write+0x3e/0xa0 __link_block_group+0x34/0x130 [btrfs] btrfs_read_block_groups+0x33d/0x7b0 [btrfs] open_ctree+0x2054/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 HARDIRQ-ON-R at: down_read+0x35/0x90 btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs] open_ctree+0x207b/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 SOFTIRQ-ON-W at: down_write+0x3e/0xa0 __link_block_group+0x34/0x130 [btrfs] btrfs_read_block_groups+0x33d/0x7b0 [btrfs] open_ctree+0x2054/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 SOFTIRQ-ON-R at: down_read+0x35/0x90 btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs] open_ctree+0x207b/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 INITIAL USE at: down_write+0x3e/0xa0 __link_block_group+0x34/0x130 [btrfs] btrfs_read_block_groups+0x33d/0x7b0 [btrfs] open_ctree+0x2054/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 } ... key at: [<ffffffffc0729488>] __key.59101+0x0/0xfffffffffff9ab78 [btrfs] ... acquired at: find_free_extent+0xcb4/0x12d0 [btrfs] btrfs_reserve_extent+0xd8/0x170 [btrfs] btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs] __btrfs_cow_block+0x110/0x5b0 [btrfs] btrfs_cow_block+0xd7/0x290 [btrfs] btrfs_search_slot+0x1f6/0x960 [btrfs] btrfs_lookup_inode+0x2a/0x90 [btrfs] __btrfs_update_delayed_inode+0x65/0x210 [btrfs] btrfs_commit_inode_delayed_inode+0x121/0x130 [btrfs] btrfs_evict_inode+0x3fe/0x6a0 [btrfs] evict+0xc4/0x190 __dentry_kill+0xbf/0x170 dput+0x2ae/0x2f0 SyS_rename+0x2a6/0x3b0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 -> (&delayed_node->mutex){+.+.-.} ops: 5580204 { HARDIRQ-ON-W at: __mutex_lock+0x4e/0x8c0 btrfs_delayed_update_inode+0x46/0x6e0 [btrfs] btrfs_update_inode+0x83/0x110 [btrfs] btrfs_dirty_inode+0x62/0xe0 [btrfs] touch_atime+0x8c/0xb0 do_generic_file_read+0x818/0xb10 __vfs_read+0xdc/0x150 vfs_read+0x8a/0x130 SyS_read+0x45/0xa0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 SOFTIRQ-ON-W at: __mutex_lock+0x4e/0x8c0 btrfs_delayed_update_inode+0x46/0x6e0 [btrfs] btrfs_update_inode+0x83/0x110 [btrfs] btrfs_dirty_inode+0x62/0xe0 [btrfs] touch_atime+0x8c/0xb0 do_generic_file_read+0x818/0xb10 __vfs_read+0xdc/0x150 vfs_read+0x8a/0x130 SyS_read+0x45/0xa0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 IN-RECLAIM_FS-W at: __mutex_lock+0x4e/0x8c0 __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] btrfs_evict_inode+0x22c/0x6a0 [btrfs] evict+0xc4/0x190 dispose_list+0x35/0x50 prune_icache_sb+0x42/0x50 super_cache_scan+0x139/0x190 shrink_slab+0x262/0x5b0 shrink_node+0x2eb/0x2f0 kswapd+0x2eb/0x890 kthread+0x102/0x140 ret_from_fork+0x3a/0x50 INITIAL USE at: __mutex_lock+0x4e/0x8c0 btrfs_delayed_update_inode+0x46/0x6e0 [btrfs] btrfs_update_inode+0x83/0x110 [btrfs] btrfs_dirty_inode+0x62/0xe0 [btrfs] touch_atime+0x8c/0xb0 do_generic_file_read+0x818/0xb10 __vfs_read+0xdc/0x150 vfs_read+0x8a/0x130 SyS_read+0x45/0xa0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 } ... key at: [<ffffffffc072d488>] __key.56935+0x0/0xfffffffffff96b78 [btrfs] ... acquired at: __lock_acquire+0x264/0x11c0 lock_acquire+0xbd/0x1e0 __mutex_lock+0x4e/0x8c0 __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] btrfs_evict_inode+0x22c/0x6a0 [btrfs] evict+0xc4/0x190 dispose_list+0x35/0x50 prune_icache_sb+0x42/0x50 super_cache_scan+0x139/0x190 shrink_slab+0x262/0x5b0 shrink_node+0x2eb/0x2f0 kswapd+0x2eb/0x890 kthread+0x102/0x140 ret_from_fork+0x3a/0x50 stack backtrace: CPU: 1 PID: 50 Comm: kswapd0 Tainted: G W 4.12.14-kvmsmall #8 SLE15 (unreleased) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 Call Trace: dump_stack+0x78/0xb7 print_irq_inversion_bug.part.38+0x19f/0x1aa check_usage_forwards+0x102/0x120 ? ret_from_fork+0x3a/0x50 ? check_usage_backwards+0x110/0x110 mark_lock+0x16c/0x270 __lock_acquire+0x264/0x11c0 ? pagevec_lookup_entries+0x1a/0x30 ? truncate_inode_pages_range+0x2b3/0x7f0 lock_acquire+0xbd/0x1e0 ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] __mutex_lock+0x4e/0x8c0 ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] ? btrfs_evict_inode+0x1f6/0x6a0 [btrfs] __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] btrfs_evict_inode+0x22c/0x6a0 [btrfs] evict+0xc4/0x190 dispose_list+0x35/0x50 prune_icache_sb+0x42/0x50 super_cache_scan+0x139/0x190 shrink_slab+0x262/0x5b0 shrink_node+0x2eb/0x2f0 kswapd+0x2eb/0x890 kthread+0x102/0x140 ? mem_cgroup_shrink_node+0x2c0/0x2c0 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x3a/0x50 Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove max_active var from open_ctreeNikolay Borisov2018-03-311-3/+0Star
| | | | | | | | | Introduced by 5cdc7ad337fb ("btrfs: Replace fs_info->workers with btrfs_workqueue.") but obsoleted by 2a4581983f90 ("btrfs: factor btrfs_init_workqueues() out of open_ctree()"). Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Use sizeof directly instead of a constant variableNikolay Borisov2018-03-311-3/+2Star
| | | | | | | | | | | | | | The kernel would like to have all stack VLA usage removed[1]. Unfortunately using an integer constant variable as the size of an array is still considered a VLA. Instead let's use directly sizeof(var) which removes the VLA usage. Use the occasion to remove csum_size altogether and use sizeof() also for the size passed to memcmp [1]: https://lkml.org/lkml/2018/3/7/621 Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: rename submit callbacks and drop double underscoresDavid Sterba2018-03-311-4/+4
| | | | Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove unused parameters from extent_submit_bio_done_tDavid Sterba2018-03-311-4/+2Star
| | | | | | Remove parameters not used by any of the callbacks. Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove unused parameters from extent_submit_bio_start_tDavid Sterba2018-03-311-2/+0Star
| | | | | | Remove parameters not used by any of the callbacks. Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: separate types for submit_bio_start and submit_bio_doneDavid Sterba2018-03-311-4/+4
| | | | | | | | | | | The callbacks make use of different parameters that are passed to the other type unnecessarily. This patch adds separate types for each and the unused parameters will be removed. The type extent_submit_bio_hook_t keeps all parameters and can be used where the start/done types are not appropriate. Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: rename btrfs_close_extra_device to btrfs_free_extra_devidsAnand Jain2018-03-261-4/+4
| | | | | | | | | | | This function btrfs_close_extra_devices() is about freeing extra devids which once it may have belonged to this filesystem. So rename it and add the comment. The _devid suffix is appropriate as this function won't handle devices which are outside of the filesytem being mounted. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: add more __cold annotationsDavid Sterba2018-03-261-1/+1
| | | | | | | | | | | | | | | | The __cold functions are placed to a special section, as they're expected to be called rarely. This could help i-cache prefetches or help compiler to decide which branches are more/less likely to be taken without any other annotations needed. Though we can't add more __exit annotations, it's still possible to add __cold (that's also added with __exit). That way the following function categories are tagged: - printf wrappers, error messages - exit helpers Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove custom crc32c init codeNikolay Borisov2018-03-261-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The custom crc32 init code was introduced in 14a958e678cd ("Btrfs: fix btrfs boot when compiled as built-in") to enable using btrfs as a built-in. However, later as pointed out by 60efa5eb2e88 ("Btrfs: use late_initcall instead of module_init") this wasn't enough and finally btrfs was switched to late_initcall which comes after the generic crc32c implementation is initiliased. The latter commit superseeded the former. Now that we don't have to maintain our own code let's just remove it and switch to using the generic implementation. Despite touching a lot of files the patch is really simple. Here is the gist of the changes: 1. Select LIBCRC32C rather than the low-level modules. 2. s/btrfs_crc32c/crc32c/g 3. replace hash.h with linux/crc32c.h 4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly. I've tested this with btrfs being both a module and a built-in and xfstest doesn't complain. Does seem to fix the longstanding problem of not automatically selectiong the crc32c module when btrfs is used. Possibly there is a workaround in dracut. The modinfo confirms that now all the module dependencies are there: before: depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate after: depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add more info to changelog from mails ] Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: tree-checker: Replace root parameter with fs_infoQu Wenruo2018-03-261-3/+3
| | | | | | | | | | | | | | | | | | When inspecting the error message with real corruption, the "root=%llu" always shows "1" (root tree), instead of the correct owner. The problem is that we are getting @root from page->mapping->host, which points the same btree inode, so we will always get the same root. This makes the root owner output meaningless, and harder to port tree-checker to btrfs-progs. So get rid of the false and meaningless @root parameter and replace it with @fs_info. To get the owner, we can only rely on btrfs_header_owner() now. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Use schedule_timeout_interruptibleNikolay Borisov2018-03-261-3/+1Star
| | | | | | | | | | | Instead of manually fiddling with the state of the task (RUNNING->INTERRUPTIBLE->RUNNING) again just use schedule_timeout_interruptible which adjusts the task state as needed. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove unused function btrfs_async_submit_limit()Anand Jain2018-03-261-8/+0Star
| | | | | | | | | | | | | | Commit [1] removed the need to use btrfs_async_submit_limit(), so delete it. [1] commit 736cd52e0c720103f52ab9da47b6cc3af6b083f6 Btrfs: remove nr_async_submits and async_submit_draining Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Document consistency of transaction->io_bgs listNikolay Borisov2018-03-261-0/+4
| | | | | | | | | The reason why io_bgs can be modified without holding any lock is non-obvious. Document it and reference that documentation from the respective call sites. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove invalid null checks from btrfs_cleanup_dirty_bgsNikolay Borisov2018-03-261-9/+0Star
| | | | | | | | | | | | list_first_entry is essentially a wrapper over cotnainer_of. The latter can never return null even if it's working on inconsistent list since it will either crash or return some offset in the wrong struct. Additionally, for the dirty_bgs list the iteration is done under dirty_bgs_lock which ensures consistency of the list. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: manage thread_pool mount option as %uAnand Jain2018-03-261-2/+2
| | | | | | | | | | The mount option thread_pool is always unsigned. Manage it that way all around. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: print error if primary super block write failsHoward McLauchlan2018-03-261-1/+14
| | | | | | | | | | | | | | | | | | | Presently, failing a primary super block write but succeeding in at least one super block write in general will appear to users as if nothing important went wrong. However, upon unmounting and re-mounting, the file system will be in a rolled back state. This was discovered with a BCC program that uses bpf_override_return() to fail super block writes. This patch outputs an error clarifying that the primary super block write has failed, so users can expect potentially erroneous behaviour. It also forces wait_dev_supers() to return an error to its caller if the primary super block write fails. Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2018-01-311-0/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: 1) Significantly shrink the core networking routing structures. Result of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf 2) Add netdevsim driver for testing various offloads, from Jakub Kicinski. 3) Support cross-chip FDB operations in DSA, from Vivien Didelot. 4) Add a 2nd listener hash table for TCP, similar to what was done for UDP. From Martin KaFai Lau. 5) Add eBPF based queue selection to tun, from Jason Wang. 6) Lockless qdisc support, from John Fastabend. 7) SCTP stream interleave support, from Xin Long. 8) Smoother TCP receive autotuning, from Eric Dumazet. 9) Lots of erspan tunneling enhancements, from William Tu. 10) Add true function call support to BPF, from Alexei Starovoitov. 11) Add explicit support for GRO HW offloading, from Michael Chan. 12) Support extack generation in more netlink subsystems. From Alexander Aring, Quentin Monnet, and Jakub Kicinski. 13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From Russell King. 14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso. 15) Many improvements and simplifications to the NFP driver bpf JIT, from Jakub Kicinski. 16) Support for ipv6 non-equal cost multipath routing, from Ido Schimmel. 17) Add resource abstration to devlink, from Arkadi Sharshevsky. 18) Packet scheduler classifier shared filter block support, from Jiri Pirko. 19) Avoid locking in act_csum, from Davide Caratti. 20) devinet_ioctl() simplifications from Al viro. 21) More TCP bpf improvements from Lawrence Brakmo. 22) Add support for onlink ipv6 route flag, similar to ipv4, from David Ahern. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits) tls: Add support for encryption using async offload accelerator ip6mr: fix stale iterator net/sched: kconfig: Remove blank help texts openvswitch: meter: Use 64-bit arithmetic instead of 32-bit tcp_nv: fix potential integer overflow in tcpnv_acked r8169: fix RTL8168EP take too long to complete driver initialization. qmi_wwan: Add support for Quectel EP06 rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK ipmr: Fix ptrdiff_t print formatting ibmvnic: Wait for device response when changing MAC qlcnic: fix deadlock bug tcp: release sk_frag.page in tcp_disconnect ipv4: Get the address of interface correctly. net_sched: gen_estimator: fix lockdep splat net: macb: Handle HRESP error net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring ipv6: addrconf: break critical section in addrconf_verify_rtnl() ipv6: change route cache aging logic i40e/i40evf: Update DESC_NEEDED value to reflect larger value bnxt_en: cleanup DIM work on device shutdown ...
| * error-injection: Add injectable error typesMasami Hiramatsu2018-01-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add injectable error types for each error-injectable function. One motivation of error injection test is to find software flaws, mistakes or mis-handlings of expectable errors. If we find such flaws by the test, that is a program bug, so we need to fix it. But if the tester miss input the error (e.g. just return success code without processing anything), it causes unexpected behavior even if the caller is correctly programmed to handle any errors. That is not what we want to test by error injection. To clarify what type of errors the caller must expect for each injectable function, this introduces injectable error types: - EI_ETYPE_NULL : means the function will return NULL if it fails. No ERR_PTR, just a NULL. - EI_ETYPE_ERRNO : means the function will return -ERRNO if it fails. - EI_ETYPE_ERRNO_NULL : means the function will return -ERRNO (ERR_PTR) or NULL. ALLOW_ERROR_INJECTION() macro is expanded to get one of NULL, ERRNO, ERRNO_NULL to record the error type for each function. e.g. ALLOW_ERROR_INJECTION(open_ctree, ERRNO) This error types are shown in debugfs as below. ==== / # cat /sys/kernel/debug/error_injection/list open_ctree [btrfs] ERRNO io_ctl_init [btrfs] ERRNO ==== Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * error-injection: Separate error-injection from kprobeMasami Hiramatsu2018-01-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since error-injection framework is not limited to be used by kprobes, nor bpf. Other kernel subsystems can use it freely for checking safeness of error-injection, e.g. livepatch, ftrace etc. So this separate error-injection framework from kprobes. Some differences has been made: - "kprobe" word is removed from any APIs/structures. - BPF_ALLOW_ERROR_INJECTION() is renamed to ALLOW_ERROR_INJECTION() since it is not limited for BPF too. - CONFIG_FUNCTION_ERROR_INJECTION is the config item of this feature. It is automatically enabled if the arch supports error injection feature for kprobe or ftrace etc. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2017-12-181-0/+2
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf-next 2017-12-18 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Allow arbitrary function calls from one BPF function to another BPF function. As of today when writing BPF programs, __always_inline had to be used in the BPF C programs for all functions, unnecessarily causing LLVM to inflate code size. Handle this more naturally with support for BPF to BPF calls such that this __always_inline restriction can be overcome. As a result, it allows for better optimized code and finally enables to introduce core BPF libraries in the future that can be reused out of different projects. x86 and arm64 JIT support was added as well, from Alexei. 2) Add infrastructure for tagging functions as error injectable and allow for BPF to return arbitrary error values when BPF is attached via kprobes on those. This way of injecting errors generically eases testing and debugging without having to recompile or restart the kernel. Tags for opting-in for this facility are added with BPF_ALLOW_ERROR_INJECTION(), from Josef. 3) For BPF offload via nfp JIT, add support for bpf_xdp_adjust_head() helper call for XDP programs. First part of this work adds handling of BPF capabilities included in the firmware, and the later patches add support to the nfp verifier part and JIT as well as some small optimizations, from Jakub. 4) The bpftool now also gets support for basic cgroup BPF operations such as attaching, detaching and listing current BPF programs. As a requirement for the attach part, bpftool can now also load object files through 'bpftool prog load'. This reuses libbpf which we have in the kernel tree as well. bpftool-cgroup man page is added along with it, from Roman. 5) Back then commit e87c6bc3852b ("bpf: permit multiple bpf attachments for a single perf event") added support for attaching multiple BPF programs to a single perf event. Given they are configured through perf's ioctl() interface, the interface has been extended with a PERF_EVENT_IOC_QUERY_BPF command in this work in order to return an array of one or multiple BPF prog ids that are currently attached, from Yonghong. 6) Various minor fixes and cleanups to the bpftool's Makefile as well as a new 'uninstall' and 'doc-uninstall' target for removing bpftool itself or prior installed documentation related to it, from Quentin. 7) Add CONFIG_CGROUP_BPF=y to the BPF kernel selftest config file which is required for the test_dev_cgroup test case to run, from Naresh. 8) Fix reporting of XDP prog_flags for nfp driver, from Jakub. 9) Fix libbpf's exit code from the Makefile when libelf was not found in the system, also from Jakub. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * btrfs: make open_ctree error injectableJosef Bacik2017-12-121-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | This allows us to do error injection with BPF for open_ctree. Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* | | btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPPAnand Jain2018-01-221-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It appears from the original commit [1] that there isn't any design specific reason not to fail the mount instead of just warning. This patch will change it to fail. [1] commit 319e4d0661e5323c9f9945f0f8fb5905e5fe74c3 btrfs: Enhance super validation check Fixes: 319e4d0661e5323 ("btrfs: Enhance super validation check") Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | btrfs: define SUPER_FLAG_METADUMP_V2Anand Jain2018-01-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs-progs uses super flag bit BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34). So just define that in kernel so that we know its been used. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | btrfs: factor btrfs_check_rw_degradable() to check given deviceAnand Jain2018-01-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Update btrfs_check_rw_degradable() to check against the given device if its lost. We can use this function to know if the volume is going to be in degraded mode OR failed state, when the given device fails. Which is needed when we are handling the device failed state. A preparatory patch does not affect the flow as such. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ enhance comment ] Signed-off-by: David Sterba <dsterba@suse.com>
* | | btrfs: sink unlock_extent parameter gfp_flagsDavid Sterba2018-01-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the parameter to the function, though we lose some of the slightly better semantics of GFP_KERNEL in some places, it's worth cleaning up the callchains. Signed-off-by: David Sterba <dsterba@suse.com>
* | | btrfs: cleanup device states define BTRFS_DEV_STATE_FLUSH_SENTAnand Jain2018-01-221-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently device state is being managed by each individual int variable such as struct btrfs_device::is_tgtdev_for_dev_replace. Instead of that declare btrfs_device::dev_state BTRFS_DEV_STATE_FLUSH_SENT and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | btrfs: cleanup device states define BTRFS_DEV_STATE_MISSINGAnand Jain2018-01-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently device state is being managed by each individual int variable such as struct btrfs_device::missing. Instead of that declare btrfs_device::dev_state BTRFS_DEV_STATE_MISSING and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by : Nikolay Borisov <nborisov@suse.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
* | | btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATAAnand Jain2018-01-221-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently device state is being managed by each individual int variable such as struct btrfs_device::in_fs_metadata. Instead of that declare device state BTRFS_DEV_STATE_IN_FS_METADATA and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
* | | btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLEAnand Jain2018-01-221-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently device state is being managed by each individual int variable such as struct btrfs_device::writeable. Instead of that declare device state BTRFS_DEV_STATE_WRITEABLE and use the bit operations. Signed-off-by: Anand Jain <anand.jain@oracle.com> [ whitespace adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
* | | btrfs: switch to on-stack csum buffer in csum_tree_blockDavid Sterba2018-01-221-13/+3Star
| | | | | | | | | | | | | | | | | | | | | | | | The maximum size of a checksum buffer is known, BTRFS_CSUM_SIZE, and we don't have to allocate it dynamically. This code path is not used at all as we have only the crc32c and use an on-stack buffer already. Signed-off-by: David Sterba <dsterba@suse.com>
* | | btrfs: sink get_extent parameter to read_extent_buffer_pagesDavid Sterba2018-01-221-4/+4
| | | | | | | | | | | | | | | | | | All callers pass btree_get_extent, which needs to be exported. Signed-off-by: David Sterba <dsterba@suse.com>