summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
Commit message (Collapse)AuthorAgeFilesLines
...
* | btrfs: fix use-after-free due to race between replace start and cancelAnand Jain2018-12-171-22/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The device replace cancel thread can race with the replace start thread and if fs_info::scrubs_running is not yet set, btrfs_scrub_cancel() will fail to stop the scrub thread. The scrub thread continues with the scrub for replace which then will try to write to the target device and which is already freed by the cancel thread. scrub_setup_ctx() warns as tgtdev is NULL. struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace) { ... if (is_dev_replace) { WARN_ON(!fs_info->dev_replace.tgtdev); <=== sctx->pages_per_wr_bio = SCRUB_PAGES_PER_WR_BIO; sctx->wr_tgtdev = fs_info->dev_replace.tgtdev; sctx->flush_all_writes = false; } [ 6724.497655] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc started [ 6753.945017] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc canceled [ 6852.426700] WARNING: CPU: 0 PID: 4494 at fs/btrfs/scrub.c:622 scrub_setup_ctx.isra.19+0x220/0x230 [btrfs] ... [ 6852.428928] RIP: 0010:scrub_setup_ctx.isra.19+0x220/0x230 [btrfs] ... [ 6852.432970] Call Trace: [ 6852.433202] btrfs_scrub_dev+0x19b/0x5c0 [btrfs] [ 6852.433471] btrfs_dev_replace_start+0x48c/0x6a0 [btrfs] [ 6852.433800] btrfs_dev_replace_by_ioctl+0x3a/0x60 [btrfs] [ 6852.434097] btrfs_ioctl+0x2476/0x2d20 [btrfs] [ 6852.434365] ? do_sigaction+0x7d/0x1e0 [ 6852.434623] do_vfs_ioctl+0xa9/0x6c0 [ 6852.434865] ? syscall_trace_enter+0x1c8/0x310 [ 6852.435124] ? syscall_trace_enter+0x1c8/0x310 [ 6852.435387] ksys_ioctl+0x60/0x90 [ 6852.435663] __x64_sys_ioctl+0x16/0x20 [ 6852.435907] do_syscall_64+0x50/0x180 [ 6852.436150] entry_SYSCALL_64_after_hwframe+0x49/0xbe Further, as the replace thread enters scrub_write_page_to_dev_replace() without the target device it panics: static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx, struct scrub_page *spage) { ... bio_set_dev(bio, sbio->dev->bdev); <====== [ 6929.715145] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0 .. [ 6929.717106] Workqueue: btrfs-scrub btrfs_scrub_helper [btrfs] [ 6929.717420] RIP: 0010:scrub_write_page_to_dev_replace+0xb4/0x260 [btrfs] .. [ 6929.721430] Call Trace: [ 6929.721663] scrub_write_block_to_dev_replace+0x3f/0x60 [btrfs] [ 6929.721975] scrub_bio_end_io_worker+0x1af/0x490 [btrfs] [ 6929.722277] normal_work_helper+0xf0/0x4c0 [btrfs] [ 6929.722552] process_one_work+0x1f4/0x520 [ 6929.722805] ? process_one_work+0x16e/0x520 [ 6929.723063] worker_thread+0x46/0x3d0 [ 6929.723313] kthread+0xf8/0x130 [ 6929.723544] ? process_one_work+0x520/0x520 [ 6929.723800] ? kthread_delayed_work_timer_fn+0x80/0x80 [ 6929.724081] ret_from_fork+0x3a/0x50 Fix this by letting the btrfs_dev_replace_finishing() to do the job of cleaning after the cancel, including freeing of the target device. btrfs_dev_replace_finishing() is called when btrfs_scub_dev() returns along with the scrub return status. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: dev-replace: go back to suspend state if another EXCL_OP is runningAnand Jain2018-12-171-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a secnario where balance and replace co-exists as below, - start balance - pause balance - start replace - reboot and when system restarts, balance resumes first. Then the replace is attempted to restart but will fail as the EXCL_OP lock is already held by the balance. If so place the replace state back to BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state. Fixes: 010a47bde9420 ("btrfs: add proper safety check before resuming dev-replace") CC: stable@vger.kernel.org # 4.18+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: dev-replace: go back to suspended state if target device is missingAnand Jain2018-12-171-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the time of forced unmount we place the running replace to BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state, so when the system comes back and expect the target device is missing. Then let the replace state continue to be in BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state instead of BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED as there isn't any matching scrub running as part of replace. Fixes: e93c89c1aaaa ("Btrfs: add new sources for device replace code") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: mark btrfs_dev_replace_start as staticAnand Jain2018-12-172-4/+1Star
| | | | | | | | | | | | | | | | There isn't any other consumer other than in its own file dev-replace.c. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: harden agaist duplicate fsid on scanned devicesAnand Jain2018-12-171-0/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's not that impossible to imagine that a device OR a btrfs image is copied just by using the dd or the cp command. Which in case both the copies of the btrfs will have the same fsid. If on the system with automount enabled, the copied FS gets scanned. We have a known bug in btrfs, that we let the device path be changed after the device has been mounted. So using this loop hole the new copied device would appears as if its mounted immediately after it's been copied. For example: Initially.. /dev/mmcblk0p4 is mounted as / $ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT mmcblk0 179:0 0 29.2G 0 disk |-mmcblk0p4 179:4 0 4G 0 part / |-mmcblk0p2 179:2 0 500M 0 part /boot |-mmcblk0p3 179:3 0 256M 0 part [SWAP] `-mmcblk0p1 179:1 0 256M 0 part /boot/efi $ btrfs fi show Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba Total devices 1 FS bytes used 1.40GiB devid 1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4 Copy mmcblk0 to sda $ dd if=/dev/mmcblk0 of=/dev/sda And immediately after the copy completes the change in the device superblock is notified which the automount scans using btrfs device scan and the new device sda becomes the mounted root device. $ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda 8:0 1 14.9G 0 disk |-sda4 8:4 1 4G 0 part / |-sda2 8:2 1 500M 0 part |-sda3 8:3 1 256M 0 part `-sda1 8:1 1 256M 0 part mmcblk0 179:0 0 29.2G 0 disk |-mmcblk0p4 179:4 0 4G 0 part |-mmcblk0p2 179:2 0 500M 0 part /boot |-mmcblk0p3 179:3 0 256M 0 part [SWAP] `-mmcblk0p1 179:1 0 256M 0 part /boot/efi $ btrfs fi show / Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba Total devices 1 FS bytes used 1.40GiB devid 1 size 4.00GiB used 3.00GiB path /dev/sda4 The bug is quite nasty that you can't either unmount /dev/sda4 or /dev/mmcblk0p4. And the problem does not get solved until you take sda out of the system on to another system to change its fsid using the 'btrfstune -u' command. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: introduce nparity raid_attrHans van Kranenburg2018-12-172-7/+13
| | | | | | | | | | | | | | | | Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an nparity field in raid_attr. Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: fix ncopies raid_attr for RAID56Hans van Kranenburg2018-12-171-2/+2
| | | | | | | | | | | | | | | | | | RAID5 and RAID6 profile store one copy of the data, not 2 or 3. These values are not yet used anywhere so there's no change. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: alloc_chunk: fix more DUP stripe size handlingHans van Kranenburg2018-12-171-9/+7Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 92e222df7b "btrfs: alloc_chunk: fix DUP stripe size handling" fixed calculating the stripe_size for a new DUP chunk. However, the same calculation reappears a bit later, and that one was not changed yet. The resulting bug that is exposed is that the newly allocated device extents ('stripes') can have a few MiB overlap with the next thing stored after them, which is another device extent or the end of the disk. The scenario in which this can happen is: * The block device for the filesystem is less than 10GiB in size. * The amount of contiguous free unallocated disk space chosen to use for chunk allocation is 20% of the total device size, or a few MiB more or less. An example: - The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB) - There's 1578MiB unallocated raw disk space left in one contiguous piece. In this case stripe_size is first calculated as 789MiB, (half of 1578MiB). Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we enter the if block. Now stripe_size value is immediately overwritten while calculating an adjusted value based on max_chunk_size, which ends up as 788MiB. Next, the value is rounded up to a 16MiB boundary, 800MiB, which is actually more than the value we had before. However, the last comparison fails to detect this, because it's comparing the value with the total amount of free space, which is about twice the size of stripe_size. In the example above, this means that the resulting raw disk space being allocated is 1600MiB, while only a gap of 1578MiB has been found. The second device extent object for this DUP chunk will overlap for 22MiB with whatever comes next. The underlying problem here is that the stripe_size is reused all the time for different things. So, when entering the code in the if block, stripe_size is immediately overwritten with something else. If later we decide we want to have the previous value back, then the logic to compute it was copy pasted in again. With this change, the value in stripe_size is not unnecessarily destroyed, so the duplicated calculation is not needed any more. Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: alloc_chunk: improve chunk size variable nameHans van Kranenburg2018-12-171-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | The variable num_bytes is really a way too generic name for a variable in this function. There are a dozen other variables that hold a number of bytes as value. Give it a name that actually describes what it does, which is holding the size of the chunk that we're allocating. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: alloc_chunk: do not refurbish num_bytesHans van Kranenburg2018-12-171-4/+3Star
| | | | | | | | | | | | | | | | | | | | The variable num_bytes is used to store the chunk length of the chunk that we're allocating. Do not reuse it for something really different in the same function. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: use tagged writepage to mitigate livelock of snapshotEthan Lien2018-12-175-8/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Snapshot is expected to be fast. But if there are writers steadily creating dirty pages in our subvolume, the snapshot may take a very long time to complete. To fix the problem, we use tagged writepage for snapshot flusher as we do in the generic write_cache_pages(), so we can omit pages dirtied after the snapshot command. This does not change the semantics regarding which data get to the snapshot, if there are pages being dirtied during the snapshotting operation. There's a sync called before snapshot is taken in old/new case, any IO in flight just after that may be in the snapshot but this depends on other system effects that might still sync the IO. We do a simple snapshot speed test on a Intel D-1531 box: fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G --direct=0 --thread=1 --numjobs=1 --time_based --runtime=120 --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5; time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio original: 1m58sec patched: 6.54sec This is the best case for this patch since for a sequential write case, we omit nearly all pages dirtied after the snapshot command. For a multi writers, random write test: fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G --direct=0 --thread=1 --numjobs=4 --time_based --runtime=120 --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5; time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio original: 15.83sec patched: 10.35sec The improvement is smaller compared to the sequential write case, since we omit only half of the pages dirtied after snapshot command. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove unused extent_state argument from ↵Nikolay Borisov2018-12-174-14/+11Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_writepage_endio_finish_ordered This parameter was never used, yet was part of the interface of the function ever since its introduction as extent_io_ops::writepage_end_io_hook in e6dcd2dc9c48 ("Btrfs: New data=ordered implementation"). Now that NULL is passed everywhere as a value for this parameter let's remove it for good. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove extent_page_data argument from writepage_delallocNikolay Borisov2018-12-171-7/+4Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The only remaining use of the 'epd' argument in writepage_delalloc is to reference the extent_io_tree which was set in extent_writepages. Since it is guaranteed that page->mapping of any page passed to writepage_delalloc (and __extent_writepage as the sole caller) to be equal to that passed in extent_writepages we can directly get the io_tree via the already passed inode (which is also taken from page->mapping->host). No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Move epd::extent_locked check to writepage_delalloc's callerNikolay Borisov2018-12-171-7/+8
| | | | | | | | | | | | | | | | | | | | | | If epd::extent_locked is set then writepage_delalloc terminates. Make this a bit more apparent in the caller by simply bubbling the check up. This enables to remove epd as an argument to writepage_delalloc in a future patch. No functional change. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Check for missing device before bio submission in btrfs_map_bioNikolay Borisov2018-12-171-7/+2Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before btrfs_map_bio submits all stripe bios it does a number of checks to ensure the device for every stripe is present. However, it doesn't do a DEV_STATE_MISSING check, instead this is relegated to the lower level btrfs_schedule_bio (in the async submission case, sync submission doesn't check DEV_STATE_MISSING at all). Additionally btrfs_schedule_bios does the duplicate device->bdev check which has already been performed in btrfs_map_bio. This patch moves the DEV_STATE_MISSING check in btrfs_map_bio and removes the duplicate device->bdev check. Doing so ensures that no bio cloning/submission happens for both async/sync requests in the face of missing device. This makes the async io submission path slightly shorter in terms of instruction count. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: remove redundant replace_state initAnand Jain2018-12-171-1/+0Star
| | | | | | | | | | | | | | | | | | | | dev_replace::replace_state has been set to BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED (0) in the same function, So delete the line which sets replace_state = 0; Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: remove no longer used io_err from btrfs_log_ctxFilipe Manana2018-12-172-21/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The io_err field of struct btrfs_log_ctx is no longer used after the recent simplification of the fast fsync path, where we now wait for ordered extents to complete before logging the inode. We did this in commit b5e6c3e170b7 ("btrfs: always wait on ordered extents at fsync time") and commit a2120a473a80 ("btrfs: clean up the left over logged_list usage") removed its last use. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: simpler and more efficient cleanup of a log tree's extent io treeFilipe Manana2018-12-171-14/+2Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently are in a loop finding each range (corresponding to a btree node/leaf) in a log root's extent io tree and then clean it up. This is a waste of time since we are traversing the extent io tree's rb_tree more times then needed (one for a range lookup and another for cleaning it up) without any good reason. We free the log trees when we are in the critical section of a transaction commit (the transaction state is set to TRANS_STATE_COMMIT_DOING), so it's of great convenience to do everything as fast as possible in order to reduce the time we block other tasks from starting a new transaction. So fix this by traversing the extent io tree once and cleaning up all its records in one go while traversing it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Adjust loop in free_extent_bufferNikolay Borisov2018-12-173-9/+3Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The loop construct in free_extent_buffer was added in 242e18c7c1a8 ("Btrfs: reduce lock contention on extent buffer locks") as means of reducing the times the eb lock is taken, the non-last ref count is decremented and lock is released. As the special handling of UNMAPPED extent buffers was removed now there is only one decrement op which is happening for EXTENT_BUFFER_UNMAPPED case. This commit modifies the loop condition so that in case of UNMAPPED buffers the eb's lock is taken only if we are 100% sure the eb is going to be freed by the current executor of the code. Additionally, remove superfluous ref count ops in btrfs test. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove special handling of EXTENT_BUFFER_UNMAPPED while freeingNikolay Borisov2018-12-172-12/+5Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that the whole of btrfs code has been audited for eb reference count management it's time to remove the hunk in free_extent_buffer that essentially considered the condition "eb->ref == 2 && EXTENT_BUFFER_DUMMY" to equal "eb->ref = 1". Also remove the last location which takes an extra reference count in alloc_test_extent_buffer. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove unnecessary tree locking code in qgroup_rescan_leafNikolay Borisov2018-12-171-6/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In qgroup_rescan_leaf a copy is made of the target leaf by calling btrfs_clone_extent_buffer. The latter allocates a new buffer and attaches a new set of pages and copies the content of the source buffer. The new scratch buffer is only used to iterate it's items, it's not published anywhere and cannot be accessed by a third party. Hence, it's not necessary to perform any locking on it whatsoever. Furthermore, remove the extra extent_buffer_get call since the new buffer is always allocated with a reference count of 1 which is sufficient here. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove extra reference count bumps in btrfs_compare_treesNikolay Borisov2018-12-171-2/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | When the 2 comparison trees roots are initialised they are private to the function and already have reference counts of 1 each. There is no need to further increment the reference count since the cloned buffers are already accessed via struct btrfs_path. Eventually the 2 paths used for comparison are going to be released, effectively disposing of the cloned buffers. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove extraneous extent_buffer_get from tree_mod_log_rewindNikolay Borisov2018-12-171-1/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a rewound buffer is created it already has a ref count of 1 and the dummy flag set. Then another ref is taken bumping the count to 2. Finally when this buffer is released from btrfs_release_path the extra reference is decremented by the special handling code in free_extent_buffer. However, this special code is in fact redundant sinca ref count of 1 is still correct since the buffer is only accessed via btrfs_path struct. This paves the way forward of removing the special handling in free_extent_buffer. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove redundant extent_buffer_get in get_old_rootNikolay Borisov2018-12-171-1/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | get_old_root used used only by btrfs_search_old_slot to initialise the path structure. The old root is always a cloned buffer (either via alloc dummy or via btrfs_clone_extent_buffer) and its reference count is 2: 1 from allocation, 1 from extent_buffer_get call in get_old_root. This latter explicit ref count acquire operation is in fact unnecessary since the semantic is such that the newly allocated buffer is handed over to the btrfs_path for lifetime management. Considering this just remove the extra extent_buffer_get in get_old_root. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove needless tree locking in iterate_inode_extrefsNikolay Borisov2018-12-171-5/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In iterate_inode_exrefs the eb is cloned via btrfs_clone_extent_buffer which creates a private extent buffer with the dummy flag set and ref count of 1. Then this buffer is locked for reading and its ref count is incremented by 1. Finally it's fed to the passed iterate_irefs_t function. The actual iterate call back is inode_to_path (coming from paths_from_inode) which feeds the eb to btrfs_ref_to_path. In this final function the passed eb is only read by first assigning it to the local eb variable. This variable is only modified in the case another eb was referenced from the passed path that is eb != eb_in check triggers. Considering this there is no point in locking the cloned eb in iterate_inode_refs since it's never being modified and is not published anywhere. Furthermore the cloned eb is completely fine having its ref count be 1. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove needless tree locking in iterate_inode_refsNikolay Borisov2018-12-171-4/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In iterate_inode_refs the eb is cloned via btrfs_clone_extent_buffer which creates a private extent buffer with the dummy flag set and ref count of 1. Then this buffer is locked for reading and its ref count is incremented by 1. Finally it's fed to the passed iterate_irefs_t function. The actual iterate call back is inode_to_path (coming from paths_from_inode) which feeds the eb to btrfs_ref_to_path. In this final function the passed eb is only read by first assigning it to the local eb variable. This variable is only modified in the case another eb was referenced from the passed path that is eb != eb_in check triggers. Considering this there is no point in locking the cloned eb in iterate_inode_refs since it's never being modified and is not published anywhere. Furthermore the cloned eb is completely fine having its ref count be 1. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: tests: Use BTRFS_MAX_EXTENT_SIZE to replace the intermediate numberQu Wenruo2018-12-171-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In extent-io self test, we need 2 ordered extents at its maximum size to do the test. Instead of using the intermediate numbers, use BTRFS_MAX_EXTENT_SIZE for @max_bytes, and twice @max_bytes for @total_dirty. This should explain why we need all these magic numbers and prevent people to modify them by accident. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: support swap filesOmar Sandoval2018-12-171-0/+340
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Btrfs has not allowed swap files since commit 35054394c4b3 ("Btrfs: stop providing a bmap operation to avoid swapfile corruptions"). However, now that the proper restrictions are in place, Btrfs can support swap files through the swap file a_ops, similar to iomap in commit 67482129cdab ("iomap: add a swapfile activation function"). For Btrfs, activation needs to make sure that the file can be used as a swap file, which currently means that it must be fully allocated as NOCOW with no compression on one device. It must also do the proper tracking so that ioctls will not interfere with the swap file. Deactivation clears this tracking. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: rename and export get_chunk_mapOmar Sandoval2018-12-172-11/+20
| | | | | | | | | | | | | | | | | | | | The Btrfs swap code is going to need it, so give it a btrfs_ prefix and make it non-static. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: prevent ioctls from interfering with a swap fileOmar Sandoval2018-12-176-12/+130
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A later patch will implement swap file support for Btrfs, but before we do that, we need to make sure that the various Btrfs ioctls cannot change a swap file. When a swap file is active, we must make sure that the extents of the file are not moved and that they don't become shared. That means that the following are not safe: - chattr +c (enable compression) - reflink - dedupe - snapshot - defrag Don't allow those to happen on an active swap file. Additionally, balance, resize, device remove, and device replace are also unsafe if they affect an active swapfile. Add a red-black tree of block groups and devices which contain an active swapfile. Relocation checks each block group against this tree and skips it or errors out for balance or resize, respectively. Device remove and device replace check the tree for the device they will operate on. Note that we don't have to worry about chattr -C (disable nocow), which we ignore for non-empty files, because an active swapfile must be non-empty and can't be truncated. We also don't have to worry about autodefrag because it's only done on COW files. Truncate and fallocate are already taken care of by the generic code. Device add doesn't do relocation so it's not an issue, either. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove extent_io_ops::split_extent_hook callbackNikolay Borisov2018-12-174-20/+6Star
| | | | | | | | | | | | | | | | | | | | | | This is the counterpart to merge_extent_hook, similarly, it's used only for data/freespace inodes so let's remove it, rename it and call it directly where necessary. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove extent_io_ops::merge_extent_hook callbackNikolay Borisov2018-12-174-22/+16Star
| | | | | | | | | | | | | | | | | | | | | | | | This callback is used only for data and free space inodes. Such inodes are guaranteed to have their extent_io_tree::private_data set to the inode struct. Exploit this fact to directly call the function. Also give it a more descriptive name. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove extent_io_ops::clear_bit_hook callbackNikolay Borisov2018-12-174-18/+12Star
| | | | | | | | | | | | | | | | | | | | | | This is the counterpart to ex-set_bit_hook (now btrfs_set_delalloc_extent), similar to what was done before remove clear_bit_hook and rename the function. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove extent_io_ops::set_bit_hook extent_io callbackNikolay Borisov2018-12-175-19/+10Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This callback is used to properly account delalloc extents for data inodes (ordinary file inodes and freespace v1 inodes). Those can be easily identified since they have their extent_io trees ->private_data member point to the inode. Let's exploit this fact to remove the needless indirection through extent_io_hooks and directly call the function. Also give the function a name which reflects its purpose - btrfs_set_delalloc_extent. This patch also modified test_find_delalloc so that the extent_io_tree used for testing doesn't have its ->private_data set which would have caused a crash in btrfs_set_delalloc_extent due to the btrfs_inode->root member not being initialised. The old version of the code also didn't call set_bit_hook since the extent_io ops weren't set for the inode. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove extent_io_ops::check_extent_io_range callbackNikolay Borisov2018-12-173-20/+12Star
| | | | | | | | | | | | | | | | | | | | | | | | | | This callback was only used in debug builds by btrfs_leak_debug_check. A better approach is to move its implementation in btrfs_leak_debug_check and ensure the latter is only executed for extent tree which have ->private_data set i.e. relate to a data node and not the btree one. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove extent_io_ops::writepage_end_io_hookNikolay Borisov2018-12-175-36/+21Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This callback is ony ever called for data page writeout so there is no need to actually abstract it via extent_io_ops. Lets just export it, remove the definition of the callback and call it directly in the functions that invoke the callback. Also rename the function to btrfs_writepage_endio_finish_ordered since what it really does is account finished io in the ordered extent data structures. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove extent_io_ops::writepage_start_hookNikolay Borisov2018-12-174-16/+12Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This hook is called only from __extent_writepage_io which is already called only from the data page writeout path. So there is no need to make an indirect call via extent_io_ops. This patch just removes the callback definition, exports the callback function and calls it directly at the only call site. Also give the function a more descriptive name. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove extent_io_ops::fill_delallocNikolay Borisov2018-12-174-24/+19Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This callback is called only from writepage_delalloc which in turn is guaranteed to be called from the data page writeout path. In the end there is no reason to have the call to this function to be indrected via the extent_io_ops structure. This patch removes the callback definition, exports the function and calls it directly. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename to btrfs_run_delalloc_range ] Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Add function to distinguish between data and btree inodeNikolay Borisov2018-12-171-0/+5
| | | | | | | | | | | | | | | | | | This will be used in future patches that remove the optional extent_io_ops callbacks. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: volumes: Make sure no dev extent is beyond device boundaryQu Wenruo2018-12-171-0/+17
| | | | | | | | | | | | | | | | Add extra dev extent end check against device boundary. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: volumes: Make sure there is no overlap of dev extents at mount timeQu Wenruo2018-12-171-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enhance btrfs_verify_dev_extents() to remember previous checked dev extents, so it can verify no dev extents can overlap. Analysis from Hans: "Imagine allocating a DATA|DUP chunk. In the chunk allocator, we first set... max_stripe_size = SZ_1G; max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE ... which is 10GiB. Then... /* we don't want a chunk larger than 10% of writeable space */ max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1), max_chunk_size); Imagine we only have one 7880MiB block device in this filesystem. Now max_chunk_size is down to 788MiB. The next step in the code is to search for max_stripe_size * dev_stripes amount of free space on the device, which is in our example 1GiB * 2 = 2GiB. Imagine the device has exactly 1578MiB free in one contiguous piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail Next we recalculate the stripe_size (which is actually the device extent length), based on the actual maximum amount of available raw disk space: stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes); stripe_size is now 789MiB Next we do... data_stripes = num_stripes / ncopies ...where data_stripes ends up as 1, because num_stripes is 2 (the amount of device extents we're going to have), and DUP has ncopies 2. Next there's a check... if (stripe_size * data_stripes > max_chunk_size) ...which matches because 789MiB * 1 > 788MiB. We go into the if code, and next is... stripe_size = div_u64(max_chunk_size, data_stripes); ...which resets stripe_size to max_chunk_size: 788MiB Next is a fun one... /* bump the answer up to a 16MB boundary */ stripe_size = round_up(stripe_size, SZ_16M); ...which changes stripe_size from 788MiB to 800MiB. We're not done changing stripe_size yet... /* But don't go higher than the limits we found while searching * for free extents */ stripe_size = min(devices_info[ndevs - 1].max_avail, stripe_size); This is bad. max_avail is twice the stripe_size (we need to fit 2 device extents on the same device for DUP). The result here is that 800MiB < 1578MiB, so it's unchanged. However, the resulting DUP chunk will need 1600MiB disk space, which isn't there, and the second dev_extent might extend into the next thing (next dev_extent? end of device?) for 22MiB. The last shown line of code relies on a situation where there's twice the value of stripe_size present as value for the variable stripe_size when it's DUP. This was actually the case before commit 92e222df7b "btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote: "[...] in the meantime there's a check to see if the stripe_size does not exceed max_chunk_size. Since during this check stripe_size is twice the amount as intended, the check will reduce the stripe_size to max_chunk_size if the actual correct to be used stripe_size is more than half the amount of max_chunk_size." In the previous version of the code, the 16MiB alignment (why is this done, by the way?) would result in a 50% chance that it would actually do an 8MiB alignment for the individual dev_extents, since it was operating on double the size. Does this matter? Does it matter that stripe_size can be set to anything which is not 16MiB aligned because of the amount of remaining available disk space which is just taken? What is the main purpose of this round_up? The most straightforward thing to do seems something like... stripe_size = min( div_u64(devices_info[ndevs - 1].max_avail, dev_stripes), stripe_size ) ..just putting half of the max_avail into stripe_size." Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/ Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [ add analysis from report ] Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Refactor find_free_extent loops update into find_free_extent_update_loopQu Wenruo2018-12-171-100/+131
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have a complex loop design for find_free_extent(), that has different behavior for each loop, some even includes new chunk allocation. Instead of putting such a long code into find_free_extent() and makes it harder to read, just extract them into find_free_extent_update_loop(). With all the cleanups, the main find_free_extent() should be pretty barebone: find_free_extent() |- Iterate through all block groups | |- Get a valid block group | |- Try to do clustered allocation in that block group | |- Try to do unclustered allocation in that block group | |- Check if the result is valid | | |- If valid, then exit | |- Jump to next block group | |- Push harder to find free extents |- If not found, re-iterate all block groups Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> [ copy callchain from changelog to function comment ] Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Refactor unclustered extent allocation into ↵Qu Wenruo2018-12-171-46/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | find_free_extent_unclustered() This patch will extract unclsutered extent allocation code into find_free_extent_unclustered(). And this helper function will use return value to indicate what to do next. This should make find_free_extent() a little easier to read. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> [Update merge conflict with fb5c39d7a887 ("btrfs: don't use ctl->free_space for max_extent_size")] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Refactor clustered extent allocation into find_free_extent_clusteredQu Wenruo2018-12-171-123/+122Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have two main methods to find free extents inside a block group: 1) clustered allocation 2) unclustered allocation This patch will extract the clustered allocation into find_free_extent_clustered() to make it a little easier to read. Instead of jumping between different labels in find_free_extent(), the helper function will use return value to indicate different behavior. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Introduce find_free_extent_ctl structure for later reworkQu Wenruo2018-12-171-95/+170
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of tons of different local variables in find_free_extent(), extract them into find_free_extent_ctl structure, and add better explanation for them. Some modification may looks redundant, but will later greatly simplify function parameter list during find_free_extent() refactor. Also add two comments to co-operate with fb5c39d7a887 ("btrfs: don't use ctl->free_space for max_extent_size"), to make ffe_ctl->max_extent_size update more reader-friendly. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: extent-tree: Detect bytes_pinned underflow earlierLu Fengqi2018-12-171-4/+5
| | | | | | | | | | | | | | | | | | | | Introduce a new wrapper update_bytes_pinned to replace open coded bytes_pinned modifiers. Now the underflows of space_info::bytes_pinned get detected and reported. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: extent-tree: Detect bytes_may_use underflow earlierQu Wenruo2018-12-171-16/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although we have space_info::bytes_may_use underflow detection in btrfs_free_reserved_data_space_noquota(), we have more callers who are subtracting number from space_info::bytes_may_use. So instead of doing underflow detection for every caller, introduce a new wrapper update_bytes_may_use() to replace open coded bytes_may_use modifiers. This also introduce a macro to declare more wrappers, but currently space_info::bytes_may_use is the mostly interesting one. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: remove no longer used stuff for tracking pending ordered extentsFilipe Manana2018-12-174-45/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tracking pending ordered extents per transaction was introduced in commit 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3") and later updated in commit 161c3549b45a ("Btrfs: change how we wait for pending ordered extents"). However now that on fsync we always wait for ordered extents to complete before logging, done in commit 5636cf7d6dc8 ("btrfs: remove the logged extents infrastructure"), we no longer need the stuff to track for pending ordered extents, which was not completely removed in the mentioned commit. So remove the remaining of the pending ordered extents infrastructure. Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: remove no longer used logged range variables when logging extentsFilipe Manana2018-12-171-8/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | The logged_start and logged_end variables, at btrfs_log_changed_extents, were added in commit 8c6c592831a0 ("btrfs: log csums for all modified extents"). However since the recent simplification for fsync, which makes us wait for all ordered extents to complete before logging extents, we no longer need those variables. Commit a2120a473a80 ("btrfs: clean up the left over logged_list usage") forgot to remove them. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge tag 'for-4.20-rc5-tag' of ↵Linus Torvalds2018-12-051-5/+3Star
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A patch in 4.19 introduced a sanity check that was too strict and a filesystem cannot be mounted. This happens for filesystems with more than 10 devices and has been reported by a few users so we need the fix to propagate to stable" * tag 'for-4.20-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable