path: root/fs/xfs/xfs_aops.c
* writeback: add wbc_to_write_flags() [Jens Axboe, 2016-11-02, 1 file, -6/+2]
    Add wbc_to_write_flags(), which returns the write modifier flags to use,
    based on a struct writeback_control. No functional changes in this patch,
    but it prepares us for factoring other wbc fields for write type.
    Signed-off-by: Jens Axboe <axboe@fb.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
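    The helper itself is tiny; a minimal sketch of its shape (the exact set
    of flags returned has grown in later kernels):

        static inline int wbc_to_write_flags(struct writeback_control *wbc)
        {
            if (wbc->sync_mode == WB_SYNC_ALL)
                return REQ_SYNC;    /* sync writeback: tag the writes */

            return 0;
        }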
* block,fs: use REQ_* flags directly [Christoph Hellwig, 2016-11-01, 1 file, -4/+7]
    Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
    directly. Where applicable this also drops usage of the bio_set_op_attrs
    wrapper.
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* xfs: convert COW blocks to real blocks before unwritten extent conversion [Christoph Hellwig, 2016-10-11, 1 file, -2/+2]
    We need to splice COW blocks we've completed in xfs_end_io_direct_write
    into the data fork before converting unwritten extents. Otherwise
    xfs_bmapi_write might first allocate blocks for any holes in the data
    fork, which is not only unnecessary but also harmful, as it might cause
    reserved block underruns in the transaction.
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
* xfs: implement CoW for directio writes [Darrick J. Wong, 2016-10-06, 1 file, -7/+83]
    For O_DIRECT writes to shared blocks, we have to CoW them just like we
    would with buffered writes. For writes that are not block-aligned, just
    bounce them to the page cache. For block-aligned writes, however, we can
    do better than that. Use the same mechanisms that we employ for buffered
    CoW to set up a delalloc reservation, allocate all the blocks at once,
    issue the writes against the new blocks and use the same ioend functions
    to remap the blocks after the write. This should be fairly performant.

    Christoph discovered that xfs_reflink_allocate_cow_range may stumble
    over invalid entries in the extent array given that it drops the ilock
    but still expects the index to be stable. Simply redoing the lookup on
    every iteration still isn't correct, given that xfs_bmapi_allocate will
    trigger a BUG_ON() if hitting a hole, and there is nothing preventing an
    xfs_bunmapi_cow call from removing extents once we dropped the ilock
    either.

    This patch duplicates the inner loop of xfs_bmapi_allocate into a helper
    for xfs_reflink_allocate_cow_range so that it can be done under the same
    ilock critical section as our CoW fork delayed allocation. The directio
    CoW warts will be revisited in a later patch.
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
* xfs: report shared extent mappings to userspace correctly [Darrick J. Wong, 2016-10-06, 1 file, -0/+11]
    Report shared extents through the iomap interface so that FIEMAP flags
    shared blocks accurately. Have xfs_vm_bmap return zero for reflinked
    files because the bmap-based swap code requires static block mappings,
    which is incompatible with copy on write.

    NOTE: Existing userspace bmap users such as lilo will have the same
    problem with reflink files.
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
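    The bmap change amounts to an early bail-out; a condensed sketch,
    assuming the xfs_is_reflink_inode() helper introduced by the reflink
    series (locking elided):

        STATIC sector_t
        xfs_vm_bmap(
            struct address_space    *mapping,
            sector_t                block)
        {
            struct xfs_inode        *ip = XFS_I(mapping->host);

            /*
             * The swap code (ab-)uses ->bmap to get a block mapping and
             * then bypasses the filesystem for the actual I/O; that can
             * never be correct for copy-on-write blocks, so refuse it.
             */
            if (xfs_is_reflink_inode(ip))
                return 0;

            filemap_write_and_wait(mapping);
            return generic_block_bmap(mapping, block, xfs_get_blocks);
        }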
* xfs: move mappings from cow fork to data fork after copy-write [Darrick J. Wong, 2016-10-05, 1 file, -2/+22]
    After the write component of a copy-write operation finishes, clean up
    the bookkeeping left behind. On error, we simply free the new blocks and
    pass the error up. If we succeed, however, then we must remove the old
    data fork mapping and move the cow fork mapping to the data fork.
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    [hch: Call the CoW failure function during xfs_cancel_ioend]
    Signed-off-by: Christoph Hellwig <hch@lst.de>
* xfs: allocate delayed extents in CoW fork [Darrick J. Wong, 2016-10-05, 1 file, -18/+79]
    Modify the writepage handler to find and convert pending delalloc
    extents to real allocations. Furthermore, when we're doing non-cow
    writes to a part of a file that already has a CoW reservation (the
    cowextsz hint that we set up in a subsequent patch facilitates this),
    promote the write to copy-on-write so that the entire extent can get
    written out as a single extent on disk, thereby reducing post-CoW
    fragmentation.

    Christoph moved the CoW support code in _map_blocks to a separate helper
    function, refactored other functions, and reduced the number of CoW fork
    lookups, so I merged those changes here to reduce churn.
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
* xfs: support allocating delayed extents in CoW fork [Darrick J. Wong, 2016-10-05, 1 file, -2/+4]
    Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed
    allocation extents in the CoW fork to real allocations, and wire this up
    all the way back to xfs_iomap_write_allocate(). In a subsequent patch,
    we'll modify the writepage handler to call this.
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
* xfs: refactor xfs_setfilesize [Christoph Hellwig, 2016-09-19, 1 file, -10/+21]
    Rename the current function to __xfs_setfilesize and add a non-static
    wrapper that also takes care of creating the transaction. This new
    helper will be used by the new iomap-based DAX path.
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
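    A sketch of the resulting split (reservation name per the surrounding
    code of this era): __xfs_setfilesize() keeps the existing logic and
    takes a caller-supplied transaction, while the wrapper owns transaction
    allocation:

        int
        xfs_setfilesize(
            struct xfs_inode    *ip,
            xfs_off_t           offset,
            size_t              size)
        {
            struct xfs_mount    *mp = ip->i_mount;
            struct xfs_trans    *tp;
            int                 error;

            /* allocate the transaction so iomap/DAX callers don't have to */
            error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
            if (error)
                return error;

            return __xfs_setfilesize(ip, tp, offset, size);
        }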
* Merge tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs [Linus Torvalds, 2016-07-27, 1 file, -284/+48]
    Pull xfs updates from Dave Chinner:
    "The major addition is the new iomap based block mapping infrastructure.
    We've been kicking this about locally for years, but there are other
    filesystems that want to use it too (e.g. gfs2). Now it is fully
    working, reviewed and ready for merge and to be used by other
    filesystems.

    There are a lot of other fixes and cleanups in the tree, but those are
    XFS internal things and none are of the scale or visibility of the iomap
    changes. See below for details.

    I am likely to send another pull request next week - we're just about
    ready to merge some new functionality (on disk block->owner reverse
    mapping infrastructure), but that's a huge chunk of code (74 files
    changed, 7283 insertions(+), 1114 deletions(-)) so I'm keeping that
    separate to all the "normal" pull request changes so they don't get
    lost in the noise.

    Summary of changes in this update:
    - generic iomap based IO path infrastructure
    - generic iomap based fiemap implementation
    - xfs iomap based IO path implementation
    - buffer error handling fixes
    - tracking of in flight buffer IO for unmount serialisation
    - direct IO and DAX io path separation and simplification
    - shortform directory format definition changes for wider platform
      compatibility
    - various buffer cache fixes
    - cleanups in preparation for rmap merge
    - error injection cleanups and fixes
    - log item format buffer memory allocation restructuring to prevent
      rare OOM reclaim deadlocks
    - sparse inode chunks are now fully supported"

    * tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (53 commits)
      xfs: remove EXPERIMENTAL tag from sparse inode feature
      xfs: bufferhead chains are invalid after end_page_writeback
      xfs: allocate log vector buffers outside CIL context lock
      libxfs: directory node splitting does not have an extra block
      xfs: remove dax code from object file when disabled
      xfs: skip dirty pages in ->releasepage()
      xfs: remove __arch_pack
      xfs: kill xfs_dir2_inou_t
      xfs: kill xfs_dir2_sf_off_t
      xfs: split direct I/O and DAX path
      xfs: direct calls in the direct I/O path
      xfs: stop using generic_file_read_iter for direct I/O
      xfs: split xfs_file_read_iter into buffered and direct I/O helpers
      xfs: remove s_maxbytes enforcement in xfs_file_read_iter
      xfs: kill ioflags
      xfs: don't pass ioflags around in the ioctl path
      xfs: track and serialize in-flight async buffers against unmount
      xfs: exclude never-released buffers from buftarg I/O accounting
      xfs: don't reset b_retries to 0 on every failure
      xfs: remove extraneous buffer flag changes
      ...
| * Merge branch 'xfs-4.8-misc-fixes-4' into for-next [Dave Chinner, 2016-07-22, 1 file, -3/+26]
| |\
| | * xfs: bufferhead chains are invalid after end_page_writeback [Dave Chinner, 2016-07-22, 1 file, -3/+12]
    In xfs_finish_page_writeback(), we have a loop that looks like this:

        do {
            if (off < bvec->bv_offset)
                goto next_bh;
            if (off > end)
                break;
            bh->b_end_io(bh, !error);
        next_bh:
            off += bh->b_size;
        } while ((bh = bh->b_this_page) != head);

    The b_end_io function is end_buffer_async_write(), which will call
    end_page_writeback() once all the buffers have been marked as no longer
    under IO. The issue here is that the only thing currently protecting
    both the bufferhead chain and the page from being reclaimed is the
    PageWriteback state held on the page.

    While we attempt to limit the loop to just the buffers covered by the
    IO, we still read from the buffer size and follow the next pointer in
    the bufferhead chain. There is no guarantee that either of these are
    valid after the PageWriteback flag has been cleared. Hence, loops like
    this are completely unsafe, and result in use-after-free issues. One
    such problem was caught by Calvin Owens with KASAN:

        .....
        INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
         free_buffer_head+0x41/0x90
         __slab_free+0x1ed/0x340
         kmem_cache_free+0x270/0x300
         free_buffer_head+0x41/0x90
         try_to_free_buffers+0x171/0x240
         xfs_vm_releasepage+0xcb/0x3b0
         try_to_release_page+0x106/0x190
         shrink_page_list+0x118e/0x1a10
         shrink_inactive_list+0x42c/0xdf0
         shrink_zone_memcg+0xa09/0xfa0
         shrink_zone+0x2c3/0xbc0
        .....
        Call Trace:
        <IRQ>
         [<ffffffff81e8b8e4>] dump_stack+0x68/0x94
         [<ffffffff8153a995>] print_trailer+0x115/0x1a0
         [<ffffffff81541174>] object_err+0x34/0x40
         [<ffffffff815436e7>] kasan_report_error+0x217/0x530
         [<ffffffff81543b33>] __asan_report_load8_noabort+0x43/0x50
         [<ffffffff819d651f>] xfs_destroy_ioend+0x3bf/0x4c0
         [<ffffffff819d69d4>] xfs_end_bio+0x154/0x220
         [<ffffffff81de0c58>] bio_endio+0x158/0x1b0
         [<ffffffff81dff61b>] blk_update_request+0x18b/0xb80
         [<ffffffff821baf57>] scsi_end_request+0x97/0x5a0
         [<ffffffff821c5558>] scsi_io_completion+0x438/0x1690
         [<ffffffff821a8d95>] scsi_finish_command+0x375/0x4e0
         [<ffffffff821c3940>] scsi_softirq_done+0x280/0x340

    where the access is occurring during IO completion after the buffer had
    been freed by direct memory reclaim.

    Prevent use-after-free accidents in this end_io processing loop by
    pre-calculating the loop conditionals before calling bh->b_end_io(). The
    loop is already limited to just the bufferheads covered by the IO in
    progress, so the offset checks are sufficient to prevent accessing
    buffers in the chain after end_page_writeback() has been called by the
    bh->b_end_io() callout.

    Yet another example of why Bufferheads Must Die.
    cc: <stable@vger.kernel.org> # 4.7
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reported-and-Tested-by: Calvin Owens <calvinowens@fb.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
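    The fix caches everything the loop reads from the bufferheads before
    calling b_end_io(); roughly:

        /*
         * Grab b_size and b_this_page up front: once bh->b_end_io() runs,
         * end_page_writeback() may have been called and both the page and
         * its bufferhead chain can be freed out from under us.
         */
        bsize = bh->b_size;
        do {
            next = bh->b_this_page;
            if (off < bvec->bv_offset)
                goto next_bh;
            if (off > end)
                break;
            bh->b_end_io(bh, !error);
        next_bh:
            off += bsize;
        } while ((bh = next) != head);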
| | * xfs: skip dirty pages in ->releasepage() [Brian Foster, 2016-07-22, 1 file, -0/+14]
    XFS has had scattered reports of delalloc blocks present at
    ->releasepage() time. This results in a warning with a stack trace
    similar to the following:

        ...
        Call Trace:
         [<ffffffffa23c5b8f>] dump_stack+0x63/0x84
         [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0
         [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20
         [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140
         [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0
         [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150
         [<ffffffffa21521c2>] try_to_release_page+0x32/0x50
         [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0
         [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0
         [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0
         [<ffffffffa2168539>] kswapd+0x4f9/0x970
         [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
         [<ffffffffa20a0d99>] kthread+0xc9/0xe0
         [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
         [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70
         [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100

    This occurs because it is possible for shrink_active_list() to send
    pages marked dirty to ->releasepage() when certain buffer_head threshold
    conditions are met. shrink_active_list() apparently doesn't check the
    page dirty state in order to handle an old ext3 corner case where in
    some cases clean pages would not have the dirty bit cleared; thus it is
    up to the filesystem to determine how to handle the page.

    XFS currently handles the delalloc case properly, but this behavior
    makes the warning spurious. Update the XFS ->releasepage() handler to
    explicitly skip dirty pages. Retain the existing delalloc/unwritten
    checks so we continue to warn if such buffers exist on clean pages when
    they shouldn't.
    Diagnosed-by: Dave Chinner <david@fromorbit.com>
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
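    The change itself is a short filter at the top of xfs_vm_releasepage();
    a sketch:

        /*
         * shrink_active_list() can hand us pages that are still dirty
         * (an old ext3 accommodation in mm), so quietly decline to
         * release them instead of tripping the buffer state warnings.
         */
        if (PageDirty(page))
            return 0;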
| * | Merge branch 'xfs-4.8-split-dax-dio' into for-next [Dave Chinner, 2016-07-20, 1 file, -19/+5]
| |\|
| | * xfs: direct calls in the direct I/O path [Christoph Hellwig, 2016-07-20, 1 file, -19/+5]
    We control both the callers and callees of ->direct_IO, so remove the
    indirect calls.
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | xfs: remove buffered write support from __xfs_get_blocks [Christoph Hellwig, 2016-06-21, 1 file, -52/+19]
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Bob Peterson <rpeterso@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | xfs: implement iomap based buffered write path [Christoph Hellwig, 2016-06-21, 1 file, -212/+0]
    Convert XFS to use the new iomap based multipage write path. This
    involves implementing the ->iomap_begin and ->iomap_end methods, and
    switching the buffered file write, page_mkwrite and xfs_iozero paths to
    the new iomap helpers. With this change __xfs_get_blocks will never be
    used for buffered writes, and the code handling them can be removed.

    Based on earlier code from Dave Chinner.
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Bob Peterson <rpeterso@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
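    The methods are wired up through an ops table that the generic iomap
    helpers consume; a sketch of the shape (function names as merged for
    4.8):

        struct iomap_ops xfs_iomap_ops = {
            .iomap_begin = xfs_file_iomap_begin, /* lock, map or reserve blocks */
            .iomap_end   = xfs_file_iomap_end,   /* undo unused delalloc on short writes */
        };

        /* the buffered write path then becomes, roughly: */
        ret = iomap_file_buffered_write(iocb, from, &xfs_iomap_ops);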
* | xfs: use bio op accessors [Mike Christie, 2016-06-07, 1 file, -8/+4]
    Separate the op from the rq_flag_bits and have xfs set/get the bio using
    bio_set_op_attrs/bio_op.
    Signed-off-by: Mike Christie <mchristi@redhat.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
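    In the writeback submission path this turns the open-coded flag
    assembly into, roughly:

        /* op and op_flags are kept separate by the accessor */
        bio_set_op_attrs(bio, REQ_OP_WRITE,
                         wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : 0);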
* | block/fs/drivers: remove rw argument from submit_bio [Mike Christie, 2016-06-07, 1 file, -5/+10]
    Have callers of submit_bio/submit_bio_wait set bio->bi_rw instead of
    passing it in. This matches generic_make_request and how we set the
    other bio fields.
    Signed-off-by: Mike Christie <mchristi@redhat.com>
    Fixed up fs/ext4/crypto.c
    Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs [Linus Torvalds, 2016-05-26, 1 file, -177/+176]
    Pull xfs updates from Dave Chinner:
    "A pretty average collection of fixes, cleanups and improvements in this
    request.

    Summary:
    - fixes for mount line parsing, sparse warnings, read-only compat
      feature remount behaviour
    - allow fast path symlink lookups for inline symlinks
    - attribute listing cleanups
    - writeback goes direct to bios rather than indirecting through
      bufferheads
    - transaction allocation cleanup
    - optimised kmem_realloc
    - added configurable error handling for metadata write errors, changed
      default error handling behaviour from "retry forever" to "retry until
      unmount then fail"
    - fixed several inode cluster writeback lookup vs reclaim race
      conditions
    - fixed inode cluster writeback checking wrong inode after lookup
    - fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
    - cleaned up inode reclaim tagging"

    * tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
      xfs: fix warning in xfs_finish_page_writeback for non-debug builds
      xfs: move reclaim tagging functions
      xfs: simplify inode reclaim tagging interfaces
      xfs: rename variables in xfs_iflush_cluster for clarity
      xfs: xfs_iflush_cluster has range issues
      xfs: mark reclaimed inodes invalid earlier
      xfs: xfs_inode_free() isn't RCU safe
      xfs: optimise xfs_iext_destroy
      xfs: skip stale inodes in xfs_iflush_cluster
      xfs: fix inode validity check in xfs_iflush_cluster
      xfs: xfs_iflush_cluster fails to abort on error
      xfs: remove xfs_fs_evict_inode()
      xfs: add "fail at unmount" error handling configuration
      xfs: add configuration handlers for specific errors
      xfs: add configuration of error failure speed
      xfs: introduce table-based init for error behaviors
      xfs: add configurable error support to metadata buffers
      xfs: introduce metadata IO error class
      xfs: configurable error behavior via sysfs
      xfs: buffer ->bi_end_io function requires irq-safe lock
      ...
| * Merge branch 'xfs-4.7-trans-type-cleanup' into for-next [Dave Chinner, 2016-05-20, 1 file, -13/+6]
| |\
| | * xfs: better xfs_trans_alloc interface [Christoph Hellwig, 2016-04-06, 1 file, -13/+6]
    Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
    that returns a transaction with all the required log and block
    reservations, and which allows passing transaction flags directly to
    avoid the cumbersome _xfs_trans_alloc interface.

    While we're at it we also get rid of the transaction type argument that
    has been superfluous since we stopped supporting the non-CIL logging
    mode. The guts of it will be removed in another patch.
    [dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
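    The calling-convention change, sketched with the fsync timestamp
    reservation as an example:

        /* old two-step interface */
        tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
        error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
        if (error) {
            xfs_trans_cancel(tp);
            return error;
        }

        /*
         * new combined interface: log/block reservations taken up front,
         * no transaction type argument
         */
        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
        if (error)
            return error;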
| * | xfs: fix warning in xfs_finish_page_writeback for non-debug builds [Christoph Hellwig, 2016-05-20, 1 file, -3/+2]
    blockmask is unused if ASSERTs are disabled.
    Reported-by: kbuild test robot <fengguang.wu@intel.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | xfs: optimize bio handling in the buffer writeback path [Christoph Hellwig, 2016-04-06, 1 file, -137/+110]
    This patch implements two closely related changes: First it embeds a bio
    in the ioend structure so that we don't have to allocate one separately.
    Second it uses the block layer bio chaining mechanism to chain
    additional bios off this first one if needed instead of manually
    accounting for multiple bio completions in the ioend structure.

    Together this removes a memory allocation per ioend and greatly
    simplifies the ioend setup and I/O completion path.
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
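    A condensed sketch of the two pieces - the bio embedded at the tail of
    the ioend and the chaining step used when a bio fills up (submit_bio()
    shown in its modern single-argument form):

        struct xfs_ioend {
            /* ... bookkeeping fields ... */
            struct bio  *io_bio;        /* bio currently being built */
            struct bio  io_inline_bio;  /* embedded bio, must be last! */
        };

        /* when the current bio is full, chain a new one off it */
        new = bio_alloc(GFP_NOFS, BIO_MAX_PAGES);
        bio_chain(ioend->io_bio, new);  /* completions fan in to the parent */
        bio_get(ioend->io_bio);         /* held until the ioend is destroyed */
        submit_bio(ioend->io_bio);
        ioend->io_bio = new;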
| * | xfs: don't release bios on completion immediately [Dave Chinner, 2016-04-06, 1 file, -27/+68]
    Completion of an ioend requires us to walk the bufferhead list to end
    writeback on all the bufferheads. This, in turn, is needed so that we
    can end writeback on all the pages we just did IO on.

    To remove our dependency on bufferheads in writeback, we need to turn
    this around the other way - we need to walk the pages we've just
    completed IO on, and then walk the buffers attached to the pages and
    complete their IO. In doing this, we remove the requirement for the
    ioend to track bufferheads directly.

    To enable IO completion to walk all the pages we've submitted IO on, we
    need to keep the bios that we used for IO around until the ioend has
    been completed. We can do this simply by chaining the bios to the ioend
    at completion time, and then walking their pages directly just before
    destroying the ioend.
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    [hch: changed the xfs_finish_page_writeback calling convention]
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | xfs: build bios directly in xfs_add_to_ioend [Dave Chinner, 2016-04-06, 1 file, -38/+31]
    Currently adding a buffer to the ioend and then building a bio from the
    buffer list are two separate operations. We don't build the bios and
    submit them until the ioend is submitted, and this places a fixed
    dependency on bufferhead chaining in the ioend.

    The first step to removing the bufferhead chaining in the ioend is on
    the IO submission side. We can build the bio directly as we add the
    buffers to the ioend chain, thereby removing the need for a later
    "buffer-to-bio" submission loop. This allows us to submit bios on large
    ioends as soon as we cannot add more data to the bio.

    These bios then get captured by the active plug, and hence will be
    dispatched as soon as either the plug overflows or we schedule away from
    the writeback context. This will reduce submission latency for large
    IOs, but will also allow more timely request queue based writeback
    blocking when the device becomes congested.
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    [hch: various small updates]
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
* | direct-io: eliminate the offset argument to ->direct_IO [Christoph Hellwig, 2016-05-02, 1 file, -4/+3]
    Including blkdev_direct_IO and dax_do_io. It has to be ki_pos to
    actually work, so eliminate the superfluous argument.
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
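    The prototype change in struct address_space_operations; implementations
    now read the position from iocb->ki_pos:

        /* before */
        ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
        /* after */
        ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);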
* | mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage [Kirill A. Shutemov, 2016-04-04, 1 file, -2/+2]
    Mostly direct substitution with occasional adjustment or removing
    outdated comments.
    Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros [Kirill A. Shutemov, 2016-04-04, 1 file, -9/+9]
    PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement page
    cache with bigger chunks than PAGE_SIZE. This promise never
    materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable. Let's stop pretending that pages in page cache
    are special. They are not.

    The changes are pretty straight-forward:
    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
    - page_cache_get() -> get_page();
    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files. I've called spatch for them manually. The only adjustment after
    coccinelle is revert of changes to the PAGE_CACHE_ALIGN definition: we
    are going to drop it later.

    There are a few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation also
    will be addressed with a separate patch.

        virtual patch

        @@
        expression E;
        @@
        - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
        + E

        @@
        expression E;
        @@
        - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
        + E

        @@
        @@
        - PAGE_CACHE_SHIFT
        + PAGE_SHIFT

        @@
        @@
        - PAGE_CACHE_SIZE
        + PAGE_SIZE

        @@
        @@
        - PAGE_CACHE_MASK
        + PAGE_MASK

        @@
        expression E;
        @@
        - PAGE_CACHE_ALIGN(E)
        + PAGE_ALIGN(E)

        @@
        expression E;
        @@
        - page_cache_get(E)
        + get_page(E)

        @@
        expression E;
        @@
        - page_cache_release(E)
        + put_page(E)

    Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs [Linus Torvalds, 2016-03-21, 1 file, -639/+377]
    Pull xfs updates from Dave Chinner:
    "There's quite a lot in this request, and there's some cross-over with
    ext4, dax and quota code due to the nature of the changes being made.

    As for the rest of the XFS changes, there are lots of little things all
    over the place, which add up to a lot of changes in the end. The major
    changes are that we've reduced the size of the struct xfs_inode by ~100
    bytes (gives an inode cache footprint reduction of >10%), the writepage
    code now only does a single set of mapping tree lookups so uses less
    CPU, delayed allocation reservations won't overrun under random write
    loads anymore, and we added compile time verification for on-disk
    structure sizes so we find out when a commit or platform/compiler change
    breaks the on disk structure as early as possible.

    Change summary:
    - error propagation for direct IO failures fixes for both XFS and ext4
    - new quota interfaces and XFS implementation for iterating all the
      quota IDs in the filesystem
    - locking fixes for real-time device extent allocation
    - reduction of duplicate information in the xfs and vfs inode, saving
      roughly 100 bytes of memory per cached inode
    - buffer flag cleanup
    - rework of the writepage code to use the generic write clustering
      mechanisms
    - several fixes for inode flag based DAX enablement
    - rework of remount option parsing
    - compile time verification of on-disk format structure sizes
    - delayed allocation reservation overrun fixes
    - lots of little error handling fixes
    - small memory leak fixes
    - enable xfsaild freezing again"

    * tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
      xfs: always set rvalp in xfs_dir2_node_trim_free
      xfs: ensure committed is initialized in xfs_trans_roll
      xfs: borrow indirect blocks from freed extent when available
      xfs: refactor delalloc indlen reservation split into helper
      xfs: update freeblocks counter after extent deletion
      xfs: debug mode forced buffered write failure
      xfs: remove impossible condition
      xfs: check sizes of XFS on-disk structures at compile time
      xfs: ioends require logically contiguous file offsets
      xfs: use named array initializers for log item dumping
      xfs: fix computation of inode btree maxlevels
      xfs: reinitialise per-AG structures if geometry changes during recovery
      xfs: remove xfs_trans_get_block_res
      xfs: fix up inode32/64 (re)mount handling
      xfs: fix format specifier, should be %llx and not %llu
      xfs: sanitize remount options
      xfs: convert mount option parsing to tokens
      xfs: fix two memory leaks in xfs_attr_list.c error paths
      xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
      xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
      ...
| * Merge branch 'xfs-misc-fixes-4.6-4' into for-next [Dave Chinner, 2016-03-15, 1 file, -1/+8]
| |\
| | * xfs: debug mode forced buffered write failure [Brian Foster, 2016-03-15, 1 file, -1/+8]
    Add a DEBUG mode-only sysfs knob to enable forced buffered write
    failure. An additional side effect of this mode is brute force killing
    of delayed allocation blocks in the range of the write. The latter is
    the prime motivation behind this patch, as userspace test infrastructure
    requires a reliable mechanism to create and split delalloc extents
    without causing extent conversion.

    Certain fallocate operations (i.e., zero range) were used for this in
    the past, but the implementations have changed such that delalloc
    extents are flushed and converted to real blocks, rendering the test
    useless.
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | Merge branch 'xfs-writepage-rework-4.6' into for-next [Dave Chinner, 2016-03-06, 1 file, -466/+267]
| |\ \
| | * | xfs: ioends require logically contiguous file offsets [Darrick J. Wong, 2016-03-06, 1 file, -1/+2]
    We need to create a new ioend if the current writepage call isn't
    logically contiguous with the range contained in the previous ioend.
    Hopefully writepage gets called in order of increasing file offset.
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: don't chain ioends during writepage submission [Dave Chinner, 2016-02-15, 1 file, -123/+118]
    Currently we can build a long ioend chain during ->writepages that gets
    attached to the writepage context. IO submission only then occurs when
    we finish all the writepage processing. This means we can have many
    ioends allocated and pending, and this violates the mempool guarantees
    that we need to give about forwards progress. i.e. we really should only
    have one ioend being built at a time, otherwise we may drain the mempool
    trying to allocate a new ioend and that blocks submission, completion
    and freeing of ioends that are already in progress.

    To prevent this situation from happening, we need to submit ioends for
    IO as soon as they are ready for dispatch rather than queuing them for
    later submission. This means the ioends have bios built immediately and
    they get queued on any plug that is currently active. Hence if we
    schedule away from writeback, the ioends that have been built will make
    forwards progress due to the plug flushing on context switch. This will
    also prevent context switches from creating unnecessary IO submission
    latency.

    We can't completely avoid having nested IO allocation - when we have a
    block size smaller than a page size, we still need to hold the ioend
    submission until after we have marked the current page dirty. Hence we
    may need multiple ioends to be held while the current page is completely
    mapped and made ready for IO dispatch. We cannot avoid this problem -
    the current code already has this ioend chaining within a page so we can
    mostly ignore that it occurs.
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: factor mapping out of xfs_do_writepage [Dave Chinner, 2016-02-15, 1 file, -110/+122]
    Separate out the bufferhead based mapping from the writepage code so
    that we have a clear separation of the page operations and the
    bufferhead state.
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: xfs_cluster_write is redundant [Dave Chinner, 2016-02-15, 1 file, -208/+6]
    xfs_cluster_write() is not necessary now that xfs_vm_writepages()
    aggregates writepage calls across a single mapping. This means we no
    longer need to do page lookups in xfs_cluster_write, so writeback only
    needs to look up the page cache once per page being written. This also
    removes a large amount of mostly duplicate code between
    xfs_do_writepage() and xfs_convert_page().
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: Introduce writeback context for writepages [Dave Chinner, 2016-02-15, 1 file, -105/+115]
    xfs_vm_writepages() calls generic_writepages to writeback a range of a
    file, but then xfs_vm_writepage() clusters pages itself as it does not
    have any context it can pass between ->writepage calls from
    write_cache_pages().

    Introduce a writeback context for xfs_vm_writepages() and call
    write_cache_pages directly with our own writepage callback so that we
    can pass that context to each writepage invocation. This encapsulates
    the current mapping, whether it is valid or not, the current ioend and
    its IO type, and the ioend chain being built.

    This requires us to move the ioend submission up to the level where the
    writepage context is declared. This does mean we do not submit IO until
    we have packaged the entire writeback range, but with the block plugging
    in the writepages call this is the way IO is submitted anyway.

    It also means that we need to handle discontiguous page ranges. If the
    pages sent down by write_cache_pages to the writepage callback are
    discontiguous, we need to detect this and put each discontiguous page
    range into individual ioends. This is needed to ensure that the ioend
    accurately represents the range of the file that it covers so that file
    size updates during IO completion set the size correctly. Failure to
    take into account the discontiguous ranges results in files being too
    small when writeback patterns are non-sequential.
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
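    A sketch of the context and the driver loop (field names as in the
    merged 4.6 code):

        struct xfs_writepage_ctx {
            struct xfs_bmbt_irec  imap;        /* current extent mapping */
            bool                  imap_valid;
            unsigned int          io_type;
            struct xfs_ioend      *ioend;      /* ioend chain being built */
            sector_t              last_block;
        };

        STATIC int
        xfs_vm_writepages(
            struct address_space        *mapping,
            struct writeback_control    *wbc)
        {
            struct xfs_writepage_ctx wpc = { .io_type = XFS_IO_INVALID, };
            int ret;

            ret = write_cache_pages(mapping, wbc, xfs_do_writepage, &wpc);
            if (wpc.ioend)
                ret = xfs_submit_ioend(wbc, wpc.ioend, ret);
            return ret;
        }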
| | * | xfs: remove xfs_cancel_ioend [Dave Chinner, 2016-02-15, 1 file, -47/+46]
    We currently have code to cancel ioends being built because we change
    bufferhead state as we build the ioend. On error, this needs to be
    unwound and so we have cancelling code that walks the buffers on the
    ioend chain and undoes these state changes.

    However, the IO submission path already handles state changes for
    buffers when a submission error occurs, so we don't really need a
    separate cancel function to do this - we can simply submit the ioend
    chain with the specific error and it will be cancelled rather than
    submitted. Hence we can remove the explicit cancel code and just rely on
    submission to deal with the error correctly.
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: remove nonblocking mode from xfs_vm_writepage [Dave Chinner, 2016-02-15, 1 file, -17/+3]
    Remove the nonblocking optimisation done for mapping lookups during
    writeback. It's not clear that leaving a hole in the writeback range
    just because we couldn't get a lock is really a win, as it makes us do
    another small random IO later on rather than a large sequential IO now.

    As this gets in the way of sane error handling later on, just remove it
    for the moment and we can re-introduce an equivalent optimisation in
    future if we see problems due to extent map lock contention.
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | Merge branch 'xfs-misc-fixes-4.6' into for-next [Dave Chinner, 2016-03-06, 1 file, -6/+17]
| |\ \
| | * | xfs: fix xfs_log_ticket leak in xfs_end_io() after fs shutdown [Brian Foster, 2016-02-08, 1 file, -3/+5]
    If the filesystem has shut down, xfs_end_io() currently sets an error on
    the ioend and proceeds to ioend destruction. The ioend might contain a
    truncate transaction if the I/O extended the size of the file. This
    transaction is only cleaned up in xfs_setfilesize_ioend(), however,
    which is skipped in this case. This results in an xfs_log_ticket leak
    message when the associated cache slab is destroyed (e.g., on rmmod).

    This was originally reproduced by xfs/141 on a distro kernel. The
    problem is reproducible on an upstream kernel, but not easily detected
    in current upstream if the xfs_log_ticket cache happens to be merged
    with another cache. This can be reproduced more deterministically with
    the 'slab_nomerge' kernel boot option.

    Update xfs_end_io() to proceed with normal end I/O processing after an
    error is set on an ioend due to fs shutdown. The I/O type-based
    processing is already designed to handle an I/O error and ensure that
    the ioend is cleaned up correctly.
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * | xfs: clean up unwritten buffers on write failure [Brian Foster, 2016-02-08, 1 file, -3/+12]
    The xfs_vm_write_failed() handler is currently responsible for cleaning
    up any delalloc blocks over the range of a failed write beyond EOF.
    Failure to do so results in warning messages and other inconsistencies
    between buffer and extent state. The ->releasepage() handler currently
    warns in the event of a page being released with either unwritten or
    delalloc buffers, as neither is ever expected by the time a page is
    released.

    As has been reproduced by generic/083 on a -bsize=1k fs, it is currently
    possible to trigger the ->releasepage() warning for a page with
    unwritten buffers when a filesystem is near ENOSPC. This is reproduced
    by the following sequence:

        $ mkfs.xfs -f -b size=1k -d size=100m <dev>
        $ mount <dev> /mnt/
        $
        $ xfs_io -fc "falloc -k 0 1k" /mnt/file
        $ dd if=/dev/zero of=/mnt/enospc conv=notrunc oflag=append
        $
        $ xfs_io -c "pwrite 512 1k" /mnt/file
        $ xfs_io -d -c "pwrite 16k 1k" /mnt/file

    The first pwrite command attempts a block unaligned write across an
    unwritten block and a hole. The delalloc for the hole fails with ENOSPC
    and the subsequent error handling does not clean up the unwritten buffer
    that was instantiated during the first ->get_block() call. The second
    pwrite triggers a warning as part of the inode mapping invalidation that
    occurs prior to direct I/O. The releasepage() handler detects the
    unwritten buffer at this time, warns and prevents the release of the
    page.

    To deal with this problem, update xfs_vm_write_failed() to clean up
    unwritten as well as delalloc buffers that are beyond EOF and within the
    range of the failed write.
    Signed-off-by: Brian Foster <bfoster@redhat.com>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | xfs: fold xfs_vm_do_dio into xfs_vm_direct_IO [Christoph Hellwig, 2016-02-08, 1 file, -22/+14]
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | xfs: don't use ioends for direct write completions [Christoph Hellwig, 2016-02-08, 1 file, -145/+71]
    We only need to communicate two bits of information to the direct I/O
    completion handler:

    (1) do we need to convert any unwritten extents in the range
    (2) do we need to check if we need to update the inode size based on
        the range passed to the completion handler

    We can use the private data passed to the get_block handler and the
    completion handler as a simple bitmask to communicate this information
    instead of the current complicated infrastructure reusing the ioends
    from the buffer I/O path, and thus avoiding a memory allocation and a
    context switch for any non-trivial direct write. As a nice side effect
    we also decouple the direct I/O path implementation from that of the
    buffered I/O path.
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Brian Foster <bfoster@redhat.com>
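    A condensed sketch of the bitmask (flag names as merged; the append
    path's transaction setup is hidden behind the xfs_setfilesize() wrapper
    sketched earlier in this log):

        /* passed from xfs_get_blocks_direct() to the dio completion */
        #define XFS_DIO_FLAG_UNWRITTEN  (1 << 0) /* convert unwritten extents */
        #define XFS_DIO_FLAG_APPEND     (1 << 1) /* i_size may need updating */

        /* in xfs_end_io_direct_write(), roughly: */
        if (flags & XFS_DIO_FLAG_UNWRITTEN)
            error = xfs_iomap_write_unwritten(ip, offset, size);
        else if (flags & XFS_DIO_FLAG_APPEND)
            error = xfs_setfilesize(ip, offset, size);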
| * | direct-io: always call ->end_io if non-NULL [Christoph Hellwig, 2016-02-08, 1 file, -6/+7]
    This way we can pass back errors to the file system, and allow for
    cleanup required for all direct I/O invocations. Also allow the ->end_io
    handlers to return errors on their own, so that I/O completion errors
    can be passed on to the callers.
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Dave Chinner <david@fromorbit.com>
* | mm: simplify lock_page_memcg() [Johannes Weiner, 2016-03-16, 1 file, -4/+3]
    Now that migration doesn't clear page->mem_cgroup of live pages anymore,
    it's safe to make lock_page_memcg() and the memcg stat functions take
    pages, and spare the callers from memcg objects.
    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
    Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
    Cc: Michal Hocko <mhocko@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | mm: memcontrol: generalize locking for the page->mem_cgroup binding [Johannes Weiner, 2016-03-16, 1 file, -4/+4]
    These patches tag the page cache radix tree eviction entries with the
    memcg an evicted page belonged to, thus making per-cgroup LRU reclaim
    work properly and be as adaptive to new cache workingsets as global
    reclaim already is. This should have been part of the original thrash
    detection patch series, but was deferred due to the complexity of those
    patches.

    This patch (of 5):

    So far the only sites that needed to exclude charge migration to
    stabilize page->mem_cgroup have been per-cgroup page statistics, hence
    the name mem_cgroup_begin_page_stat(). But per-cgroup thrash detection
    will add another site that needs to ensure page->mem_cgroup lifetime.

    Rename these locking functions to the more generic lock_page_memcg() and
    unlock_page_memcg(). Since charge migration is a cgroup1 feature only,
    we might be able to delete it at some point, and these now
    easy-to-identify locking sites along with it.
    Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
    Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
    Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: David Rientjes <rientjes@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | dax: move writeback calls into the filesystems [Ross Zwisler, 2016-02-27, 1 file, -0/+4]
    Previously calls to dax_writeback_mapping_range() for all DAX
    filesystems (ext2, ext4 & xfs) were centralized in
    filemap_write_and_wait_range(). dax_writeback_mapping_range() needs a
    struct block_device, and it used to get that from inode->i_sb->s_bdev.
    This is correct for normal inodes mounted on ext2, ext4 and XFS
    filesystems, but is incorrect for DAX raw block devices and for XFS
    real-time files.

    Instead, call dax_writeback_mapping_range() directly from the filesystem
    ->writepages function so that it can supply us with a valid block
    device. This also fixes DAX code to properly flush caches in response
    to sync(2).
    Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Cc: Al Viro <viro@ftp.linux.org.uk>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Jens Axboe <axboe@fb.com>
    Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
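    For XFS the hook lands at the top of xfs_vm_writepages(); a sketch
    (xfs_find_bdev_for_inode() is what supplies the correct device for
    real-time files):

        if (dax_mapping(mapping))
            return dax_writeback_mapping_range(mapping,
                    xfs_find_bdev_for_inode(mapping->host), wbc);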
* | dax: give DAX clearing code correct bdev [Ross Zwisler, 2016-02-27, 1 file, -1/+1]
    dax_clear_blocks() needs a valid struct block_device and previously it
    was using inode->i_sb->s_bdev in all cases. This is correct for normal
    inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
    DAX raw block devices and for XFS real-time devices.

    Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
    its arguments to take a bdev and a sector instead of an inode and a
    block. This better reflects what the function does, and it allows the
    filesystem and raw block device code to pass in an appropriate struct
    block_device.
    Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
    Suggested-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Al Viro <viro@ftp.linux.org.uk>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Jens Axboe <axboe@fb.com>
    Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
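    The rename as seen from the caller's side; a sketch of the before and
    after prototypes:

        /* before: device implied by the inode's superblock */
        int dax_clear_blocks(struct inode *inode, sector_t block, long size);

        /* after: caller supplies the right device explicitly */
        int dax_clear_sectors(struct block_device *bdev, sector_t sector, long size);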