summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* block_dev: don't update file access position for sync direct IOShaohua Li2016-12-141-4/+1Star
| | | | | | | | | | | For sync direct IO, generic_file_direct_write/generic_file_read_iter will update file access position. Don't duplicate the update in .direct_IO. This cause my raid array can't assemble. Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block_dev: don't test bdev->bd_contains when it is not stableNeilBrown2016-12-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bdev->bd_contains is not stable before calling __blkdev_get(). When __blkdev_get() is called on a parition with ->bd_openers == 0 it sets bdev->bd_contains = bdev; which is not correct for a partition. After a call to __blkdev_get() succeeds, ->bd_openers will be > 0 and then ->bd_contains is stable. When FMODE_EXCL is used, blkdev_get() calls bd_start_claiming() -> bd_prepare_to_claim() -> bd_may_claim() This call happens before __blkdev_get() is called, so ->bd_contains is not stable. So bd_may_claim() cannot safely use ->bd_contains. It currently tries to use it, and this can lead to a BUG_ON(). This happens when a whole device is already open with a bd_holder (in use by dm in my particular example) and two threads race to open a partition of that device for the first time, one opening with O_EXCL and one without. The thread that doesn't use O_EXCL gets through blkdev_get() to __blkdev_get(), gains the ->bd_mutex, and sets bdev->bd_contains = bdev; Immediately thereafter the other thread, using FMODE_EXCL, calls bd_start_claiming() from blkdev_get(). This should fail because the whole device has a holder, but because bdev->bd_contains == bdev bd_may_claim() incorrectly reports success. This thread continues and blocks on bd_mutex. The first thread then sets bdev->bd_contains correctly and drops the mutex. The thread using FMODE_EXCL then continues and when it calls bd_may_claim() again in: BUG_ON(!bd_may_claim(bdev, whole, holder)); The BUG_ON fires. Fix this by removing the dependency on ->bd_contains in bd_may_claim(). As bd_may_claim() has direct access to the whole device, it can simply test if the target bdev is the whole device. Fixes: 6b4517a7913a ("block: implement bd_claiming and claiming block") Cc: stable@vger.kernel.org (v2.6.35+) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge tag 'for-linus-4.10-rc0-tag' of ↵Linus Torvalds2016-12-142-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: "Xen features and fixes for 4.10 These are some fixes, a move of some arm related headers to share them between arm and arm64 and a series introducing a helper to make code more readable. The most notable change is David stepping down as maintainer of the Xen hypervisor interface. This results in me sending you the pull requests for Xen related code from now on" * tag 'for-linus-4.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (29 commits) xen/balloon: Only mark a page as managed when it is released xenbus: fix deadlock on writes to /proc/xen/xenbus xen/scsifront: don't request a slot on the ring until request is ready xen/x86: Increase xen_e820_map to E820_X_MAX possible entries x86: Make E820_X_MAX unconditionally larger than E820MAX xen/pci: Bubble up error and fix description. xen: xenbus: set error code on failure xen: set error code on failures arm/xen: Use alloc_percpu rather than __alloc_percpu arm/arm64: xen: Move shared architecture headers to include/xen/arm xen/events: use xen_vcpu_id mapping for EVTCHNOP_status xen/gntdev: Use VM_MIXEDMAP instead of VM_IO to avoid NUMA balancing xen-scsifront: Add a missing call to kfree MAINTAINERS: update XEN HYPERVISOR INTERFACE xenfs: Use proc_create_mount_point() to create /proc/xen xen-platform: use builtin_pci_driver xen-netback: fix error handling output xen: make use of xenbus_read_unsigned() in xenbus xen: make use of xenbus_read_unsigned() in xen-pciback xen: make use of xenbus_read_unsigned() in xen-fbfront ...
| * xenfs: Use proc_create_mount_point() to create /proc/xenSeth Forshee2016-11-172-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Mounting proc in user namespace containers fails if the xenbus filesystem is mounted on /proc/xen because this directory fails the "permanently empty" test. proc_create_mount_point() exists specifically to create such mountpoints in proc but is currently proc-internal. Export this interface to modules, then use it in xenbus when creating /proc/xen. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
* | Merge tag 'driver-core-4.10-rc1' of ↵Linus Torvalds2016-12-131-2/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here's the new driver core patches for 4.10-rc1. Big thing here is the nice addition of "functional dependencies" to the driver core. The idea has been talked about for a very long time, great job to Rafael for stepping up and implementing it. It's been tested for longer than the 4.9-rc1 date, we held off on merging it earlier in order to feel more comfortable about it. Other than that, it's just a handful of small other patches, some good cleanups to the mess that is the firmware class code, and we have a test driver for the deferred probe logic. All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (30 commits) firmware: Correct handling of fw_state_wait() return value driver core: Silence device links sphinx warning firmware: remove warning at documentation generation time drivers: base: dma-mapping: Fix typo in dmam_alloc_non_coherent comments driver core: test_async: fix up typo found by 0-day firmware: move fw_state_is_done() into UHM section firmware: do not use fw_lock for fw_state protection firmware: drop bit ops in favor of simple state machine firmware: refactor loading status firmware: fix usermode helper fallback loading driver core: firmware_class: convert to use class_groups driver core: devcoredump: convert to use class_groups driver core: class: add class_groups support kernfs: Declare two local data structures static driver-core: fix platform_no_drv_owner.cocci warnings drivers/base/memory.c: Remove unused 'first_page' variable driver core: add CLASS_ATTR_WO() drivers: base: cacheinfo: support DT overrides for cache properties drivers: base: cacheinfo: add pr_fmt logging drivers: base: cacheinfo: fix boot error message when acpi is enabled ...
| * | kernfs: Declare two local data structures staticBart Van Assche2016-11-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | This was spotted by the 'sparse' static checker. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | | Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-blockLinus Torvalds2016-12-1355-170/+413
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block layer updates from Jens Axboe: "This is the main block pull request this series. Contrary to previous release, I've kept the core and driver changes in the same branch. We always ended up having dependencies between the two for obvious reasons, so makes more sense to keep them together. That said, I'll probably try and keep more topical branches going forward, especially for cycles that end up being as busy as this one. The major parts of this pull request is: - Improved support for O_DIRECT on block devices, with a small private implementation instead of using the pig that is fs/direct-io.c. From Christoph. - Request completion tracking in a scalable fashion. This is utilized by two components in this pull, the new hybrid polling and the writeback queue throttling code. - Improved support for polling with O_DIRECT, adding a hybrid mode that combines pure polling with an initial sleep. From me. - Support for automatic throttling of writeback queues on the block side. This uses feedback from the device completion latencies to scale the queue on the block side up or down. From me. - Support from SMR drives in the block layer and for SD. From Hannes and Shaun. - Multi-connection support for nbd. From Josef. - Cleanup of request and bio flags, so we have a clear split between which are bio (or rq) private, and which ones are shared. From Christoph. - A set of patches from Bart, that improve how we handle queue stopping and starting in blk-mq. - Support for WRITE_ZEROES from Chaitanya. - Lightnvm updates from Javier/Matias. - Supoort for FC for the nvme-over-fabrics code. From James Smart. - A bunch of fixes from a whole slew of people, too many to name here" * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits) blk-stat: fix a few cases of missing batch flushing blk-flush: run the queue when inserting blk-mq flush elevator: make the rqhash helpers exported blk-mq: abstract out blk_mq_dispatch_rq_list() helper blk-mq: add blk_mq_start_stopped_hw_queue() block: improve handling of the magic discard payload blk-wbt: don't throttle discard or write zeroes nbd: use dev_err_ratelimited in io path nbd: reset the setup task for NBD_CLEAR_SOCK nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME nvme-fabrics: Add target support for FC transport nvme-fabrics: Add host support for FC transport nvme-fabrics: Add FC transport LLDD api definitions nvme-fabrics: Add FC transport FC-NVME definitions nvme-fabrics: Add FC transport error codes to nvme.h Add type 0x28 NVME type code to scsi fc headers nvme-fabrics: patch target code in prep for FC transport support nvme-fabrics: set sqe.command_id in core not transports parser: add u64 number parser nvme-rdma: align to generic ib_event logging helper ...
| * | | block: protect iterate_bdevs() against concurrent closeRabin Vincent2016-12-011-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a block device is closed while iterate_bdevs() is handling it, the following NULL pointer dereference occurs because bdev->b_disk is NULL in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in turn called by the mapping_cap_writeback_dirty() call in __filemap_fdatawrite_range()): BUG: unable to handle kernel NULL pointer dereference at 0000000000000508 IP: [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20 PGD 9e62067 PUD 9ee8067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU: 1 PID: 2422 Comm: sync Not tainted 4.5.0-rc7+ #400 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) task: ffff880009f4d700 ti: ffff880009f5c000 task.ti: ffff880009f5c000 RIP: 0010:[<ffffffff81314790>] [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20 RSP: 0018:ffff880009f5fe68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88000ec17a38 RCX: ffffffff81a4e940 RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff88000ec176c0 RBP: ffff880009f5fe68 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffff88000ec17860 R13: ffffffff811b25c0 R14: ffff88000ec178e0 R15: ffff88000ec17a38 FS: 00007faee505d700(0000) GS:ffff88000fb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000508 CR3: 0000000009e8a000 CR4: 00000000000006e0 Stack: ffff880009f5feb8 ffffffff8112e7f5 0000000000000000 7fffffffffffffff 0000000000000000 0000000000000000 7fffffffffffffff 0000000000000001 ffff88000ec178e0 ffff88000ec17860 ffff880009f5fec8 ffffffff8112e81f Call Trace: [<ffffffff8112e7f5>] __filemap_fdatawrite_range+0x85/0x90 [<ffffffff8112e81f>] filemap_fdatawrite+0x1f/0x30 [<ffffffff811b25d6>] fdatawrite_one_bdev+0x16/0x20 [<ffffffff811bc402>] iterate_bdevs+0xf2/0x130 [<ffffffff811b2763>] sys_sync+0x63/0x90 [<ffffffff815d4272>] entry_SYSCALL_64_fastpath+0x12/0x76 Code: 0f 1f 44 00 00 48 8b 87 f0 00 00 00 55 48 89 e5 <48> 8b 80 08 05 00 00 5d RIP [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20 RSP <ffff880009f5fe68> CR2: 0000000000000508 ---[ end trace 2487336ceb3de62d ]--- The crash is easily reproducible by running the following command, if an msleep(100) is inserted before the call to func() in iterate_devs(): while :; do head -c1 /dev/nullb0; done > /dev/null & while :; do sync; done Fix it by holding the bd_mutex across the func() call and only calling func() if the bdev is opened. Cc: stable@vger.kernel.org Fixes: 5c0d6b60a0ba ("vfs: Create function for iterating over block devices") Reported-and-tested-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Rabin Vincent <rabinv@axis.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | fs: logfs: remove unnecesary checkMing Lei2016-11-221-1/+0Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The check on bio->bi_vcnt doesn't make sense in erase_end_io(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | fs: logfs: use bio_add_page() in do_erase()Ming Lei2016-11-221-29/+15Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Also code gets simplified a bit. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | fs: logfs: use bio_add_page() in __bdev_writeseg()Ming Lei2016-11-221-31/+20Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Also this patch simplify the code a bit. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | fs: logfs: convert to bio_add_page() in sync_request()Ming Lei2016-11-221-5/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Always bio_add_page() is the standard and preferred way to do the task. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block: bio: pass bvec table to bio_init()Ming Lei2016-11-222-6/+2Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some drivers often use external bvec table, so introduce this helper for this case. It is always safe to access the bio->bi_io_vec in this way for this case. After converting to this usage, it will becomes a bit easier to evaluate the remaining direct access to bio->bi_io_vec, so it can help to prepare for the following multipage bvec support. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixed up the new O_DIRECT cases. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block_dev: get rid of blksize bits calculationJens Axboe2016-11-221-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We store the bits in the bdev sector size locally, but we don't use the calculation anymore. All we do with it is shift it back up to the bdev sector size. So let's just use that directly and kill the variable and bits calculation. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block_dev: Fixed direct I/O bio sector calculationDamien Le Moal2016-11-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A direct I/O alignment must be always checked against the device blocks size, but the I/O offset (bio->bi_iter.bi_sector must always use 512B sector unit, and not the actual logical block size. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block: new direct I/O implementationChristoph Hellwig2016-11-171-4/+162
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to the simple fast path, but we now need a dio structure to track multiple-bio completions. It's basically a cut-down version of the new iomap-based direct I/O code for filesystems, but without all the logic to call into the filesystem for extent lookup or allocation, and without the complex I/O completion workqueue handler for AIO - instead we just use the FUA bit on the bios to ensure data is flushed to stable storage. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block: make __blkdev_direct_IO_sync() support O_SYNC/DSYNCJens Axboe2016-11-171-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | Split the op setting code into a helper, use it in both places. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block: support a full bio worth of IO for simplified bdev direct-ioJens Axboe2016-11-171-4/+15
| | | | | | | | | | | | | | | | | | | | | | | | Just alloc the bio_vec array if we exceed the inline limit. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block: fast-path for small and simple direct I/O requestsChristoph Hellwig2016-11-171-0/+80
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a small and simple fast patch for small direct I/O requests on block devices that don't use AIO. Between the neat bio_iov_iter_get_pages helper that avoids allocating a page array for get_user_pages and the on-stack bio and biovec this avoid memory allocations and atomic operations entirely in the direct I/O code (lower levels might still do memory allocations and will usually have at least some atomic operations, though). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com> Tested-By: Stephen Bates <sbates@raithlin.com> Reviewed-By: Stephen Bates <sbates@raithlin.com>
| * | | block: move poll code to blk-mqJens Axboe2016-11-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The poll code is blk-mq specific, let's move it to blk-mq.c. This is a prep patch for improving the polling code. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * | | writeback: add wbc_to_write_flags()Jens Axboe2016-11-026-12/+7Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add wbc_to_write_flags(), which returns the write modifier flags to use, based on a struct writeback_control. No functional changes in this patch, but it prepares us for factoring other wbc fields for write type. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * | | mm: only include blk_types in swap.h if CONFIG_SWAP is enabledChristoph Hellwig2016-11-014-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's only needed for the CONFIG_SWAP-only use of bio_end_io_t. Because CONFIG_SWAP implies CONFIG_BLOCK this will allow to drop some ifdefs in blk_types.h. Instead we'll need to add a few explicit includes that were implicit before, though. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block,fs: untangle fs.h and blk_types.hChristoph Hellwig2016-11-0114-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Nothing in fs.h should require blk_types.h to be included. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block,fs: use REQ_* flags directlyChristoph Hellwig2016-11-0133-86/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the WRITE_* and READ_SYNC wrappers, and just use the flags directly. Where applicable this also drops usage of the bio_set_op_attrs wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | btrfs: use op_is_sync to check for synchronous requestsChristoph Hellwig2016-11-012-2/+2
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block: better op and flags encodingChristoph Hellwig2016-10-284-6/+5Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we don't need the common flags to overflow outside the range of a 32-bit type we can encode them the same way for both the bio and request fields. This in addition allows us to place the operation first (and make some room for more ops while we're at it) and to stop having to shift around the operation values. In addition this allows passing around only one value in the block layer instead of two (and eventuall also in the file systems, but we can do that later) and thus clean up a lot of code. Last but not least this allows decreasing the size of the cmd_flags field in struct request to 32-bits. Various functions passing this value could also be updated, but I'd like to avoid the churn for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | Merge tag 'pstore-v4.10-rc1' of ↵Linus Torvalds2016-12-136-126/+293
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore updates from Kees Cook: "Improvements and fixes to pstore subsystem: - add additional checks for bad platform data - remove bounce buffer in console writer - protect read/unlink race with a mutex - correctly give up during dump locking failures - increase ftrace bandwidth by splitting ftrace buffers per CPU" * tag 'pstore-v4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: ramoops: add pdata NULL check to ramoops_probe pstore: Convert console write to use ->write_buf pstore: Protect unlink with read_mutex pstore: Use global ftrace filters for function trace filtering ftrace: Provide API to use global filtering for ftrace ops pstore: Clarify context field przs as dprzs pstore: improve error report for failed setup pstore: Merge per-CPU ftrace records into one pstore: Add ftrace timestamp counter ramoops: Split ftrace buffer space into per-CPU zones pstore: Make ramoops_init_przs generic for other prz arrays pstore: Allow prz to control need for locking pstore: Warn on PSTORE_TYPE_PMSG using deprecated function pstore: Make spinlock per zone instead of global pstore: Actually give up during locking failure
| * | | | ramoops: add pdata NULL check to ramoops_probeKees Cook2016-11-161-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a check for a NULL platform data, which should only be possible if a driver incorrectly sets up a probe request without also having defined the platform_data structure. This is based on a patch from Geliang Tang. Signed-off-by: Kees Cook <keescook@chromium.org>
| * | | | pstore: Convert console write to use ->write_bufNamhyung Kim2016-11-161-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Maybe I'm missing something, but I don't know why it needs to copy the input buffer to psinfo->buf and then write. Instead we can write the input buffer directly. The only implementation that supports console message (i.e. ramoops) already does it for ftrace messages. For the upcoming virtio backend driver, it needs to protect psinfo->buf overwritten from console messages. If it could use ->write_buf method instead of ->write, the problem will be solved easily. Cc: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org>
| * | | | pstore: Protect unlink with read_mutexNamhyung Kim2016-11-161-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When update_ms is set, pstore_get_records() will be called when there's a new entry. But unlink can be called at the same time and might contend with the open-read-close loop. Depending on the implementation of platform driver, it may be safe or not. But I think it'd be better to protect those race in the first place. Cc: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org>
| * | | | pstore: Use global ftrace filters for function trace filteringJoel Fernandes2016-11-161-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, pstore doesn't have any filters setup for function tracing. This has the associated overhead and may not be useful for users looking for tracing specific set of functions. ftrace's regular function trace filtering is done writing to tracing/set_ftrace_filter however this is not available if not requested. In order to be able to use this feature, the support to request global filtering introduced earlier in the series should be requested before registering the ftrace ops. Here we do the same. Signed-off-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Kees Cook <keescook@chromium.org>
| * | | | pstore: Clarify context field przs as dprzsKees Cook2016-11-161-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since "przs" (persistent ram zones) is a general name in the code now, so rename the Oops-dump zones to dprzs from przs. Based on a patch from Nobuhiro Iwamatsu. Signed-off-by: Kees Cook <keescook@chromium.org>
| * | | | pstore: improve error report for failed setupKees Cook2016-11-161-19/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When setting ramoops record sizes, sometimes it's not clear which parameters contributed to the allocation failure. This adds a per-zone name and expands the failure reports. Signed-off-by: Kees Cook <keescook@chromium.org>
| * | | | pstore: Merge per-CPU ftrace records into oneJoel Fernandes2016-11-161-13/+109
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Up until this patch, each of the per CPU ftrace buffers appear as a separate ftrace-ramoops-N file. In this patch we merge all the zones into one and populate a single ftrace-ramoops-0 file. Signed-off-by: Joel Fernandes <joelaf@google.com> [kees: clarified variables names, added -ENOMEM handling] Signed-off-by: Kees Cook <keescook@chromium.org>
| * | | | pstore: Add ftrace timestamp counterJoel Fernandes2016-11-163-37/+9Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for merging the per CPU buffers into one buffer when we retrieve the pstore ftrace data, we store the timestamp as a counter in the ftrace pstore record. We store the CPU number as well if !PSTORE_CPU_IN_IP, in this case we shift the counter and may lose ordering there but we preserve the same record size. The timestamp counter is also racy, and not doing any locking or synchronization here results in the benefit of lower overhead. Since we don't care much here for exact ordering of function traces across CPUs, we don't synchronize and may lose some counter updates but I'm ok with that. Using trace_clock() results in much lower performance so avoid using it since we don't want accuracy in timestamp and need a rough ordering to perform merge. Signed-off-by: Joel Fernandes <joelaf@google.com> [kees: updated commit message, added comments] Signed-off-by: Kees Cook <keescook@chromium.org>
| * | | | ramoops: Split ftrace buffer space into per-CPU zonesJoel Fernandes2016-11-161-17/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the RAMOOPS_FLAG_FTRACE_PER_CPU flag is passed to ramoops pdata, split the ftrace space into multiple zones depending on the number of CPUs. This speeds up the performance of function tracing by about 280% in my tests as we avoid the locking. The trade off being lesser space available per CPU. Let the ramoops user decide which option they want based on pdata flag. Signed-off-by: Joel Fernandes <joelaf@google.com> [kees: added max_ftrace_cnt to track size, added DT logic and docs] Signed-off-by: Kees Cook <keescook@chromium.org>
| * | | | pstore: Make ramoops_init_przs generic for other prz arraysKees Cook2016-11-161-28/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently ramoops_init_przs() is hard wired only for panic dump zone array. In preparation for the ftrace zone array (one zone per-cpu) and pmsg zone array, make the function more generic to be able to handle this case. Heavily based on similar work from Joel Fernandes. Signed-off-by: Kees Cook <keescook@chromium.org>
| * | | | pstore: Allow prz to control need for lockingJoel Fernandes2016-11-162-11/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation of not locking at all for certain buffers depending on if there's contention, make locking optional depending on the initialization of the prz. Signed-off-by: Joel Fernandes <joelaf@google.com> [kees: moved locking flag into prz instead of via caller arguments] Signed-off-by: Kees Cook <keescook@chromium.org>
| * | | | pstore: Warn on PSTORE_TYPE_PMSG using deprecated functionJoel Fernandes2016-11-111-4/+2Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PMSG now uses ramoops_pstore_write_buf_user() instead of ...write_buf(). Print a ratelimited warning if gets accidentally called. Signed-off-by: Joel Fernandes <joelaf@google.com> [kees: adjusted commit log and added -EINVAL return] Signed-off-by: Kees Cook <keescook@chromium.org>
| * | | | pstore: Make spinlock per zone instead of globalJoel Fernandes2016-11-111-6/+5Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently pstore has a global spinlock for all zones. Since the zones are independent and modify different areas of memory, there's no need to have a global lock, so we should use a per-zone lock as introduced here. Also, when ramoops's ftrace use-case has a FTRACE_PER_CPU flag introduced later, which splits the ftrace memory area into a single zone per CPU, it will eliminate the need for locking. In preparation for this, make the locking optional. Signed-off-by: Joel Fernandes <joelaf@google.com> [kees: updated commit message] Signed-off-by: Kees Cook <keescook@chromium.org>
| * | | | pstore: Actually give up during locking failureLi Pengcheng2016-11-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Without a return after the pr_err(), dumps will collide when two threads call pstore_dump() at the same time. Signed-off-by: Liu Hailong <liuhailong5@huawei.com> Signed-off-by: Li Pengcheng <lipengcheng8@huawei.com> Signed-off-by: Li Zhong <lizhong11@hisilicon.com> [kees: improved commit message] Signed-off-by: Kees Cook <keescook@chromium.org>
* | | | | Merge tag 'docs-4.10' of git://git.lwn.net/linuxLinus Torvalds2016-12-132-3/+3
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull documentation update from Jonathan Corbet: "These are the documentation changes for 4.10. It's another busy cycle for the docs tree, as the sphinx conversion continues. Highlights include: - Further work on PDF output, which remains a bit of a pain but should be more solid now. - Five more DocBook template files converted to Sphinx. Only 27 to go... Lots of plain-text files have also been converted and integrated. - Images in binary formats have been replaced with more source-friendly versions. - Various bits of organizational work, including the renaming of various files discussed at the kernel summit. - New documentation for the device_link mechanism. ... and, of course, lots of typo fixes and small updates" * tag 'docs-4.10' of git://git.lwn.net/linux: (193 commits) dma-buf: Extract dma-buf.rst Update Documentation/00-INDEX docs: 00-INDEX: document directories/files with no docs docs: 00-INDEX: remove non-existing entries docs: 00-INDEX: add missing entries for documentation files/dirs docs: 00-INDEX: consolidate process/ and admin-guide/ description scripts: add a script to check if Documentation/00-INDEX is sane Docs: change sh -> awk in REPORTING-BUGS Documentation/core-api/device_link: Add initial documentation core-api: remove an unexpected unident ppc/idle: Add documentation for powersave=off Doc: Correct typo, "Introdution" => "Introduction" Documentation/atomic_ops.txt: convert to ReST markup Documentation/local_ops.txt: convert to ReST markup Documentation/assoc_array.txt: convert to ReST markup docs-rst: parse-headers.pl: cleanup the documentation docs-rst: fix media cleandocs target docs-rst: media/Makefile: reorganize the rules docs-rst: media: build SVG from graphviz files docs-rst: replace bayer.png by a SVG image ...
| * \ \ \ \ Merge tag 'v4.9-rc4' into soundJonathan Corbet2016-11-1965-832/+944
| |\ \ \ \ \ | | | |_|_|/ | | |/| | | | | | | | | Bring in -rc4 patches so I can successfully merge the sound doc changes.
| * | | | | docs: fix locations of several documents that got movedMauro Carvalho Chehab2016-10-242-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous patch renamed several files that are cross-referenced along the Kernel documentation. Adjust the links to point to the right places. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
* | | | | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2016-12-1322-82/+101
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge updates from Andrew Morton: - various misc bits - most of MM (quite a lot of MM material is awaiting the merge of linux-next dependencies) - kasan - printk updates - procfs updates - MAINTAINERS - /lib updates - checkpatch updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (123 commits) init: reduce rootwait polling interval time to 5ms binfmt_elf: use vmalloc() for allocation of vma_filesz checkpatch: don't emit unified-diff error for rename-only patches checkpatch: don't check c99 types like uint8_t under tools checkpatch: avoid multiple line dereferences checkpatch: don't check .pl files, improve absolute path commit log test scripts/checkpatch.pl: fix spelling checkpatch: don't try to get maintained status when --no-tree is given lib/ida: document locking requirements a bit better lib/rbtree.c: fix typo in comment of ____rb_erase_color lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM MAINTAINERS: add drm and drm/i915 irc channels MAINTAINERS: add "C:" for URI for chat where developers hang out MAINTAINERS: add drm and drm/i915 bug filing info MAINTAINERS: add "B:" for URI where to file bugs get_maintainer: look for arbitrary letter prefixes in sections printk: add Kconfig option to set default console loglevel printk/sound: handle more message headers printk/btrfs: handle more message headers printk/kdb: handle more message headers ...
| * | | | | | binfmt_elf: use vmalloc() for allocation of vma_fileszJason Baron2016-12-131-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have observed page allocations failures of order 4 during core dump while trying to allocate vma_filesz. This results in a useless core file of size 0. To improve reliability use vmalloc(). Note that the vmalloc() allocation is bounded by sysctl_max_map_count, which is 65,530 by default. So with a 4k page size, and 8 bytes per seg, this is a max of 128 pages or an order 7 allocation. Other parts of the core dump path, such as fill_files_note() are already using vmalloc() for presumably similar reasons. Link: http://lkml.kernel.org/r/1479745791-17611-1-git-send-email-jbaron@akamai.com Signed-off-by: Jason Baron <jbaron@akamai.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | printk/btrfs: handle more message headersPetr Mladek2016-12-131-11/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation lines") allows to define more message headers for a single message. The motivation is that continuous lines might get mixed. Therefore it make sense to define the right log level for every piece of a cont line. The current btrfs_printk() macros do not support continuous lines at the moment. But better be prepared for a custom messages and avoid potential "lvl" buffer overflow. This patch iterates over the entire message header. It is interested only into the message level like the original code. This patch also introduces PRINTK_MAX_SINGLE_HEADER_LEN. Three bytes are enough for the message level header at the moment. But it used to be three, see the commit 04d2c8c83d0e ("printk: convert the format for KERN_<LEVEL> to a 2 byte pattern"). Also I fixed the default ratelimit level. It looked very strange when it was different from the default log level. [pmladek@suse.com: Fix a check of the valid message level] Link: http://lkml.kernel.org/r/20161111183236.GD2145@dhcp128.suse.cz Link: http://lkml.kernel.org/r/1478695291-12169-4-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: David Sterba <dsterba@suse.com> Cc: Joe Perches <joe@perches.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | fs/proc: calculate /proc/* and /proc/*/task/* nlink at init timeAlexey Dobriyan2016-12-133-6/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Runtime nlink calculation works but meh. I don't know how to do it at compile time, but I know how to do it at init time. Shift "2+" part into init time as a bonus. Link: http://lkml.kernel.org/r/20161122195549.GB29812@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | fs/proc/base.c: save decrement during lookup/readdir in /proc/$PIDAlexey Dobriyan2016-12-131-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Comparison for "<" works equally well as comparison for "<=" but one SUB/LEA is saved (no, it is not optimised away, at least here). Link: http://lkml.kernel.org/r/20161122195143.GA29812@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | | | fs/proc/array.c: slightly improve render_sigset_tRasmus Villemoes2016-12-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | format_decode and vsnprintf occasionally show up in perf top, so I went looking for places that might not need the full printf power. With the help of kprobes, I gathered some statistics on which format strings we mostly pass to vsnprintf. On a trivial desktop workload, I hit "%x" 25% of the time, so something apparently reads /proc/pid/status (which does 5*16 printf("%x") calls) a lot. With this patch, reading /proc/pid/status is 30% faster according to this microbenchmark: char buf[4096]; int i, fd; for (i = 0; i < 10000; ++i) { fd = open("/proc/self/status", O_RDONLY); read(fd, buf, sizeof(buf)); close(fd); } Link: http://lkml.kernel.org/r/1474410485-1305-1-git-send-email-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Andrei Vagin <avagin@openvz.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>