Commit message (Author, Age, Files, Lines)
...
| | * | | drm/imx: ipuv3-plane: Constify ipu_plane_funcs (Liu Ying, 2016-05-30, 1 file, -1/+1)
    Signed-off-by: Liu Ying <gnuiyl@gmail.com>
    Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
| | * | | drm/imx: imx-ldb: honor 'native-mode' property when selecting video mode from DT (Lothar Waßmann, 2016-05-30, 1 file, -1/+2)
    This patch allows selecting a specific video mode from a list of modes defined in DT by setting the 'native-mode' property appropriately.

    This change does not affect the behaviour of existing platforms, since they either:
    - have just one display-timings subnode
    - have the native-mode property pointing to the first entry
    - let the bootloader select the appropriate timing

    Signed-off-by: Lothar Waßmann <LW@KARO-electronics.de>
    Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
| | * | | drm/imx: parallel-display: remove dead code (Lothar Waßmann, 2016-05-30, 1 file, -12/+0)
    The 'mode_valid' flag is never set in this driver. Remove it and the code that depends on it.

    Signed-off-by: Lothar Waßmann <LW@KARO-electronics.de>
    Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
| | * | | drm/imx: use bus_flags for pixel clock polarity (Philipp Zabel, 2016-05-30, 5 files, -15/+25)
    This patch allows panels to set pixel clock and data enable pin polarity other than the default of driving data at the falling pixel clock edge and active high display enable.

    Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
| | * | | drm/imx: ipuv3-plane: enable UYVY and VYUY formats (Philipp Zabel, 2016-05-30, 1 file, -0/+2)
    Advertise the DRM_FORMAT_UYVY and DRM_FORMAT_VYUY formats to userspace.

    Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
| | * | | drm/imx: parallel-display: use of_graph_get_endpoint_by_regs helper (Philipp Zabel, 2016-05-30, 1 file, -12/+12)
    Instead of using of_graph_get_port_by_id() to get the port and then of_get_child_by_name() to get the first endpoint, get to the endpoint in a single step.

    Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
| | * | | drm/imx: imx-ldb: use of_graph_get_endpoint_by_regs helper (Philipp Zabel, 2016-05-30, 1 file, -17/+18)
    Instead of using of_graph_get_port_by_id() to get the port and then of_get_child_by_name() to get the first endpoint, get to the endpoint in a single step.

    Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
| | * | | dt-bindings: imx: ldb: Add ddc-i2c-bus property (Akshay Bhat, 2016-05-30, 1 file, -0/+1)
    Document the ddc-i2c-bus property used by imx-ldb driver to read EDID information via I2C interface.

    Signed-off-by: Akshay Bhat <akshay.bhat@timesys.com>
    Acked-by: Rob Herring <robh@kernel.org>
    Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
| | * | | drm/imx: imx-ldb: Add DDC support (Steve Longerbeam, 2016-05-30, 1 file, -8/+34)
| | |/ /
    Add support for reading EDID over Display Data Channel. If no DDC adapter is available, falls back to hardcoded EDID or display-timings node as before.

    Signed-off-by: Steve Longerbeam <steve_longerbeam@mentor.com>
    Signed-off-by: Akshay Bhat <akshay.bhat@timesys.com>
    Acked-by: Philipp Zabel <p.zabel@pengutronix.de>
    Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
| * | | Merge tag 'drm-atmel-hlcdc-fixes/for-4.7-rc2' of github.com:bbrezillon/linux-at91 into drm-fixes (Dave Airlie, 2016-06-03, 1 file, -5/+5)
| |\ \ \
    Two trivial bugfixes for the atmel-hlcdc driver.

    The first one is making use of __drm_atomic_helper_crtc_destroy_state() instead of duplicating its logic in atmel_hlcdc_crtc_reset() and risking memory leaks if other objects are added to the common CRTC state.

    The second one is fixing a possible NULL pointer dereference.

    * tag 'drm-atmel-hlcdc-fixes/for-4.7-rc2' of github.com:bbrezillon/linux-at91:
      drm: atmel-hlcdc: fix a NULL check
      drm: atmel-hlcdc: fix atmel_hlcdc_crtc_reset() implementation
| | * | | drm: atmel-hlcdc: fix a NULL check (Dan Carpenter, 2016-06-01, 1 file, -2/+3)
    If kmalloc() returned NULL we would end up dereferencing "state" a couple lines later.

    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
| | * | | drm: atmel-hlcdc: fix atmel_hlcdc_crtc_reset() implementation (Boris Brezillon, 2016-06-01, 1 file, -3/+2)
| | |/ /
    Reset crtc->state to NULL after freeing the state object and call __drm_atomic_helper_crtc_destroy_state() helper instead of manually calling drm_property_unreference_blob().

    Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
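The two atmel-hlcdc fixes above follow a common reset pattern: drop the old state, clear the pointer, then check the new allocation before using it. A minimal user-space sketch of that pattern follows. The struct and function names are hypothetical stand-ins, and calloc()/free() replace the kernel allocator and the DRM destroy-state helper, so this is an illustration under those assumptions rather than the driver's code.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the CRTC and its state object. */
struct crtc_state {
	int active;
};

struct crtc {
	struct crtc_state *state;
};

/* Reset: free the old state, clear the pointer, then install a fresh one. */
static void crtc_reset(struct crtc *crtc)
{
	struct crtc_state *state;

	free(crtc->state);
	crtc->state = NULL;   /* never leave a dangling pointer behind */

	state = calloc(1, sizeof(*state));
	if (!state)           /* bail out before anything dereferences it */
		return;

	state->active = 0;
	crtc->state = state;
}
```

If the allocation fails, the CRTC is left with a NULL state rather than a stale or half-initialised one, which is the property both fixes aim for.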
| * | | Merge branch 'for-upstream/hdlcd' of git://linux-arm.org/linux-ld into drm-fixes (Dave Airlie, 2016-06-03, 3 files, -81/+78)
| |\ \ \
    "I have accumulated some cleanup patches for HDLCD, partly triggered by Daniel Vetter's work on non-blocking atomic operations, that I would like to integrate into v4.7. My first patch is important for the newly enabled hibernate option for AArch64 on Juno, the others are fixing behaviour in HDLCD and adding a debugfs entry to help track the underlying framebuffer usage.

    I'm also taking one of Daniel's patches from his non-blocking series to help with the integration of his patches later."

    * 'for-upstream/hdlcd' of git://linux-arm.org/linux-ld:
      drm: hdlcd: Add information about the underlying framebuffers in debugfs
      drm: hdlcd: Cleanup the atomic plane operations
      drm/hdlcd: Fix up crtc_state->event handling
      drm: hdlcd: Revamp runtime power management
| | * | | drm: hdlcd: Add information about the underlying framebuffers in debugfs (Liviu Dudau, 2016-06-02, 1 file, -0/+1)
    drm_fb_cma code has a nice helper function to display in the debugfs information about the underlying framebuffers used by HDLCD:

    $ cat /sys/kernel/debug/dri/0/fb
    fb: 1920x1200@XR24
       0: offset=0 pitch=7680, obj: 0 ( 2) 001011ba 0x00000000fc300000 ffffff800a27c000 9338880
    fb: 1920x1200@XR24
       0: offset=0 pitch=7680, obj: 0 ( 2) 001008ca 0x00000000fba00000 ffffff8009987000 9338880
    fb: 1920x1200@XR24
       0: offset=0 pitch=7680, obj: 0 ( 1) 00100000 0x00000000fb100000 ffffff8008fdc000 9216000

    Add the entry in HDLCD's debugfs node.

    Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com>
| | * | | drm: hdlcd: Cleanup the atomic plane operations (Liviu Dudau, 2016-06-02, 2 files, -17/+29)
    Harden the plane_check() code to drop attempts at scaling because that is not supported. Make hdlcd_plane_atomic_update() set the pitch and line length registers that correctly reflect the plane's values. And make hdlcd_crtc_mode_set_nofb() a helper function for hdlcd_crtc_enable() rather than an exposed hook.

    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com>
| | * | | drm/hdlcd: Fix up crtc_state->event handling (Daniel Vetter, 2016-06-02, 3 files, -29/+9)
    event_list just reimplemented what drm_crtc_arm_vblank_event does. And we also need to send out drm events when shutting down a pipe.

    With this it's possible to use the new nonblocking commit support in the helpers.

    Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
    Acked-by: Liviu Dudau <Liviu.Dudau@arm.com>
| | * | | drm: hdlcd: Revamp runtime power management (Liviu Dudau, 2016-06-02, 3 files, -35/+39)
| | |/ /
    Because the HDLCD driver acts as a component master it can end up enabling the runtime PM functionality before the encoders are initialised. This can cause crashes if the component slave never probes (missing module) or if the PM operations kick in before the probe finishes.

    Move the enabling of the runtime PM after the component master has finished collecting the slave components and use the DRM atomic helpers to suspend and resume the device.

    Tested-by: Robin Murphy <Robin.Murphy@arm.com>
    Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com>
| * | | Merge tag 'mediatek-drm-fixes-2016-06-01' of git://git.pengutronix.de/git/pza/linux into drm-fixes (Dave Airlie, 2016-06-01, 2 files, -8/+1)
| |\ \ \
    mediatek-drm fixes

    - remove an invalid, unreachable error message and NULL pointer dereference
    - remove a spurious drm_connector_unregister call from the DSI driver

    * tag 'mediatek-drm-fixes-2016-06-01' of git://git.pengutronix.de/git/pza/linux:
      drm/mediatek: mtk_dsi: Remove spurious drm_connector_unregister
      drm/mediatek: mtk_dpi: remove invalid error message
| | * | | drm/mediatek: mtk_dsi: Remove spurious drm_connector_unregister (Philipp Zabel, 2016-06-01, 1 file, -3/+1)
    Connectors are unregistered by mtk_drm_drv via drm_connector_unregister_all().

    Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
| | * | | drm/mediatek: mtk_dpi: remove invalid error message (Philipp Zabel, 2016-06-01, 1 file, -5/+0)
| | |/ /
    Do not try to dereference dpi if it is NULL. Since dpi can never be NULL when mtk_dpi_set_display_mode() is called, remove the message.

    Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
    Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
| * | | drm/mgag200: Black screen fix for G200e rev 4 (Mathieu Larouche, 2016-06-01, 1 file, -1/+9)
    - Fixed black screen for some resolutions of G200e rev4
    - Fixed testm & testn which had predetermined value.

    Reported-by: Jan Beulich <jbeulich@suse.com>
    Signed-off-by: Mathieu Larouche <mathieu.larouche@matrox.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Dave Airlie <airlied@redhat.com>
| * | | drm: Wrap direct calls to driver->gem_free_object from CMA (Chris Wilson, 2016-06-01, 2 files, -10/+4)
    Since the introduction of (struct_mutex) lockless GEM bo freeing, there are a pair of driver vfuncs for freeing the GEM bo, of which the driver may choose to only implement driver->gem_object_free_unlocked (and so avoid taking the struct_mutex along the free path). However, the CMA GEM helpers were still calling driver->gem_free_object directly, now NULL, and promptly dying on the fancy new lockless drivers. Oops.

    Robert Foss bisected this to b82caafcf2303 (drm/vc4: Use lockless gem BO free callback) on his vc4 device, but that just serves as an enabler for 9f0ba539d13ae (drm/gem: support BO freeing without dev->struct_mutex).

    Reported-by: Robert Foss <robert.foss@collabora.com>
    Fixes: 9f0ba539d13ae (drm/gem: support BO freeing without dev->struct_mutex)
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Robert Foss <robert.foss@collabora.com>
    Cc: Daniel Vetter <daniel.vetter@intel.com>
    Cc: Eric Anholt <eric@anholt.net>
    Cc: Alex Deucher <alexdeucher@gmail.com>
    Cc: Lucas Stach <l.stach@pengutronix.de>
    Cc: stable@vger.kernel.org
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Tested-by: Robert Foss <robert.foss@collabora.com>
    Signed-off-by: Dave Airlie <airlied@redhat.com>
| * | | drm: fix fb refcount issue with atomic modesetting (Tomi Valkeinen, 2016-06-01, 1 file, -1/+2)
    After commit 027b3f8ba9277410c3191d72d1ed2c6146d8a668 ("drm/modes: stop handling framebuffer special") extra fb refs are left around when doing atomic modesetting.

    The problem is that the new drm_property_change_valid_get() does not return anything in the '**ref' parameter, which causes drm_property_change_valid_put() to do nothing.

    For some reason this doesn't cause problems with legacy API.

    Also, previously the code only set the 'ref' variable for fbs, with this patch the 'ref' is set for all objects.

    Fixes: 027b3f8ba927 ("drm/modes: stop handling framebuffer special")
    Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Dave Airlie <airlied@redhat.com>
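The refcount leak described above comes from a get/put pair where the get side takes a reference but never reports it through its out-parameter. The sketch below models that with a hypothetical refcounted object; the names valid_get_buggy, valid_get_fixed and valid_put are illustrative only and are not the DRM functions themselves.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for a DRM mode object. */
struct object {
	int refcount;
};

/* Buggy variant: takes a reference but never reports it through *ref. */
static bool valid_get_buggy(struct object *obj, struct object **ref)
{
	(void)ref;
	obj->refcount++;
	return true;
}

/* Fixed variant: reports the object so the matching put can drop it. */
static bool valid_get_fixed(struct object *obj, struct object **ref)
{
	obj->refcount++;
	*ref = obj;
	return true;
}

/* Drops the reference taken by get, if one was reported. */
static void valid_put(struct object *ref)
{
	if (ref)
		ref->refcount--;
}

/* Runs one get/put cycle and returns how many references were leaked. */
static int leaked_refs(bool (*get)(struct object *, struct object **))
{
	struct object fb = { .refcount = 1 };
	struct object *ref = NULL;

	if (get(&fb, &ref))
		valid_put(ref);
	return fb.refcount - 1;
}
```

With the buggy getter, each cycle leaves one extra reference behind because the put side sees a NULL ref and does nothing.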
| * | | drm: make drm_atomic_set_mode_prop_for_crtc() more reliable (Tomi Valkeinen, 2016-06-01, 1 file, -1/+2)
    drm_atomic_set_mode_prop_for_crtc() does not clear the state->mode, so old data may be left there when a new mode is set, possibly causing odd issues.

    This patch improves the situation by always clearing the state->mode first.

    Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
    Cc: stable@vger.kernel.org
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Dave Airlie <airlied@redhat.com>
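The "always clear first" approach from the commit above can be sketched in plain C. The struct layouts and the function name set_mode_prop are hypothetical simplifications, not the DRM definitions; the point is only that zeroing the embedded mode before the optional copy keeps stale fields from surviving.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, heavily reduced versions of the mode and CRTC state. */
struct display_mode {
	int clock;
	int hdisplay;
	int vdisplay;
};

struct crtc_state {
	struct display_mode mode;
	bool enable;
};

/* Sets a new mode, or disables the CRTC when new_mode is NULL. */
static void set_mode_prop(struct crtc_state *state,
			  const struct display_mode *new_mode)
{
	/* Always start from a clean slate so stale fields cannot survive. */
	memset(&state->mode, 0, sizeof(state->mode));

	if (new_mode) {
		state->mode = *new_mode;
		state->enable = true;
	} else {
		state->enable = false;
	}
}
```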
| * | | drm/sti: remove extra mode fixup (Tomi Valkeinen, 2016-06-01, 1 file, -10/+0)
    Commit 652353e6e561c2aeeac62df183f721f6f9b5b45f ("drm/sti: set CRTC modesetting parameters") added a hack to avoid warnings related to setting mode with atomic API.

    With the previous patch, the hack should no longer be necessary.

    Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
    Cc: Benjamin Gaignard <benjamin.gaignard@linaro.org>
    Cc: Vincent Abriou <vincent.abriou@st.com>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Dave Airlie <airlied@redhat.com>
| * | | drm: add missing drm_mode_set_crtcinfo call (Tomi Valkeinen, 2016-06-01, 2 files, -2/+2)
| |/ /
    When setting mode via MODE_ID property, drm_atomic_set_mode_prop_for_crtc() does not call drm_mode_set_crtcinfo() which possibly causes:

    "[drm:drm_calc_timestamping_constants [drm]] *ERROR* crtc 32: Can't calculate constants, dotclock = 0!"

    Whether the error is seen depends on the previous data in state->mode, as state->mode is not cleared when setting new mode.

    This patch adds drm_mode_set_crtcinfo() call to drm_mode_convert_umode(), which is called in both legacy and atomic paths. This should be fine as there's no reason to call drm_mode_convert_umode() without also setting the crtc related fields.

    drm_mode_set_crtcinfo() is removed from the legacy drm_mode_setcrtc() as that is no longer needed.

    Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Cc: stable@vger.kernel.org
    Signed-off-by: Dave Airlie <airlied@redhat.com>
* | | Merge tag 'vfio-v4.7-rc2' of git://github.com/awilliam/linux-vfio (Linus Torvalds, 2016-06-04, 3 files, -5/+6)
|\ \ \
    Pull VFIO fixes from Alex Williamson:
     "Fix irqfd shutdown ordering, build warning, and VPD short read"

    * tag 'vfio-v4.7-rc2' of git://github.com/awilliam/linux-vfio:
      vfio/pci: Allow VPD short read
      vfio/type1: Fix build warning
      vfio/pci: Fix ordering of eventfd vs virqfd shutdown
| * | | vfio/pci: Allow VPD short read (Alex Williamson, 2016-06-01, 1 file, -1/+2)
    The size of the VPD area is not necessarily 4-byte aligned, so a pci_vpd_read() might return less than 4 bytes. Zero our buffer and accept anything other than an error. Intel X710 NICs exercise this.

    Fixes: 4e1a635552d3 ("vfio/pci: Use kernel VPD access functions")
    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
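The short-read handling described above (zero the buffer, then accept any non-error result) can be illustrated with a small user-space model. Here vpd_read() is a hypothetical stand-in for the kernel helper, assumed to return the number of bytes copied or a negative value on error; vpd_read_word() is likewise an illustrative name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for a VPD read helper: copies up to `count` bytes
 * starting at `pos` and returns the number of bytes copied, which may be
 * short at the end of the area, or -1 if `pos` is out of range.
 */
static long vpd_read(const uint8_t *vpd, size_t vpd_len, size_t pos,
		     size_t count, void *buf)
{
	if (pos >= vpd_len)
		return -1;
	if (count > vpd_len - pos)
		count = vpd_len - pos;
	memcpy(buf, vpd + pos, count);
	return (long)count;
}

/* Reads one 4-byte word; unread trailing bytes stay zero. Returns 0 or -1. */
static int vpd_read_word(const uint8_t *vpd, size_t vpd_len, size_t pos,
			 uint8_t out[4])
{
	memset(out, 0, 4);                          /* zero the buffer first */
	if (vpd_read(vpd, vpd_len, pos, 4, out) < 0)
		return -1;                          /* only a real error fails */
	return 0;                                   /* short reads are accepted */
}
```

A check of the form "result must equal 4" would reject the last word of a VPD area whose size is not a multiple of four, which is the failure mode the commit describes.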
| * | | vfio/type1: Fix build warning (Alex Williamson, 2016-05-30, 1 file, -1/+1)
    This function cannot actually be called with npage = 0, so in practice this doesn't return an uninitialized value.

    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * | | vfio/pci: Fix ordering of eventfd vs virqfd shutdown (Alex Williamson, 2016-05-30, 1 file, -3/+3)
| |/ /
    Both the INTx and MSI/X disable paths do an eventfd_ctx_put() for the trigger eventfd before calling vfio_virqfd_disable() on any potential mask and unmask eventfds. This opens a use-after-free race where an inopportune irqfd can reference the freed signalling eventfd. Reorder to avoid this possibility.

    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* | | Merge tag 'mmc-v4.7-rc1-2' of git://git.linaro.org/people/ulf.hansson/mmc (Linus Torvalds, 2016-06-04, 2 files, -9/+4)
|\ \ \
    Pull MMC fixes from Ulf Hansson:
     "MMC core:
       - Fix/restore behaviour when selecting bus width for (e)MMC

      MMC host:
       - sunxi: Fix eMMC HS-DDR modes on Allwinner A80"

    * tag 'mmc-v4.7-rc1-2' of git://git.linaro.org/people/ulf.hansson/mmc:
      mmc: sunxi: Re-enable eMMC HS-DDR modes on Allwinner A80
      mmc: sunxi: Fix DDR MMC timings for A80
      mmc: fix mmc mode selection for HS-DDR and higher
| * | | mmc: sunxi: Re-enable eMMC HS-DDR modes on Allwinner A80 (Chen-Yu Tsai, 2016-06-02, 1 file, -5/+0)
    Now that the HS-DDR mode clock timings have been corrected, we can re-enable these modes on the A80.

    Signed-off-by: Chen-Yu Tsai <wens@csie.org>
    Acked-by: Hans de Goede <hdegoede@redhat.com>
    Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
| * | | mmc: sunxi: Fix DDR MMC timings for A80 (Chen-Yu Tsai, 2016-06-02, 1 file, -2/+2)
    The MMC clock timings were incorrectly calculated, when the conversion from delay value to delay phase was done.

    The 50M DDR and 50M DDR 8bit timings are off, and make eMMC DDR unusable. Unfortunately it seems different controllers on the same SoC have different timings. The new settings are taken from mmc2, which is commonly used with eMMC.

    The settings for the slower timing modes seem to work despite being wrong, so leave them be.

    Signed-off-by: Chen-Yu Tsai <wens@csie.org>
    Acked-by: Hans de Goede <hdegoede@redhat.com>
    Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
| * | | mmc: fix mmc mode selection for HS-DDR and higher (Chen-Yu Tsai, 2016-06-02, 1 file, -2/+2)
| |/ /
    When IS_ERR_VALUE was removed from the mmc core code, it was replaced with a simple not-zero check. This does not work, as the value checked is the return value for mmc_select_bus_width, which returns the set bit width on success. This made eMMC modes higher than HS-DDR unusable.

    Fix this by checking for a positive return value instead.

    Fixes: 287980e49ffc ("remove lots of IS_ERR_VALUE abuses")
    Cc: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: Chen-Yu Tsai <wens@csie.org>
    Acked-by: Hans de Goede <hdegoede@redhat.com>
    Reviewed-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
    Acked-by: Jaehoon Chung <jh80.chung@samsung.com>
    Reviewed-by: Shawn Lin <shawn.lin@rock-chips.com>
    Tested-by: Marcel Ziswiler <marcel.ziswiler@toradex.com>
    Tested-by: Bjorn Andersson <bjorn.andersson@linaro.org>
    Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
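The difference between the not-zero check and the positive-value check can be shown with a small model. select_bus_width() below is a hypothetical stand-in: only the "returns the set bit width on success" behaviour comes from the commit message, while the zero and negative return cases are assumptions made for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for mmc_select_bus_width(): returns the selected
 * bus width (4 or 8) on success, 0 if the width was left unchanged, or a
 * negative errno on failure.
 */
static int select_bus_width(int requested)
{
	if (requested == 4 || requested == 8)
		return requested;
	if (requested == 1)
		return 0;
	return -EINVAL;
}

/* Buggy check: a successful (positive) width is mistaken for an error. */
static bool may_try_ddr_buggy(int requested)
{
	int err = select_bus_width(requested);

	return !err;
}

/* Fixed check: only a positive return value counts as a selected width. */
static bool may_try_ddr_fixed(int requested)
{
	int err = select_bus_width(requested);

	return err > 0;
}
```

In this model the buggy check refuses to continue exactly when a 4-bit or 8-bit width was selected, which matches the symptom of the faster eMMC modes becoming unusable.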
* | | Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs (Linus Torvalds, 2016-06-04, 8 files, -18/+103)
|\ \ \
    Pull btrfs fixes from Chris Mason:
     "The important part of this pull is Filipe's set of fixes for btrfs device replacement. Filipe fixed a few issues seen on the list and a number he found on his own"

    * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
      Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent
      Btrfs: fix race between device replace and read repair
      Btrfs: fix race between device replace and discard
      Btrfs: fix race between device replace and chunk allocation
      Btrfs: fix race setting block group back to RW mode during device replace
      Btrfs: fix unprotected assignment of the left cursor for device replace
      Btrfs: fix race setting block group readonly during device replace
      Btrfs: fix race between device replace and block group removal
      Btrfs: fix race between readahead and device replace/removal
| * | | Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent (Chris Mason, 2016-06-03, 1 file, -1/+12)
    When dealing with inline extents, btrfs_get_extent will incorrectly try to insert a duplicate extent_map. The dup hits -EEXIST from add_extent_map, but then we try to merge with the existing one and end up trying to insert a zero length extent_map.

    This actually works most of the time, except when there are extent maps past the end of the inline extent. rocksdb will trigger this sometimes because it preallocates an extent and then truncates down.

    Josef made a script to trigger with xfs_io:

        #!/bin/bash

        xfs_io -f -c "pwrite 0 1000" inline
        xfs_io -c "falloc -k 4k 1M" inline
        xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline
        xfs_io -c "fadvise -d 0 1000" inline
        cat inline

    You'll get EIOs trying to read inline after this because add_extent_map is returning EEXIST

    Signed-off-by: Chris Mason <clm@fb.com>
| * | | Merge branch 'dev-replace-fixes-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.7 (Chris Mason, 2016-06-02, 7 files, -17/+91)
| |\ \ \
| | * | | Btrfs: fix race between device replace and read repair (Filipe Manana, 2016-05-31, 1 file, -0/+10)
    While we are finishing a device replace operation we can have a concurrent task trying to do a read repair operation, in which case it will call btrfs_map_block() to get a struct btrfs_bio which can have a stripe that points to the source device of the device replace operation. This allows for the read repair task to dereference the stripe's device pointer after the device replace operation has freed the source device, resulting in an invalid memory access. This is similar to the problem solved by my previous patch in the same series and named "Btrfs: fix race between device replace and discard".

    So fix this by surrounding the call to btrfs_map_block() and the code that uses the returned struct btrfs_bio with calls to btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the proper serialization with the finishing phase of the device replace operation.

    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: Josef Bacik <jbacik@fb.com>
| | * | | Btrfs: fix race between device replace and discard (Filipe Manana, 2016-05-31, 1 file, -0/+6)
    While we are finishing a device replace operation, we can make a discard operation (fs mounted with -o discard) do an invalid memory access like the one reported by the following trace:

    [ 3206.384654] general protection fault: 0000 [#1] PREEMPT SMP
    [ 3206.387520] Modules linked in: dm_mod btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis psmouse tpm ppdev sg parport_pc evdev i2c_piix4 parport processor serio_raw i2c_core pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring scsi_mod e1000 virtio floppy [last unloaded: btrfs]
    [ 3206.388595] CPU: 14 PID: 29194 Comm: fsstress Not tainted 4.6.0-rc7-btrfs-next-29+ #1
    [ 3206.388595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
    [ 3206.388595] task: ffff88017ace0100 ti: ffff880171b98000 task.ti: ffff880171b98000
    [ 3206.388595] RIP: 0010:[<ffffffff8124d233>] [<ffffffff8124d233>] blkdev_issue_discard+0x5c/0x2a7
    [ 3206.388595] RSP: 0018:ffff880171b9bb80 EFLAGS: 00010246
    [ 3206.388595] RAX: ffff880171b9bc28 RBX: 000000000090d000 RCX: 0000000000000000
    [ 3206.388595] RDX: ffffffff82fa1b48 RSI: ffffffff8179f46c RDI: ffffffff82fa1b48
    [ 3206.388595] RBP: ffff880171b9bcc0 R08: 0000000000000000 R09: 0000000000000001
    [ 3206.388595] R10: ffff880171b9bce0 R11: 000000000090f000 R12: ffff880171b9bbe8
    [ 3206.388595] R13: 0000000000000010 R14: 0000000000004868 R15: 6b6b6b6b6b6b6b6b
    [ 3206.388595] FS: 00007f6182e4e700(0000) GS:ffff88023fdc0000(0000) knlGS:0000000000000000
    [ 3206.388595] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 3206.388595] CR2: 00007f617c2bbb18 CR3: 000000017ad9c000 CR4: 00000000000006e0
    [ 3206.388595] Stack:
    [ 3206.388595] 0000000000004878 0000000000000000 0000000002400040 0000000000000000
    [ 3206.388595] 0000000000000000 ffff880171b9bbe8 ffff880171b9bbb0 ffff880171b9bbb0
    [ 3206.388595] ffff880171b9bbc0 ffff880171b9bbc0 ffff880171b9bbd0 ffff880171b9bbd0
    [ 3206.388595] Call Trace:
    [ 3206.388595] [<ffffffffa042899e>] btrfs_issue_discard+0x12f/0x143 [btrfs]
    [ 3206.388595] [<ffffffffa042899e>] ? btrfs_issue_discard+0x12f/0x143 [btrfs]
    [ 3206.388595] [<ffffffffa042e862>] btrfs_discard_extent+0x87/0xde [btrfs]
    [ 3206.388595] [<ffffffffa04303b5>] btrfs_finish_extent_commit+0xb2/0x1df [btrfs]
    [ 3206.388595] [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
    [ 3206.388595] [<ffffffffa04464c4>] btrfs_commit_transaction+0x7fc/0x980 [btrfs]
    [ 3206.388595] [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
    [ 3206.388595] [<ffffffffa0459af6>] btrfs_sync_file+0x38f/0x428 [btrfs]
    [ 3206.388595] [<ffffffff811a8292>] vfs_fsync_range+0x8c/0x9e
    [ 3206.388595] [<ffffffff811a82c0>] vfs_fsync+0x1c/0x1e
    [ 3206.388595] [<ffffffff811a8417>] do_fsync+0x31/0x4a
    [ 3206.388595] [<ffffffff811a8637>] SyS_fsync+0x10/0x14
    [ 3206.388595] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
    [ 3206.388595] [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14
    [ 3206.388595] [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa

    This happens because when we call btrfs_map_block() from btrfs_discard_extent() to get a btrfs_bio structure, the device replace operation has not finished yet, but before we use the device of one of the stripes from the returned btrfs_bio structure, the device object is freed.

    This is illustrated by the following diagram.

            CPU 1                                       CPU 2

    btrfs_dev_replace_start()

    (...)

    btrfs_dev_replace_finishing()

      btrfs_start_transaction()
      btrfs_commit_transaction()

      (...)

                                                btrfs_sync_file()
                                                  btrfs_start_transaction()

                                                  (...)

                                                  btrfs_commit_transaction()
                                                    btrfs_finish_extent_commit()
                                                      btrfs_discard_extent()
                                                        btrfs_map_block()
                                                          --> returns a struct btrfs_bio
                                                              with a stripe that has a
                                                              device field pointing to
                                                              source device of the replace
                                                              operation (the device that
                                                              is being replaced)

      mutex_lock(&uuid_mutex)
      mutex_lock(&fs_info->fs_devices->device_list_mutex)
      mutex_lock(&fs_info->chunk_mutex)

      btrfs_dev_replace_update_device_in_mapping_tree()
        --> iterates the mapping tree and for each
            extent map that has a stripe pointing to
            the source device, it updates the stripe
            to point to the target device instead

      btrfs_rm_dev_replace_blocked()
        --> waits for fs_info->bio_counter to go down to 0

      btrfs_rm_dev_replace_remove_srcdev()
        --> removes source device from the list of devices

      mutex_unlock(&fs_info->chunk_mutex)
      mutex_unlock(&fs_info->fs_devices->device_list_mutex)
      mutex_unlock(&uuid_mutex)

      btrfs_rm_dev_replace_free_srcdev()
        --> frees the source device

                                                        --> iterates over all stripes of
                                                            the returned struct btrfs_bio
                                                        --> for each stripe it
                                                            dereferences its device
                                                            pointer
                                                            --> it ends up finding a
                                                                pointer to the device
                                                                used as the source
                                                                device for the replace
                                                                operation and that was
                                                                already freed

    So fix this by surrounding the call to btrfs_map_block(), and the code that uses the returned struct btrfs_bio, with calls to btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), so that the finishing phase of the device replace operation blocks until the bio counter decreases to zero before it frees the source device. This is the same approach we do at btrfs_map_bio() for example.

    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Reviewed-by: Josef Bacik <jbacik@fb.com>
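The counter-based serialization described above can be pictured with a deliberately simplified, single-threaded model. Everything here is hypothetical: struct fs_model and the helper names are illustrative, and where the kernel's finishing phase waits for the counter to reach zero, this model simply refuses to free the source device while a user is registered.

```c
#include <stdbool.h>

/* Hypothetical, single-threaded model of the replace/bio-counter handshake. */
struct fs_model {
	int bio_counter;     /* users currently holding a mapped btrfs_bio */
	bool srcdev_freed;   /* set once the source device has been freed */
};

static void bio_counter_inc(struct fs_model *fs)
{
	fs->bio_counter++;
}

static void bio_counter_dec(struct fs_model *fs)
{
	fs->bio_counter--;
}

/*
 * Finishing phase of the replace. The kernel waits for the counter to
 * drop to zero; this model just refuses to free while users remain.
 */
static bool replace_try_free_srcdev(struct fs_model *fs)
{
	if (fs->bio_counter > 0)
		return false;
	fs->srcdev_freed = true;
	return true;
}

/*
 * Discard path: hold the counter across "map block" and "use stripes".
 * Returns true if the source device stayed alive for the whole window.
 */
static bool discard_extent(struct fs_model *fs)
{
	bool safe;

	bio_counter_inc(fs);
	/* Simulate the finishing phase racing in while stripes are in use. */
	safe = !replace_try_free_srcdev(fs) && !fs->srcdev_freed;
	bio_counter_dec(fs);
	return safe;
}
```

Once the discard path has dropped its count, a later call to replace_try_free_srcdev() succeeds, which corresponds to the finishing phase proceeding only after all in-flight users are done.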
| | * | | Btrfs: fix race between device replace and chunk allocation (Filipe Manana, 2016-05-30, 1 file, -12/+9)
    While iterating and copying extents from the source device, the device replace code keeps adjusting a left cursor that is used to make sure that once we finish processing a device extent, any future writes to extents from the corresponding block group will get into both the source and target devices. This left cursor is also used for resuming the device replace operation at mount time.

    However using this left cursor to decide whether writes go into both devices or only the source device is not enough to guarantee we don't miss copying extents into the target device. There are two cases where the current approach fails. The first one is related to when there are holes in the device and they get allocated for new block groups while the device replace operation is iterating the device extents (more on this explained below). The second one is that when that loop over the device extents finishes, we start delalloc, wait for all ordered extents and then commit the current transaction, we might have got new block groups allocated that are now using a device extent that has an offset greater than or equal to the value of the left cursor, in which case writes to extents belonging to these new block groups will get issued only to the source device.

    For the first case where the current approach of using a left cursor fails, consider the source device currently has the following layout:

      [ extent bg A ]  [ hole, unallocated space ]  [extent bg B ]
      3Gb              4Gb                          5Gb

    While we are iterating the device extents from the source device using the commit root of the device tree, the following happens:

            CPU 1                                       CPU 2

    <we are at transaction N>

    scrub_enumerate_chunks()
      --> searches the device tree for
          extents belonging to the source
          device using the device tree's
          commit root
      --> 1st iteration finds extent belonging to
          block group A

          --> sets block group A to RO mode
              (btrfs_inc_block_group_ro)

          --> sets cursor left to found_key.offset
              which is 3Gb

          --> scrub_chunk() starts
              copies all allocated extents from
              block group's A stripe at source
              device into target device

                                                btrfs_alloc_chunk()
                                                  --> allocates device extent
                                                      in the range [4Gb, 5Gb[
                                                      from the source device for
                                                      a new block group C

                                                extent allocated from block
                                                group C for a direct IO,
                                                buffered write or btree node/leaf

                                                extent is written to, perhaps
                                                in response to a writepages()
                                                call from the VM or directly
                                                through direct IO

                                                the write is made only against
                                                the source device and not against
                                                the target device because the
                                                extent's offset is in the interval
                                                [4Gb, 5Gb[ which is larger than
                                                the value of cursor_left (3Gb)

          --> scrub_chunks() finishes

          --> updates left cursor from 3Gb to
              4Gb

          --> btrfs_dec_block_group_ro() sets
              block group A back to RW mode

                             <we are still at transaction N>

      --> 2nd iteration finds extent belonging to
          block group B - it did not find the new
          extent in the
range [4Gb, 5Gb[ for block group C because we are using the device tree's commit root or even because the block group's items are not all yet inserted in the respective btrees, that is, the block group is still attached to some transaction handle's new_bgs list and btrfs_create_pending_block_groups() was not called yet against that transaction handle, so the device extent items were not yet inserted into the devices tree <we are still at transaction N> --> so we end not copying anything from the newly allocated device extent from the source device to the target device So fix this by making __btrfs_map_block() always redirect writes to the target device as well, independently of the left cursor's value. With this change the left cursor is now used only for the purpose of tracking progress and allow a mount operation to resume a device replace. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
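The change in write routing described in this commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct replace_model`, `mirror_write_before()` and `mirror_write_after()` are hypothetical names rather than kernel code, and the "before" rule is a simplification of the cursor comparison the message describes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of a running device replace. */
struct replace_model {
	bool running;          /* a device replace is in progress */
	uint64_t cursor_left;  /* progress cursor, in bytes */
};

/* Before the fix (simplified): mirror only writes below the left cursor. */
static bool mirror_write_before(const struct replace_model *r, uint64_t offset)
{
	return r->running && offset < r->cursor_left;
}

/* After the fix: mirror every write while a replace is running. */
static bool mirror_write_after(const struct replace_model *r, uint64_t offset)
{
	(void)offset; /* the cursor no longer influences write routing */
	return r->running;
}
```

With the cursor at 3Gb, a write at 4Gb (the newly allocated block group C in the example) is not mirrored under the old rule but is mirrored under the new one.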
| | * | | Btrfs: fix race setting block group back to RW mode during device replace  Filipe Manana  2016-05-30  1  -5/+5
| | | | |
| | | | | After it finishes processing a device extent, the device replace code sets the block group back to RW mode and only after that sets the left cursor to match the logical end address of the block group, so that future writes into extents belonging to the block group go to both the source (old) and target (new) devices. However, from the moment we turn the block group back to RW mode there is a short time window, lasting until we update the left cursor's value, in which extents can be allocated from the block group and written to, in which case they will not be copied/written to the target (new) device.
| | | | |
| | | | | Fix this by updating the left cursor's value before turning the block group back to RW mode.
| | | | |
| | | | | Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | | | | Reviewed-by: Josef Bacik <jbacik@fb.com>
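The ordering problem in this commit can be shown with a minimal single-threaded model. It is a sketch, not the kernel implementation: `struct bg_model`, `write_can_be_lost()` and `finish_bg()` are hypothetical, and the "window" is approximated by inspecting the state between the two steps instead of running a real concurrent writer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one block group during device replace. */
struct bg_model {
	bool read_only;        /* block group is in RO mode */
	uint64_t cursor_left;  /* writes below this offset are mirrored */
	uint64_t end;          /* logical end address of the block group */
};

/* A write can be lost if the group accepts writes that are not mirrored. */
static bool write_can_be_lost(const struct bg_model *bg)
{
	return !bg->read_only && bg->cursor_left < bg->end;
}

/*
 * Finish processing a block group and report whether a lost write was
 * possible in the state between the two steps.
 */
static bool finish_bg(struct bg_model *bg, bool cursor_first)
{
	bool window;

	if (cursor_first) {
		bg->cursor_left = bg->end;  /* step 1: advance the cursor */
		window = write_can_be_lost(bg);
		bg->read_only = false;      /* step 2: back to RW mode */
	} else {
		bg->read_only = false;      /* step 1: back to RW mode */
		window = write_can_be_lost(bg);
		bg->cursor_left = bg->end;  /* step 2: advance the cursor */
	}
	return window;
}
```

Calling `finish_bg()` with `cursor_first` set to false models the old order, where the intermediate state allows an unmirrored write; setting it to true models the fixed order, where no such state exists.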
| | * | | Btrfs: fix unprotected assignment of the left cursor for device replace  Filipe Manana  2016-05-30  1  -0/+4
| | | | |
| | | | | We were assigning new values to fields of the device replace object without holding the respective lock after processing each device extent. This is important for the left cursor field which can be accessed by a concurrent task running __btrfs_map_block (which, correctly, takes the device replace lock).
| | | | |
| | | | | So change these fields while holding the device replace lock.
| | | | |
| | | | | Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | | | | Reviewed-by: Josef Bacik <jbacik@fb.com>
| | * | | Btrfs: fix race setting block group readonly during device replace  Filipe Manana  2016-05-30  3  -2/+46
| | | | |
| | | | | When we do a device replace, for each device extent we find from the source device, we set the corresponding block group to readonly mode to prevent writes into it from happening while we are copying the device extent from the source to the target device.
| | | | |
| | | | | However just before we set the block group to readonly mode some concurrent task might have already allocated an extent from it or decided it could perform a nocow write into one of its extents, which can make the device replace process miss copying an extent, since it uses the extent tree's commit root to search for extents and only once it finishes searching for all extents belonging to the block group does it set the left cursor to the logical end address of the block group. This is a problem if the respective ordered extents finish while we are searching for extents using the extent tree's commit root and no transaction commit happens while we are iterating the tree, since it's the delayed references created by the ordered extents (when they complete) that insert the extent items into the extent tree (using the non-commit root of course).
| | | | |
| | | | | Example:
| | | | |
| | | | |             CPU 1                                     CPU 2
| | | | |
| | | | |   btrfs_dev_replace_start()
| | | | |    btrfs_scrub_dev()
| | | | |     scrub_enumerate_chunks()
| | | | |      --> finds device extent belonging to
| | | | |          block group X
| | | | |
| | | | |                            <transaction N starts>
| | | | |
| | | | |                                                   starts buffered write
| | | | |                                                   against some inode
| | | | |
| | | | |                                                   writepages is run against
| | | | |                                                   that inode forcing delalloc
| | | | |                                                   to run
| | | | |
| | | | |                                                   btrfs_writepages()
| | | | |                                                    extent_writepages()
| | | | |                                                     extent_write_cache_pages()
| | | | |                                                      __extent_writepage()
| | | | |                                                       writepage_delalloc()
| | | | |                                                        run_delalloc_range()
| | | | |                                                         cow_file_range()
| | | | |                                                          btrfs_reserve_extent()
| | | | |                                                           --> allocates an extent
| | | | |                                                               from block group X
| | | | |                                                               (which is not yet
| | | | |                                                                in RO mode)
| | | | |                                                          btrfs_add_ordered_extent()
| | | | |                                                           --> creates ordered extent Y
| | | | |                                                     flush_epd_write_bio()
| | | | |                                                      --> bio against the extent from
| | | | |                                                          block group X is submitted
| | | | |
| | | | |     btrfs_inc_block_group_ro(bg X)
| | | | |      --> sets block group X to readonly
| | | | |
| | | | |     scrub_chunk(bg X)
| | | | |      scrub_stripe(device extent from srcdev)
| | | | |       --> keeps searching for extent items
| | | | |           belonging to the block group using
| | | | |           the extent tree's commit root
| | | | |       --> it never blocks due to
| | | | |           fs_info->scrub_pause_req as no
| | | | |           one tries to commit transaction N
| | | | |       --> copies all extents found from the
| | | | |           source device into the target device
| | | | |       --> finishes search loop
| | | | |
| | | | |                                                   bio completes
| | | | |
| | | | |                                                   ordered extent Y completes
| | | | |                                                   and creates delayed data
| | | | |                                                   reference which will add an
| | | | |                                                   extent item to the extent
| | | | |                                                   tree when run (typically
| | | | |                                                   at transaction commit time)
| | | | |
| | | | |                                                     --> so the task doing the
| | | | |                                                         scrub/device replace
| | | | |                                                         at CPU 1 misses this
| | | | |                                                         and does not copy this
| | | | |                                                         extent into the new/target
| | | | |                                                         device
| | | | |
| | | | |     btrfs_dec_block_group_ro(bg X)
| | | | |      --> turns block group X back to RW mode
| | | | |
| | | | |     dev_replace->cursor_left is set to the
| | | | |     logical end offset of block group X
| | | | |
| | | | | So fix this by waiting for all cow and nocow writes after setting a block group to readonly mode.
| | | | |
| | | | | Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | | | | Reviewed-by: Josef Bacik <jbacik@fb.com>
| | * | | Btrfs: fix race between device replace and block group removal  Filipe Manana  2016-05-30  1  -0/+11
| | | | |
| | | | | When it's finishing, the device replace code iterates all extent maps representing block groups and, for each one that has a stripe that refers to the source device, it replaces its device with the target device. However, when it replaces the source device with the target device, the target device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID); only afterwards is its ID changed to match the one from the source device. This leads to races with the chunk removal code that can temporarily see a device with an ID of 0ULL and then attempt to use that ID to remove items from the device tree and fail, causing a transaction abort:
| | | | |
| | | | | [ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to /dev/sde finished
| | | | | [ 9238.594377] ------------[ cut here ]------------
| | | | | [ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 btrfs_remove_chunk+0x2e5/0x793 [btrfs]
| | | | | [ 9238.594403] BTRFS: Transaction aborted (error 1)
| | | | | [ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg i2c_core serio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
| | | | | [ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 4.6.0-rc7-btrfs-next-29+ #1
| | | | | [ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
| | | | | [ 9238.594421] 0000000000000000 ffff88017f1dbc60 ffffffff8126b42c ffff88017f1dbcb0
| | | | | [ 9238.594422] 0000000000000000 ffff88017f1dbca0 ffffffff81052b14 00000ad37f1dbd18
| | | | | [ 9238.594423] 0000000000000001 ffff88018068a558 ffff88005c4b9c00 ffff880233f60db0
| | | | | [ 9238.594424] Call Trace:
| | | | | [ 9238.594428] [<ffffffff8126b42c>] dump_stack+0x67/0x90
| | | | | [ 9238.594430] [<ffffffff81052b14>] __warn+0xc2/0xdd
| | | | | [ 9238.594432] [<ffffffff81052b7a>] warn_slowpath_fmt+0x4b/0x53
| | | | | [ 9238.594434] [<ffffffff8116c311>] ? kmem_cache_free+0x128/0x188
| | | | | [ 9238.594450] [<ffffffffa04d43f5>] btrfs_remove_chunk+0x2e5/0x793 [btrfs]
| | | | | [ 9238.594452] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
| | | | | [ 9238.594464] [<ffffffffa04a26fa>] btrfs_delete_unused_bgs+0x317/0x382 [btrfs]
| | | | | [ 9238.594476] [<ffffffffa04a961d>] cleaner_kthread+0x1ad/0x1c7 [btrfs]
| | | | | [ 9238.594489] [<ffffffffa04a9470>] ? btree_invalidatepage+0x8e/0x8e [btrfs]
| | | | | [ 9238.594490] [<ffffffff8106f403>] kthread+0xd4/0xdc
| | | | | [ 9238.594494] [<ffffffff8149e242>] ret_from_fork+0x22/0x40
| | | | | [ 9238.594495] [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286
| | | | | [ 9238.594496] ---[ end trace 183efbe50275f059 ]---
| | | | |
| | | | | The sequence of steps leading to this is like the following:
| | | | |
| | | | |             CPU 1                                     CPU 2
| | | | |
| | | | |   btrfs_dev_replace_finishing()
| | | | |
| | | | |     at this point
| | | | |      dev_replace->tgtdev->devid ==
| | | | |      BTRFS_DEV_REPLACE_DEVID (0ULL)
| | | | |
| | | | |     ...
| | | | |
| | | | |     btrfs_start_transaction()
| | | | |     btrfs_commit_transaction()
| | | | |
| | | | |                                                   btrfs_delete_unused_bgs()
| | | | |                                                     btrfs_remove_chunk()
| | | | |
| | | | |                                                       looks up for the extent map
| | | | |                                                       corresponding to the chunk
| | | | |
| | | | |                                                       lock_chunks() (chunk_mutex)
| | | | |                                                       check_system_chunk()
| | | | |                                                       unlock_chunks() (chunk_mutex)
| | | | |
| | | | |     locks fs_info->chunk_mutex
| | | | |
| | | | |     btrfs_dev_replace_update_device_in_mapping_tree()
| | | | |       --> iterates fs_info->mapping_tree and
| | | | |           replaces the device in every extent
| | | | |           map's map->stripes[] with
| | | | |           dev_replace->tgtdev, which still has
| | | | |           an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
| | | | |
| | | | |                                                       iterates over all stripes from
| | | | |                                                       the extent map
| | | | |
| | | | |                                                         --> calls btrfs_free_dev_extent()
| | | | |                                                             passing it the target device
| | | | |                                                             that still has an ID of 0ULL
| | | | |
| | | | |                                                         --> btrfs_free_dev_extent() fails
| | | | |                                                           --> aborts current transaction
| | | | |
| | | | |     finishes setting up the target device,
| | | | |     namely it sets tgtdev->devid to the value
| | | | |     of srcdev->devid (which is necessarily > 0)
| | | | |
| | | | |     frees the srcdev
| | | | |
| | | | |     unlocks fs_info->chunk_mutex
| | | | |
| | | | | So fix this by taking the device list mutex while processing the stripes for the chunk's extent map. This is similar to the race between device replace and block group creation that was fixed by commit 50460e37186a ("Btrfs: fix race when finishing dev replace leading to transaction abort").
| | | | |
| | | | | Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | | | | Reviewed-by: Josef Bacik <jbacik@fb.com>
| | * | | Btrfs: fix race between readahead and device replace/removal  Filipe Manana  2016-05-30  1  -0/+2
| |/ / /
| | | | The list of devices is protected by the device_list_mutex and the device replace code, in its finishing phase, correctly takes that mutex before removing the source device from that list. However the readahead code was iterating that list without acquiring the respective mutex, leading to crashes later on due to invalid memory accesses:
| | | |
| | | | [125671.831036] general protection fault: 0000 [#1] PREEMPT SMP
| | | | [125671.832129] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis tpm ppdev evdev parport_pc psmouse sg parport processor ser
| | | | [125671.834973] CPU: 10 PID: 19603 Comm: kworker/u32:19 Tainted: G W 4.6.0-rc7-btrfs-next-29+ #1
| | | | [125671.834973] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
| | | | [125671.834973] Workqueue: btrfs-readahead btrfs_readahead_helper [btrfs]
| | | | [125671.834973] task: ffff8801ac520540 ti: ffff8801ac918000 task.ti: ffff8801ac918000
| | | | [125671.834973] RIP: 0010:[<ffffffff81270479>] [<ffffffff81270479>] __radix_tree_lookup+0x6a/0x105
| | | | [125671.834973] RSP: 0018:ffff8801ac91bc28 EFLAGS: 00010206
| | | | [125671.834973] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6a RCX: 0000000000000000
| | | | [125671.834973] RDX: 0000000000000000 RSI: 00000000000c1bff RDI: ffff88002ebd62a8
| | | | [125671.834973] RBP: ffff8801ac91bc70 R08: 0000000000000001 R09: 0000000000000000
| | | | [125671.834973] R10: ffff8801ac91bc70 R11: 0000000000000000 R12: ffff88002ebd62a8
| | | | [125671.834973] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000c1bff
| | | | [125671.834973] FS: 0000000000000000(0000) GS:ffff88023fd40000(0000) knlGS:0000000000000000
| | | | [125671.834973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
| | | | [125671.834973] CR2: 000000000073cae4 CR3: 00000000b7723000 CR4: 00000000000006e0
| | | | [125671.834973] Stack:
| | | | [125671.834973] 0000000000000000 ffff8801422d5600 ffff8802286bbc00 0000000000000000
| | | | [125671.834973] 0000000000000001 ffff8802286bbc00 00000000000c1bff 0000000000000000
| | | | [125671.834973] ffff88002e639eb8 ffff8801ac91bc80 ffffffff81270541 ffff8801ac91bcb0
| | | | [125671.834973] Call Trace:
| | | | [125671.834973] [<ffffffff81270541>] radix_tree_lookup+0xd/0xf
| | | | [125671.834973] [<ffffffffa04ae6a6>] reada_peer_zones_set_lock+0x3e/0x60 [btrfs]
| | | | [125671.834973] [<ffffffffa04ae8b9>] reada_pick_zone+0x29/0x103 [btrfs]
| | | | [125671.834973] [<ffffffffa04af42f>] reada_start_machine_worker+0x129/0x2d3 [btrfs]
| | | | [125671.834973] [<ffffffffa04880be>] btrfs_scrubparity_helper+0x185/0x3aa [btrfs]
| | | | [125671.834973] [<ffffffffa0488341>] btrfs_readahead_helper+0xe/0x10 [btrfs]
| | | | [125671.834973] [<ffffffff81069691>] process_one_work+0x271/0x4e9
| | | | [125671.834973] [<ffffffff81069dda>] worker_thread+0x1eb/0x2c9
| | | | [125671.834973] [<ffffffff81069bef>] ? rescuer_thread+0x2b3/0x2b3
| | | | [125671.834973] [<ffffffff8106f403>] kthread+0xd4/0xdc
| | | | [125671.834973] [<ffffffff8149e242>] ret_from_fork+0x22/0x40
| | | | [125671.834973] [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286
| | | |
| | | | So fix this by taking the device_list_mutex in the readahead code. We can't use the lighter approach of an rcu_read_lock() and rcu_read_unlock() pair together with a list_for_each_entry_rcu() call here, because we end up calling sleeping functions (kzalloc()) in the respective code path.
| | | |
| | | | Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | | | Reviewed-by: Josef Bacik <jbacik@fb.com>
* | | | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client  Linus Torvalds  2016-06-04  13  -178/+138
|\ \ \ \
| | | | | Pull Ceph fixes from Sage Weil:
| | | | |  "We have a few follow-up fixes for the libceph refactor from Ilya, and then some cephfs + fscache fixes from Zheng.
| | | | |
| | | | |   The first two FS-Cache patches are acked by David Howells and deemed trivial enough to go through our tree. The rest fix some issues with the ceph fscache handling (disable cache for inodes opened for write, and simplify the revalidation logic accordingly, dropping the now-unnecessary work queue)"
| | | | |
| | | | | * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
| | | | |   ceph: use i_version to check validity of fscache
| | | | |   ceph: improve fscache revalidation
| | | | |   ceph: disable fscache when inode is opened for write
| | | | |   ceph: avoid unnecessary fscache invalidation/revlidation
| | | | |   ceph: call __fscache_uncache_page() if readpages fails
| | | | |   FS-Cache: make check_consistency callback return int
| | | | |   FS-Cache: wake write waiter after invalidating writes
| | | | |   libceph: use %s instead of %pE in dout()s
| | | | |   libceph: put request only if it's done in handle_reply()
| | | | |   libceph: change ceph_osdmap_flag() to take osdc
| * | | | ceph: use i_version to check validity of fscache  Yan, Zheng  2016-06-01  1  -0/+3
| | | | |
| | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | ceph: improve fscache revalidation  Yan, Zheng  2016-06-01  4  -83/+41
| | | | |
| | | | | There are several issues in the fscache revalidation code.
| | | | |
| | | | | - In ceph_revalidate_work(), fscache_invalidate() is called when fscache_check_consistency() returns 0. This is completely wrong because 0 means the cache is valid.
| | | | |
| | | | | - handle_cap_grant() calls ceph_queue_revalidate() if the client already has CAP_FILE_CACHE. This code is confusing. The client should revalidate the cache each time it gets CAP_FILE_CACHE anew.
| | | | |
| | | | | - In handle_cap_grant(), fscache_invalidate() is called if the MDS revokes CAP_FILE_CACHE. This is inconsistent with the case where the inode gets evicted. In the latter case, the cache is not discarded, and the client may use the cache when the inode is reloaded.
| | | | |
| | | | | This patch moves the fscache revalidation into ceph_get_caps(). The client revalidates the cache after it gets CAP_FILE_CACHE. i_rdcache_gen should remain constant while CAP_FILE_CACHE is used. If i_fscache_gen is not equal to i_rdcache_gen, the client needs to check the cache's consistency.
| | | | |
| | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
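The two rules stated in this message (a consistency result of 0 means the cache is valid, and a generation mismatch triggers a consistency check) can be sketched in plain C. Everything here is a hypothetical model: `struct inode_model` and the three helpers are illustrative names, and updating the fscache generation after a check is an assumption about how the counters are kept in sync, not something the message spells out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two generation counters named above. */
struct inode_model {
	unsigned int rdcache_gen;  /* models i_rdcache_gen */
	unsigned int fscache_gen;  /* models i_fscache_gen */
};

/* A consistency check result of 0 means the cache is valid. */
static bool should_invalidate(int check_result)
{
	return check_result != 0;
}

/* A generation mismatch means the cache must be checked before use. */
static bool needs_consistency_check(const struct inode_model *in)
{
	return in->fscache_gen != in->rdcache_gen;
}

/* Assumption: after a check, the fscache generation catches up. */
static void mark_revalidated(struct inode_model *in)
{
	in->fscache_gen = in->rdcache_gen;
}
```

The first helper captures the bug described in the first bullet: invalidating on a result of 0 inverts the meaning of the check.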
| * | | | ceph: disable fscache when inode is opened for write  Yan, Zheng  2016-06-01  4  -53/+52
| | | | |
| | | | | All other filesystems do not add dirty pages to fscache. They all disable fscache when inode is opened for write. Only ceph adds dirty pages to fscache, but the code is buggy.
| | | | |
| | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | ceph: avoid unnecessary fscache invalidation/revlidation  Yan, Zheng  2016-06-01  1  -6/+3
| | | | |
| | | | | ceph_fill_file_size() has already called ceph_fscache_invalidate() if it returns true.
| | | | |
| | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>