summaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/i915/intel_hangcheck.c
Commit message (Collapse)AuthorAgeFilesLines
* drm/i915: Declare the driver wedged if hangcheck makes no progressChris Wilson2018-06-141-1/+16
| | | | | | | | | | | | | | | | Hangcheck is our back up in case the GPU or the driver gets stuck. It detects when the GPU is not making any progress and issues a GPU reset. However, if the driver is failing to make any progress, we can get ourselves into a situation where we continually try resetting the GPU to no avail. Employ a second timeout such that if we continue to see the same seqno (the stalled engine has made no progress at all) over the course of several hangchecks, declare the driver wedged and attempt to start afresh. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180602104853.17140-1-chris@chris-wilson.co.uk Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
* drm/i915: Reset the hangcheck timestamp before repeating a seqnoChris Wilson2018-05-031-0/+1
| | | | | | | | | | | | In the unusual circumstance where we reuse a seqno (for example, in igt), make sure that we reset the hangcheck timestamp before it sees the same seqno again. References: https://bugs.freedesktop.org/show_bug.cgi?id=106215 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180502220313.6459-1-chris@chris-wilson.co.uk
* drm/i915: Compile out engine debug for releaseChris Wilson2018-04-261-1/+1
| | | | | | | | | | | | The majority of the engine state dumping is too voluminous to be useful outside of a controlled setup, though a few do accompany severe errors. Keep the debug dumps next to the errors, but hide the others behind a CI compile flag. This becomes more useful when adding more dumps to latency sensitive paths. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180426103219.22181-1-chris@chris-wilson.co.uk
* drm/i915: Add control flags to i915_handle_error()Chris Wilson2018-03-201-7/+6Star
| | | | | | | | | | | | | | | | | | | | Not all callers want the GPU error to handled in the same way, so expose a control parameter. In the first instance, some callers do not want the heavyweight error capture so add a bit to request the state to be captured and saved. v2: Pass msg down to i915_reset/i915_reset_engine so that we include the reason for the reset in the dev_notice(), superseding the earlier option to not print that notice. v3: Stash the reason inside the i915->gpu_error to handover to the direct reset from the blocking waiter. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jeff McGee <jeff.mcgee@intel.com> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Cc: Michel Thierry <michel.thierry@intel.com> Reviewed-by: Michel Thierry <michel.thierry@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180320100449.1360-2-chris@chris-wilson.co.uk
* drm/i915: Specify which engines to reset following semaphore/event lockupsChris Wilson2018-03-201-2/+2
| | | | | | | | | | | | | | If the GPU is stuck waiting for an event or for a semaphore, we need to reset the GPU in order to recover. We have to tell the reset routine which engines we want reset, but we were still using the old interface and declaring it as "not-fatal". Fixes: 14b730fcb8d9 ("drm/i915/tdr: Prepare error handler to accept mask of hung engines") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Cc: Michel Thierry <michel.thierry@intel.com> Reviewed-by: Michel Thierry <michel.thierry@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180320100449.1360-1-chris@chris-wilson.co.uk
* drm/i915: Add some newlines to intel_engine_dump() headersChris Wilson2018-02-051-1/+1
| | | | | | | | | | | The headers should be on a separate line for consistency, so add the missing trailing newline in a few intel_engine_dump() callers. Reported-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180205100618.11001-1-chris@chris-wilson.co.uk
* drm/i915: Always run hangcheck while the GPU is busyChris Wilson2018-01-311-6/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, we relied on only running the hangcheck while somebody was waiting on the GPU, in order to minimise the amount of time hangcheck had to run. (If nobody was watching the GPU, nobody would notice if the GPU wasn't responding -- eventually somebody would care and so kick hangcheck into action.) However, this falls apart from around commit 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion"), as not all waiters declare themselves to hangcheck and so we could switch off hangcheck and miss GPU hangs even when waiting under the struct_mutex. If we enable hangcheck from the first request submission, and let it run until the GPU is idle again, we forgo all the complexity involved with only enabling around waiters. We just have to remember to be careful that we do not declare a GPU hang when idly waiting for the next request to be come ready, as we will run hangcheck continuously even when the engines are stalled waiting for external events. This should be true already as we should only be tracking requests submitted to hardware for execution as an indicator that the engine is busy. Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion" Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104840 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180129144104.3921-1-chris@chris-wilson.co.uk Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
* drm/i915: Remove pointer indirection for hangcheck_state localChris Wilson2017-12-191-5/+5
| | | | | | | | | | | | Use the local on-stack struct directly rather than hide it behind a pointer. This should be both clearer for the reader and the compiler (we rely on the compiler seeing through the functions to spot uninitialized uses of the local). Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20171219130948.6282-1-chris@chris-wilson.co.uk
* drm/i915: Show engine state when hangcheck detects a stallChris Wilson2017-12-141-1/+6
| | | | | | | | | | | | Knowing the state of the engine when hangcheck thinks it is stalling is useful for both debugging hangcheck itself and the potential cause of an unwanted stall. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20171214122613.26134-1-chris@chris-wilson.co.uk
* drm/i915: Remove obsolete ringbuffer emission for gen8+Chris Wilson2017-11-201-32/+12Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since removing the module parameter to force selection of ringbuffer emission for gen8, the code is defunct. Remove it. To put the difference into perspective, a couple of microbenchmarks (bdw i7-5557u, 20170324): ring execlists exec continuous nops on all rings: 1.491us 2.223us exec sequential nops on each ring: 12.508us 53.682us single nop + sync: 9.272us 30.291us vblank_mode=0 glxgears: ~11000fps ~9000fps Since the earlier submission, gen8 ringbuffer submission has fallen further and further behind in features. So while ringbuffer may hold the throughput crown, in terms of interactive latency, execlists is much better. Alas, we have no convenient metrics for such, other than demonstrating things we can do with execlists but can not using legacy ringbuffer submission. We have made a few improvements to lowlevel execlists throughput, and ringbuffer currently panics on boot! (bdw i7-5557u, 20171026): ring execlists exec continuous nops on all rings: n/a 1.921us exec sequential nops on each ring: n/a 44.621us single nop + sync: n/a 21.953us vblank_mode=0 glxgears: n/a ~18500fps References: https://bugs.freedesktop.org/show_bug.cgi?id=87725 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Once-upon-a-time-Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20171120205504.21892-2-chris@chris-wilson.co.uk
* drm/i915: Rename global i915 to i915_modparamsMichal Wajdeczko2017-09-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Our global struct with params is named exactly the same way as new preferred name for the drm_i915_private function parameter. To avoid such name reuse lets use different name for the global. v5: pure rename v6: fix Credits-to: Coccinelle @@ identifier n; @@ ( - i915.n + i915_modparams.n ) Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Ville Syrjala <ville.syrjala@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Acked-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20170919193846.38060-1-michal.wajdeczko@intel.com
* drm/i915: Check execlist/ring status during hangcheckChris Wilson2017-07-271-1/+1
| | | | | | | | | | | | | Before we declare an engine as idle, check if there are any pending execlist context-switches and if the ring itself reports as idle. Otherwise, we may be left in a situation where we miss a crucial execlist event (or something more sinister) yet the requests complete. Since the seqno write happens, we believe the engine to be truly idle. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Michel Thierry <michel.thierry@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20170721123238.16428-5-chris@chris-wilson.co.uk Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
* drm/i915: Avoid format string expansion from engine namesKees Cook2017-04-191-1/+1
| | | | | | | | | While highly unlikely, this makes sure that the string built from engine names won't be processed as a format string. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170411045630.GA6612@beast
* drm/i915: Add initial selftests for hang detection and resetsChris Wilson2017-02-131-0/+4
| | | | | | | | | Check that we can reset the GPU and continue executing from the next request. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170213171558.20942-47-chris@chris-wilson.co.uk
* drm/i915: Disable hangcheck when wedgedChris Wilson2016-11-221-0/+3
| | | | | | | | | | | | If the gpu reset fails and the machine is terminally wedged, further hangchecks achieve nothing but noise. Disable them, with a corollary that we re-enable hangchecking after a successful GPU reset in case the user is artificially bringing the machine back to life through the debug interface. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20161122144121.7379-2-chris@chris-wilson.co.uk
* drm/i915: Decouple hang detection from hangcheck periodMika Kuoppala2016-11-211-56/+52Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hangcheck state accumulation has gained more steps along the years, like head movement and more recently the subunit inactivity check. As the subunit sampling is only done if the previous state check showed inactivity, we have added more stages (and time) to reach a hang verdict. Asymmetric engine states led to different actual weight of 'one hangcheck unit' and it was demonstrated in some hangs that due to difference in stages, simpler engines were accused falsely of a hang as their scoring was much more quicker to accumulate above the hang treshold. To completely decouple the hangcheck guilty score from the hangcheck period, convert hangcheck score to a rough period of inactivity measurement. As these are tracked as jiffies, they are meaningful also across reset boundaries. This makes finding a guilty engine more accurate across multi engine activity scenarios, especially across asymmetric engines. We lose the ability to detect cross batch malicious attempts to hinder the progress. Plan is to move this functionality to be part of context banning which is more natural fit, later in the series. v2: use time_before macros (Chris) reinstate the pardoning of moving engine after hc (Chris) v3: avoid global state for per engine stall detection (Chris) v4: take timeline last retirement into account (Chris) v5: do debug print on pardoning, split out retirement timestamp (Chris) Cc: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
* drm/i915: Split up hangcheck phasesMika Kuoppala2016-11-211-104/+137
| | | | | | | | | | | | | | In order to simplify hangcheck state keeping, split hangcheck per engine loop in three phases: state load, action, state save. Add few more hangcheck actions to separate between seqno, head and subunit movements. This helps to gather all the hangcheck actions under a single switch umbrella. Cc: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
* drm/i915: Move hangcheck code out from i915_irq.cMika Kuoppala2016-11-021-0/+450
Create new file for hangcheck specific code, intel_hangcheck.c, and move all related code in it. v2: s/intel_engine_hangcheck/intel_engine (Chris) No functional changes. Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1478018583-5816-1-git-send-email-mika.kuoppala@intel.com