7 files changed, 449 insertions, 69 deletions
diff --git a/docs/devel/ci-jobs.rst.inc b/docs/devel/ci-jobs.rst.inc
index 92e25872aa..1f28fec0d0 100644
--- a/docs/devel/ci-jobs.rst.inc
+++ b/docs/devel/ci-jobs.rst.inc
@@ -1,3 +1,5 @@
+.. _ci_var:
+
 Custom CI/CD variables
 ======================
 
@@ -28,7 +30,113 @@ For further information about how to set these variables, please refer to::
 
   https://docs.gitlab.com/ee/user/project/push_options.html#push-options-for-gitlab-cicd
 
-Here is a list of the most used variables:
+Setting aliases in your git config
+----------------------------------
+
+You can use aliases to make it easier to push branches with different
+CI configurations. For example, define aliases for triggering CI:
+
+.. code::
+
+   git config --local alias.push-ci "push -o ci.variable=QEMU_CI=1"
+   git config --local alias.push-ci-now "push -o ci.variable=QEMU_CI=2"
+
+Which lets you run:
+
+.. code::
+
+   git push-ci
+
+to create the pipeline, or:
+
+.. code::
+
+   git push-ci-now
+
+to create and run the pipeline.
+
+
+Variable naming and grouping
+----------------------------
+
+The variables used by QEMU's CI configuration are grouped together
+in a handful of namespaces:
+
+ * QEMU_JOB_nnn - variables to be defined in individual jobs
+   or templates, to influence the shared rules defined in the
+   .base_job_template.
+
+ * QEMU_CI_nnn - variables to be set by contributors in their
+   repository CI settings, or as git push variables, to influence
+   which jobs get run in a pipeline.
+
+ * nnn - other misc variables not falling into the above
+   categories, or using different names for historical reasons
+   and not yet converted.
+
+Maintainer controlled job variables
+-----------------------------------
+
+The following variables may be set when defining a job in the
+CI configuration file.
+
+QEMU_JOB_CIRRUS
+~~~~~~~~~~~~~~~
+
+The job makes use of Cirrus CI infrastructure, requiring the
+configuration setup for cirrus-run to be present in the repository.
+
+QEMU_JOB_OPTIONAL
+~~~~~~~~~~~~~~~~~
+
+The job is expected to be successful in general, but is not run
+by default due to the need to conserve limited CI resources. It is
+available to be started manually by the contributor in the CI
+pipelines UI.
+
+QEMU_JOB_ONLY_FORKS
+~~~~~~~~~~~~~~~~~~~
+
+The job results are only of interest to contributors prior to
+submitting code. They are not required as part of the gating
+CI pipeline.
+
+QEMU_JOB_SKIPPED
+~~~~~~~~~~~~~~~~
+
+The job is not reliably successful in general, so is not
+currently suitable to be run by default. Ideally this should
+be a temporary marker until the problems can be addressed, or
+the job permanently removed.
+
+QEMU_JOB_PUBLISH
+~~~~~~~~~~~~~~~~
+
+The job is for publishing content after a branch has been
+merged into the upstream default branch.
+
+QEMU_JOB_AVOCADO
+~~~~~~~~~~~~~~~~
+
+The job runs the Avocado integration test suite.
+
+Contributor controlled runtime variables
+----------------------------------------
+
+The following variables may be set by contributors to control
+job execution.
+
+QEMU_CI
+~~~~~~~
+
+By default, no pipelines will be created on contributor forks
+in order to preserve CI credits.
+
+Set this variable to 1 to create the pipelines, but leave all
+the jobs to be manually started from the UI.
+
+Set this variable to 2 to create the pipelines and run all
+the jobs immediately, as was the historical behaviour.
 
 QEMU_CI_AVOCADO_TESTING
 ~~~~~~~~~~~~~~~~~~~~~~~
@@ -38,6 +146,12 @@ these artifacts are not already cached, downloading them make the jobs
 reach the timeout limit). Set this variable to have the tests using the
 Avocado framework run automatically.
 
+Other misc variables
+--------------------
+
+These variables are primarily to control execution of jobs on
+private runners.
+
 AARCH64_RUNNER_AVAILABLE
 ~~~~~~~~~~~~~~~~~~~~~~~~
 If you've got access to an aarch64 host that can be used as a gitlab-CI
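Note on the ``QEMU_CI`` values documented in the hunk above: the following is a minimal Python sketch, separate from the patch itself, of how the documented values map to pipeline behaviour and to the push option used by the aliases. The function names are hypothetical, and treating any value other than 1 or 2 as "no pipeline" is an assumption; the text only describes the default and the values 1 and 2.

```python
def qemu_ci_push_option(level):
    """Build the git push option that sets QEMU_CI, as used by the aliases."""
    if level not in (1, 2):
        raise ValueError("QEMU_CI is documented only for values 1 and 2")
    return f"ci.variable=QEMU_CI={level}"


def describe_qemu_ci(level):
    """Summarise the documented pipeline behaviour for a QEMU_CI value."""
    if level == 1:
        return "pipeline created, jobs started manually from the UI"
    if level == 2:
        return "pipeline created, all jobs run immediately"
    # Assumption: unset or any other value behaves like the default.
    return "no pipeline created"
```

Passing the string returned by ``qemu_ci_push_option(1)`` to ``git push -o`` corresponds to what the ``push-ci`` alias does.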
diff --git a/docs/devel/ci.rst b/docs/devel/ci.rst
index d106610096..ed88a2010b 100644
--- a/docs/devel/ci.rst
+++ b/docs/devel/ci.rst
@@ -1,12 +1,13 @@
+.. _ci:
+
 ==
 CI
 ==
 
-QEMU has configurations enabled for a number of different CI services.
-The most up to date information about them and their status can be
-found at::
-
-   https://wiki.qemu.org/Testing/CI
+Most of QEMU's CI is run on GitLab's infrastructure, although a number
+of other CI services are used for specialised purposes. The most
+up-to-date information about them and their status can be found on the
+`project wiki testing page <https://wiki.qemu.org/Testing/CI>`_.
 
 .. include:: ci-definitions.rst.inc
 .. include:: ci-jobs.rst.inc
diff --git a/docs/devel/index-tcg.rst b/docs/devel/index-tcg.rst
index 0b0ad12c22..7b9760b26f 100644
--- a/docs/devel/index-tcg.rst
+++ b/docs/devel/index-tcg.rst
@@ -13,3 +13,4 @@ are only implementing things for HW accelerated hypervisors.
    multi-thread-tcg
    tcg-icount
    tcg-plugins
+   replay
diff --git a/docs/devel/replay.rst b/docs/devel/replay.rst
new file mode 100644
index 0000000000..0244be8b9c
--- /dev/null
+++ b/docs/devel/replay.rst
@@ -0,0 +1,306 @@
+..
+   Copyright (c) 2022, ISP RAS
+   Written by Pavel Dovgalyuk and Alex Bennée
+
+=======================
+Execution Record/Replay
+=======================
+
+Core concepts
+=============
+
+Record/replay functions are used for the deterministic replay of QEMU
+execution. Execution recording writes a log of non-deterministic events, which
+can later be used to replay the execution anywhere and an unlimited
+number of times. Execution replaying reads the log and replays all
+non-deterministic events including external input, hardware clocks,
+and interrupts.
+
+Several parts of QEMU include function calls that record and replay
+the event log.
+Device models that have non-deterministic input from external devices were
+changed to write every external event into the execution log immediately.
+E.g. network packets are written into the log when they arrive at the virtual
+network adapter.
+
+All non-deterministic events come from these devices. But to
+replay them we need to know at which moments they occur. We specify
+these moments by counting the number of instructions executed between
+every pair of consecutive events.
+
+Academic papers describing the deterministic replay implementation:
+
+* `Deterministic Replay of System's Execution with Multi-target QEMU Simulator for Dynamic Analysis and Reverse Debugging <https://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html>`_
+* `Don't panic: reverse debugging of kernel drivers <https://dl.acm.org/citation.cfm?id=2786805.2803179>`_
+
+Modifications to QEMU include:
+
+ * wrappers for clock and time functions to save their return values in the log
+ * saving different asynchronous events (e.g. system shutdown) into the log
+ * synchronization of bottom-half execution
+ * synchronization of the threads from the thread pool
+ * recording/replaying user input (mouse, keyboard, and microphone)
+ * adding internal checkpoints for cpu and io synchronization
+ * network filter for recording and replaying the packets
+ * block driver for making block layer deterministic
+ * serial port input record and replay
+ * recording of random numbers obtained from the external sources
+
+Instruction counting
+--------------------
+
+QEMU must run in icount mode to use the record/replay feature. icount was
+designed to allow deterministic execution in the absence of external inputs
+to the virtual machine. We also use icount to control the occurrence of
+non-deterministic events. The number of instructions elapsed since the last event
+is written to the log while recording the execution. In replay mode we
+can predict when to inject that event using the instruction counter.
+
+Locking and thread synchronisation
+----------------------------------
+
+Previously the synchronisation of the main thread and the vCPU thread
+was ensured by the holding of the BQL. However the trend has been to
+reduce the time the BQL was held across the system including under TCG
+system emulation. As it is important that batches of events are kept
+in sequence (e.g. expiring timers and checkpoints in the main thread
+while instruction checkpoints are written by the vCPU thread) we need
+another lock to keep things in lock-step. This role is now handled by
+the replay_mutex_lock. It used to be held only for each event being
+written but now it is held for a whole execution period. This results
+in a deterministic ping-pong between the two main threads.
+
+As the BQL is now a finer-grained lock than the replay_lock, it is almost
+certainly a bug, and a source of deadlocks, to take the
+replay_mutex_lock while the BQL is held. This is enforced by an assert.
+While the unlocks are usually in the reverse order, this is not
+necessary; you can drop the replay_lock while holding the BQL, without
+doing a more complicated unlock_iothread/replay_unlock/lock_iothread
+sequence.
+
+Checkpoints
+-----------
+
+Replaying the execution of a virtual machine is bound by sources of
+non-determinism. These are inputs from clocks and peripheral devices,
+and QEMU thread scheduling. Thread scheduling affects the processing of events
+from timers, asynchronous input-output, and bottom halves.
+
+Invocations of timers are coupled with clock reads and changes to the state
+of the virtual machine. Reads produce non-deterministic data taken from
+the host clock, and VM state changes must preserve their order: their relative
+order in replay mode must replicate the order of callbacks in record mode.
+To preserve this order we use checkpoints. When a specific clock is processed
+in record mode we save a special "checkpoint" event to the log.
+Checkpoints here do not refer to virtual machine snapshots. They are just
+record/replay events used for synchronization.
+
+QEMU in replay mode will try to invoke timer processing at arbitrary moments
+in time. That's why we do not process a group of timers until the checkpoint
+event has been read from the log. Such an event allows synchronizing CPU
+execution and timer events.
+
+Two other checkpoints govern the "warping" of the virtual clock.
+While the virtual machine is idle, the virtual clock increments at
+1 ns per *real time* nanosecond.  This is done by setting up a timer
+(called the warp timer) on the virtual real time clock, so that the
+timer fires at the next deadline of the virtual clock; the virtual clock
+is then incremented (which is called "warping" the virtual clock) as
+soon as the timer fires or the CPUs need to go out of the idle state.
+Two functions are used for this purpose; because these actions change
+virtual machine state and must be deterministic, each of them creates a
+checkpoint. ``icount_start_warp_timer`` checks if the CPUs are idle and if so
+starts accounting real time to the virtual clock. ``icount_account_warp_timer``
+is called when the CPUs get an interrupt or when the warp timer fires,
+and it warps the virtual clock by the amount of real time that has passed
+since ``icount_start_warp_timer``.
+
+Virtual devices
+===============
+
+The record/replay mechanism, which can be enabled through icount mode,
+expects virtual devices to satisfy the following requirement:
+everything that affects the guest state during execution in icount mode
+should be deterministic.
+
+Timers
+------
+
+Timers are used to execute callbacks from different subsystems of QEMU
+at the specified moments of time. There are several kinds of timers:
+
+ * Real time clock. Based on host time and used only for callbacks that
+   do not change the virtual machine state. For this reason the real time
+   clock and its timers do not affect deterministic replay at all.
+ * Virtual clock. These timers run only during the emulation. In icount mode
+   the virtual clock value is calculated from the executed instruction counter.
+   That is why it is completely deterministic and does not have to be recorded.
+ * Host clock. This clock is used by device models that simulate real time
+   sources (e.g. real time clock chip). The host clock is one of the sources
+   of non-determinism. Host clock read operations should be logged to
+   make the execution deterministic.
+ * Virtual real time clock. This clock is similar to the real time clock but
+   it is used only for increasing the virtual clock while the virtual machine
+   is sleeping. Like the host clock, it is non-deterministic by nature
+   and has to be logged too.
+
+All virtual devices should use virtual clock for timers that change the guest
+state. Virtual clock is deterministic, therefore such timers are deterministic
+too.
+
+Virtual devices can also use realtime clock for the events that do not change
+the guest state directly. When the clock ticking should depend on VM execution
+speed, use virtual clock with EXTERNAL attribute. It is not deterministic,
+but its speed depends on the guest execution. This clock is used by
+the virtual devices (e.g., slirp routing device) that lie outside the
+replayed guest.
+
+Block devices
+-------------
+
+The block device record/replay module (``blkreplay``) intercepts calls to
+bdrv coroutine functions at the top of the block driver stack.
+
+All block completion operations are added to the queue in the coroutines.
+When the queue is flushed the information about processed requests
+is recorded to the log. In replay phase the queue is matched with
+events read from the log. Therefore block device requests are processed
+deterministically.
+
+Bottom halves
+-------------
+
+Bottom half callbacks that affect the guest state should be invoked through the
+``replay_bh_schedule_event`` or ``replay_bh_schedule_oneshot_event`` functions.
+Their invocations are saved in record mode and synchronized with the existing
+log in replay mode.
+
+Disk I/O events are completely deterministic in our model, because
+in both record and replay modes we start the virtual machine from the same
+disk state. But callbacks that the virtual disk controller uses for reading and
+writing the disk may occur at different moments of time in record and replay
+modes.
+
+Reading and writing requests are created by the CPU thread of QEMU. Later these
+requests proceed to the block layer, which creates "bottom halves". Bottom
+halves consist of a callback and its parameters. They are processed when
+the main loop locks the global mutex. These locks are not synchronized with
+the replaying process because the main loop also processes events that do not
+affect the virtual machine state (like user interaction with the monitor).
+
+That is why we had to implement saving and replaying bottom-half callbacks
+synchronously to the CPU execution. When the callback is about to execute
+it is added to the queue in the replay module. This queue is written to the
+log when its callbacks are executed. In replay mode callbacks are not processed
+until the corresponding event is read from the events log file.
+
+Sometimes the block layer uses asynchronous callbacks for its internal purposes
+(like reading or writing VM snapshots or disk image cluster tables). In this
+case bottom halves are not marked as "replayable" and are not saved
+into the log.
+
+Saving/restoring the VM state
+-----------------------------
+
+All fields in the device state structure (including virtual timers)
+should be restored by loadvm to the same values they had before savevm.
+
+Avoid accessing other devices' state, because the order of saving/restoring
+is not defined. It means that you should not call functions like
+``update_irq`` in ``post_load`` callback. Save everything explicitly to avoid
+the dependencies that may make restoring the VM state non-deterministic.
+
+Stopping the VM
+---------------
+
+Stopping the guest should not interfere with its state (with the exception
+of the network connections, that could be broken by the remote timeouts).
+VM can be stopped at any moment of replay by the user. Restarting the VM
+after that stop should not break the replay by the unneeded guest state change.
+
+Replay log format
+=================
+
+The record/replay log consists of a header and a sequence of execution
+events. The header includes a 4-byte replay version id and an 8-byte reserved
+field. The version is updated every time the replay log format changes to
+prevent using a replay log created by another build of QEMU.
+
+The sequence of the events describes virtual machine state changes.
+It includes all non-deterministic inputs of VM, synchronization marks and
+instruction counts used to correctly inject inputs at replay.
+
+Synchronization marks (checkpoints) are used for synchronizing QEMU threads
+that perform operations with virtual hardware. These operations may change
+the system's state (e.g., change some register or generate an interrupt) and
+therefore should execute synchronously with the CPU thread.
+
+Every event in the log includes a 1-byte event id and optional arguments.
+When an argument is an array, it is stored as a 4-byte array length
+followed by the corresponding number of data bytes.
+Here is the list of events that are written into the log:
+
+ - EVENT_INSTRUCTION. Instructions executed since last event. Followed by:
+
+   - 4-byte number of executed instructions.
+
+ - EVENT_INTERRUPT. Used to synchronize interrupt processing.
+ - EVENT_EXCEPTION. Used to synchronize exception handling.
+ - EVENT_ASYNC. This is a group of events. When such an event is generated,
+   it is stored in the queue and processed in icount_account_warp_timer().
+   Every such event has its own id from the following list:
+
+     - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes
+       callbacks that affect virtual machine state, but are normally called
+       asynchronously. Followed by:
+
+        - 8-byte operation id.
+
+     - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains
+       parameters of keyboard and mouse input operations
+       (key press/release, mouse pointer movement). Followed by:
+
+        - 9-16 bytes depending on the input event.
+
+     - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event.
+     - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input
+       initiated by the sender. Followed by:
+
+        - 1-byte character device id.
+        - Array with bytes that were read.
+
+     - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize
+       operations with disk and flash drives with CPU. Followed by:
+
+        - 8-byte operation id.
+
+     - REPLAY_ASYNC_EVENT_NET. Incoming network packet. Followed by:
+
+        - 1-byte network adapter id.
+        - 4-byte packet flags.
+        - Array with packet bytes.
+
+ - EVENT_SHUTDOWN. Occurs when the user sends a shutdown event to QEMU,
+   e.g., by closing the window.
+ - EVENT_CHAR_WRITE. Used to synchronize character output operations. Followed by:
+
+    - 4-byte output function return value.
+    - 4-byte offset in the output array.
+
+ - EVENT_CHAR_READ_ALL. Used to synchronize character input operations,
+   initiated by qemu. Followed by:
+
+    - Array with bytes that were read.
+
+ - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation,
+   initiated by qemu. Followed by:
+
+    - 4-byte error code.
+
+ - EVENT_CLOCK + clock_id. Group of events for host clock read operations. Followed by:
+
+    - 8-byte clock value.
+
+ - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of
+   CPU, internal threads, and asynchronous input events.
+ - EVENT_END. Last event in the log.
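Note on the replay log format documented in ``docs/devel/replay.rst`` above: the framing rules (a header with a 4-byte version id and an 8-byte reserved field, and array arguments stored as a 4-byte length followed by the data) can be illustrated with a minimal Python sketch. This is not QEMU's implementation. The byte order is an assumption (big-endian here), since the text does not state it, and the helper names are hypothetical.

```python
import io
import struct

# Assumption: multi-byte fields are big-endian; the documentation does not say.
HEADER = struct.Struct(">IQ")    # 4-byte version id + 8-byte reserved field
ARRAY_LEN = struct.Struct(">I")  # 4-byte array length prefix


def write_header(stream, version):
    """Write the log header with the reserved field set to zero."""
    stream.write(HEADER.pack(version, 0))


def read_header(stream):
    """Read the log header and return the version id."""
    version, _reserved = HEADER.unpack(stream.read(HEADER.size))
    return version


def write_array(stream, data):
    """Write an array argument: length prefix, then the raw bytes."""
    stream.write(ARRAY_LEN.pack(len(data)))
    stream.write(data)


def read_array(stream):
    """Read an array argument and check that it is not truncated."""
    (length,) = ARRAY_LEN.unpack(stream.read(ARRAY_LEN.size))
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("truncated array argument")
    return data
```

Writing a header and one array to an ``io.BytesIO`` buffer and reading them back in the same order should return the same version and bytes; a reader would typically compare the version against its own and refuse logs from a different format version, as the text describes.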
diff --git a/docs/devel/replay.txt b/docs/devel/replay.txt
deleted file mode 100644
index e641c35add..0000000000
--- a/docs/devel/replay.txt
+++ /dev/null
@@ -1,46 +0,0 @@
-Record/replay mechanism, that could be enabled through icount mode, expects
-the virtual devices to satisfy the following requirements.
-
-The main idea behind this document is that everything that affects
-the guest state during execution in icount mode should be deterministic.
-
-Timers
-======
-
-All virtual devices should use virtual clock for timers that change the guest
-state. Virtual clock is deterministic, therefore such timers are deterministic
-too.
-
-Virtual devices can also use realtime clock for the events that do not change
-the guest state directly. When the clock ticking should depend on VM execution
-speed, use virtual clock with EXTERNAL attribute. It is not deterministic,
-but its speed depends on the guest execution. This clock is used by
-the virtual devices (e.g., slirp routing device) that lie outside the
-replayed guest.
-
-Bottom halves
-=============
-
-Bottom half callbacks, that affect the guest state, should be invoked through
-replay_bh_schedule_event or replay_bh_schedule_oneshot_event functions.
-Their invocations are saved in record mode and synchronized with the existing
-log in replay mode.
-
-Saving/restoring the VM state
-=============================
-
-All fields in the device state structure (including virtual timers)
-should be restored by loadvm to the same values they had before savevm.
-
-Avoid accessing other devices' state, because the order of saving/restoring
-is not defined. It means that you should not call functions like
-'update_irq' in post_load callback. Save everything explicitly to avoid
-the dependencies that may make restoring the VM state non-deterministic.
-
-Stopping the VM
-===============
-
-Stopping the guest should not interfere with its state (with the exception
-of the network connections, that could be broken by the remote timeouts).
-VM can be stopped at any moment of replay by the user. Restarting the VM
-after that stop should not break the replay by the unneeded guest state change.
diff --git a/docs/devel/submitting-a-patch.rst b/docs/devel/submitting-a-patch.rst
index e51259eb9c..d3876ec1b7 100644
--- a/docs/devel/submitting-a-patch.rst
+++ b/docs/devel/submitting-a-patch.rst
@@ -204,23 +204,25 @@ log`` for these keywords for example usage.
 Test your patches
 ~~~~~~~~~~~~~~~~~
 
-Although QEMU has `continuous integration
-services <Testing#Continuous_Integration>`__ that attempt to test
-patches submitted to the list, it still saves everyone time if you have
-already tested that your patch compiles and works. Because QEMU is such
-a large project, it's okay to use configure arguments to limit what is
-built for faster turnaround during your development time; but it is
-still wise to also check that your patches work with a full build before
-submitting a series, especially if your changes might have an unintended
-effect on other areas of the code you don't normally experiment with.
-See `Testing <Testing>`__ for more details on what tests are available.
-Also, it is a wise idea to include a testsuite addition as part of your
-patches - either to ensure that future changes won't regress your new
-feature, or to add a test which exposes the bug that the rest of your
-series fixes. Keeping separate commits for the test and the fix allows
-reviewers to rebase the test to occur first to prove it catches the
-problem, then again to place it last in the series so that bisection
-doesn't land on a known-broken state.
+Although QEMU uses various :ref:`ci` services that attempt to test
+patches submitted to the list, it still saves everyone time if you
+have already tested that your patch compiles and works. Because QEMU
+is such a large project, the default configuration won't create a
+testing pipeline on GitLab when a branch is pushed. See the :ref:`CI
+variable documentation<ci_var>` for details on how to control the
+running of tests. It is still wise to also check that your patches
+work with a full build before submitting a series, especially if your
+changes might have an unintended effect on other areas of the code you
+don't normally experiment with. See :ref:`testing` for more details on
+what tests are available.
+
+Also, it is a wise idea to include a testsuite addition as part of
+your patches - either to ensure that future changes won't regress your
+new feature, or to add a test which exposes the bug that the rest of
+your series fixes. Keeping separate commits for the test and the fix
+allows reviewers to rebase the test to occur first to prove it catches
+the problem, then again to place it last in the series so that
+bisection doesn't land on a known-broken state.
 
 .. _submitting_your_patches:
 
diff --git a/docs/devel/testing.rst b/docs/devel/testing.rst
index 5b60a31807..3f6ebd5073 100644
--- a/docs/devel/testing.rst
+++ b/docs/devel/testing.rst
@@ -1,3 +1,5 @@
+.. _testing:
+
 Testing in QEMU
 ===============