summaryrefslogtreecommitdiffstats
path: root/hw/ppc/spapr.c
Commit message (Collapse)AuthorAgeFilesLines
...
| * ppc/spapr: Move GPRs setup to one placeAlexey Kardashevskiy2020-03-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the moment "pseries" starts in SLOF which only expects the FDT blob pointer in r3. As we are going to introduce a OpenFirmware support in QEMU, we will be booting OF clients directly and these expect a stack pointer in r1, Linux looks at r3/r4 for the initramdisk location (although vmlinux can find this from the device tree but zImage from distro kernels cannot). This extends spapr_cpu_set_entry_state() to take more registers. This should cause no behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20200310050733.29805-2-aik@ozlabs.ru> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
| * spapr: Clean up RMA size calculationDavid Gibson2020-03-171-24/+37
| | | | | | | | | | | | | | | | | | | | | | | | Move the calculation of the Real Mode Area (RMA) size into a helper function. While we're there clean it up and correct it in a few ways: * Add comments making it clearer where the various constraints come from * Remove a pointless check that the RMA fits within Node 0 (we've just clamped it so that it does) Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
| * spapr: Don't clamp RMA to 16GiB on new machine typesDavid Gibson2020-03-161-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In spapr_machine_init() we clamp the size of the RMA to 16GiB and the comment saying why doesn't make a whole lot of sense. In fact, this was done because the real mode handling code elsewhere limited the RMA in TCG mode to the maximum value configurable in LPCR[RMLS], 16GiB. But, * Actually LPCR[RMLS] has been able to encode a 256GiB size for a very long time, we just didn't implement it properly in the softmmu * LPCR[RMLS] shouldn't really be relevant anyway, it only was because we used to abuse the RMOR based translation mode in order to handle the fact that we're not modelling the hypervisor parts of the cpu We've now removed those limitations in the modelling so the 16GiB clamp no longer serves a function. However, we can't just remove the limit universally: that would break migration to earlier qemu versions, where the 16GiB RMLS limit still applies, no matter how bad the reasons for it are. So, we replace the 16GiB clamp, with a clamp to a limit defined in the machine type class. We set it to 16 GiB for machine types 4.2 and earlier, but set it to 0 meaning unlimited for the new 5.0 machine type. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
| * spapr: Don't attempt to clamp RMA to VRMA constraintDavid Gibson2020-03-161-18/+10Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Real Mode Area (RMA) is the part of memory which a guest can access when in real (MMU off) mode. Of course, for a guest under KVM, the MMU isn't really turned off, it's just in a special translation mode - Virtual Real Mode Area (VRMA) - which looks like real mode in guest mode. The mechanics of how this works when using the hash MMU (HPT) put a constraint on the size of the RMA, which depends on the size of the HPT. So, the latter part of spapr_setup_hpt_and_vrma() clamps the RMA we advertise to the guest based on this VRMA limit. There are several things wrong with this: 1) spapr_setup_hpt_and_vrma() doesn't actually clamp, it takes the minimum of Node 0 memory size and the VRMA limit. That will *often* work the same as clamping, but there can be other constraints on RMA size which supersede Node 0 memory size. We have real bugs caused by this (currently worked around in the guest kernel) 2) Some callers of spapr_setup_hpt_and_vrma() are in a situation where we're past the point that we can actually advertise an RMA limit to the guest 3) But most fundamentally, the VRMA limit depends on host configuration (page size) which shouldn't be visible to the guest, but this partially exposes it. This can cause problems with migration in certain edge cases, although we will mostly get away with it. In practice, this clamping is almost never applied anyway. With 64kiB pages and the normal rules for sizing of the HPT, the theoretical VRMA limit will be 4x(guest memory size) and so never hit. It will hit with 4kiB pages, where it will be (guest memory size)/4. However all mainstream distro kernels for POWER have used a 64kiB page size for at least 10 years. So, simply replace this logic with a check that the RMA we've calculated based only on guest visible configuration will fit within the host implied VRMA limit. This can break if running HPT guests on a host kernel with 4kiB page size. As noted that's very rare. There also exist several possible workarounds: * Change the host kernel to use 64kiB pages * Use radix MMU (RPT) guests instead of HPT * Use 64kiB hugepages on the host to back guest memory * Increase the guest memory size so that the RMA hits one of the fixed limits before the RMA limit. This is relatively easy on POWER8 which has a 16GiB limit, harder on POWER9 which has a 1TiB limit. * Use a guest NUMA configuration which artificially constrains the RMA within the VRMA limit (the RMA must always fit within Node 0). Previously, on KVM, we also temporarily reduced the rma_size to 256M so that the we'd load the kernel and initrd safely, regardless of the VRMA limit. This was a) confusing, b) could significantly limit the size of images we could load and c) introduced a behavioural difference between KVM and TCG. So we remove that as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Greg Kurz <groug@kaod.org>
| * spapr,ppc: Simplify signature of kvmppc_rma_size()David Gibson2020-03-161-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function calculates the maximum size of the RMA as implied by the host's page size of structure of the VRMA (there are a number of other constraints on the RMA size which will supersede this one in many circumstances). The current interface takes the current RMA size estimate, and clamps it to the VRMA derived size. The only current caller passes in an arguably wrong value (it will match the current RMA estimate in some but not all cases). We want to fix that, but for now just keep concerns separated by having the KVM helper function just return the VRMA derived limit, and let the caller combine it with other constraints. We call the new function kvmppc_vrma_limit() to more clearly indicate its limited responsibility. The helper should only ever be called in the KVM enabled case, so replace its !CONFIG_KVM stub with an assert() rather than a dummy value. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Cedric Le Goater <clg@fr.ibm.com> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
| * spapr: Don't use weird units for MIN_RMA_SLOFDavid Gibson2020-03-161-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MIN_RMA_SLOF records the minimum about of RMA that the SLOF firmware requires. It lets us give a meaningful error if the RMA ends up too small, rather than just letting SLOF crash. It's currently stored as a number of megabytes, which is strange for global constants. Move that megabyte scaling into the definition of the constant like most other things use. Change from M to MiB in the associated message while we're at it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
* | qom/object: Use common get/set uint helpersFelipe Franciosi2020-03-161-29/+7Star
|/ | | | | | | | | | | | | | | | | Several objects implemented their own uint property getters and setters, despite them being straightforward (without any checks/validations on the values themselves) and identical across objects. This makes use of an enhanced API for object_property_add_uintXX_ptr() which offers default setters. Some of these setters used to update the value even if the type visit failed (eg. because the value being set overflowed over the given type). The new setter introduces a check for these errors, not updating the value if an error occurred. The error is propagated. Signed-off-by: Felipe Franciosi <felipe@nutanix.com> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* hw: Make MachineClass::is_default a boolean typePhilippe Mathieu-Daudé2020-02-281-1/+1
| | | | | | | | | | | | There's no good reason for it to be type int, change it to bool. Suggested-by: Richard Henderson <richard.henderson@linaro.org> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20200207161948.15972-3-philmd@redhat.com> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com> Reviewed-by: Laurent Vivier <laurent@vivier.eu> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
* Merge tag 'patchew/20200219160953.13771-1-imammedo@redhat.com' of ↵Paolo Bonzini2020-02-251-5/+3Star
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://github.com/patchew-project/qemu into HEAD This series removes ad hoc RAM allocation API (memory_region_allocate_system_memory) and consolidates it around hostmem backend. It allows to * resolve conflicts between global -mem-prealloc and hostmem's "policy" option, fixing premature allocation before binding policy is applied * simplify complicated memory allocation routines which had to deal with 2 ways to allocate RAM. * reuse hostmem backends of a choice for main RAM without adding extra CLI options to duplicate hostmem features. A recent case was -mem-shared, to enable vhost-user on targets that don't support hostmem backends [1] (ex: s390) * move RAM allocation from individual boards into generic machine code and provide them with prepared MemoryRegion. * clean up deprecated NUMA features which were tied to the old API (see patches) - "numa: remove deprecated -mem-path fallback to anonymous RAM" - (POSTPONED, waiting on libvirt side) "forbid '-numa node,mem' for 5.0 and newer machine types" - (POSTPONED) "numa: remove deprecated implicit RAM distribution between nodes" Introduce a new machine.memory-backend property and wrapper code that aliases global -mem-path and -mem-alloc into automatically created hostmem backend properties (provided memory-backend was not set explicitly given by user). A bulk of trivial patches then follow to incrementally convert individual boards to using machine.memory-backend provided MemoryRegion. Board conversion typically involves: * providing MachineClass::default_ram_size and MachineClass::default_ram_id so generic code could create default backend if user didn't explicitly provide memory-backend or -m options * dropping memory_region_allocate_system_memory() call * using convenience MachineState::ram MemoryRegion, which points to MemoryRegion allocated by ram-memdev On top of that for some boards: * missing ram_size checks are added (typically it were boards with fixed ram size) * ram_size fixups are replaced by checks and hard errors, forcing user to provide correct "-m" values instead of ignoring it and continuing running. After all boards are converted, the old API is removed and memory allocation routines are cleaned up.
| * ppc/spapr: use memdev for RAMIgor Mammedov2020-02-191-5/+3Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | memory_region_allocate_system_memory() API is going away, so replace it with memdev allocated MemoryRegion. The later is initialized by generic code, so board only needs to opt in to memdev scheme by providing MachineClass::default_ram_id and using MachineState::ram instead of manually initializing RAM memory region. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Acked-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20200219160953.13771-68-imammedo@redhat.com>
* | spapr: Allow changing offset for -kernel imageAlexey Kardashevskiy2020-02-201-7/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | This allows moving the kernel in the guest memory. The option is useful for step debugging (as Linux is linked at 0x0); it also allows loading grub which is normally linked to run at 0x20000. This uses the existing kernel address by default. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20200203032943.121178-6-aik@ozlabs.ru> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* | spapr: Add NVDIMM device supportShivaprasad G Bhat2020-02-201-9/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for NVDIMM devices for sPAPR. Piggyback on existing nvdimm device interface in QEMU to support virtual NVDIMM devices for Power. Create the required DT entries for the device (some entries have dummy values right now). The patch creates the required DT node and sends a hotplug interrupt to the guest. Guest is expected to undertake the normal DR resource add path in response and start issuing PAPR SCM hcalls. The device support is verified based on the machine version unlike x86. This is how it can be used .. Ex : For coldplug, the device to be added in qemu command line as shown below -object memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/tmp/nvdimm0,share=yes,size=1073872896 -device nvdimm,label-size=128k,uuid=75a3cdd7-6a2f-4791-8d15-fe0a920e8e9e,memdev=memnvdimm0,id=nvdimm0,slot=0 For hotplug, the device to be added from monitor as below object_add memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/tmp/nvdimm0,share=yes,size=1073872896 device_add nvdimm,label-size=128k,uuid=75a3cdd7-6a2f-4791-8d15-fe0a920e8e9e,memdev=memnvdimm0,id=nvdimm0,slot=0 Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com> [Early implementation] Message-Id: <158131058078.2897.12767731856697459923.stgit@lep8c.aus.stglabs.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* | ppc: function to setup latest class optionsMichael S. Tsirkin2020-02-201-2/+7
|/ | | | | | | | | | | | We are going to add more init for the latest machine, so move the setup to a function so we don't have to change the DEFINE_SPAPR_MACHINE macro each time. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Message-Id: <20200207064628.1196095-1-mst@redhat.com> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* ppc: spapr: Activate the FWNMI functionalityAravinda Prasad2020-02-031-1/+2
| | | | | | | | | | This patch sets the default value of SPAPR_CAP_FWNMI_MCE to SPAPR_CAP_ON for machine type 5.0. Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com> Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> Message-Id: <20200130184423.20519-8-ganeshgr@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* migration: Include migration support for machine check handlingAravinda Prasad2020-02-031-0/+47
| | | | | | | | | | | | | | This patch includes migration support for machine check handling. Especially this patch blocks VM migration requests until the machine check error handling is complete as these errors are specific to the source hardware and is irrelevant on the target hardware. Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com> [Do not set FWNMI cap in post_load, now its done in .apply hook] Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> Message-Id: <20200130184423.20519-7-ganeshgr@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* target/ppc: Handle NMI guest exitAravinda Prasad2020-02-031-0/+8
| | | | | | | | | | | | | | | | | | | | | Memory error such as bit flips that cannot be corrected by hardware are passed on to the kernel for handling. If the memory address in error belongs to guest then the guest kernel is responsible for taking suitable action. Patch [1] enhances KVM to exit guest with exit reason set to KVM_EXIT_NMI in such cases. This patch handles KVM_EXIT_NMI exit. [1] https://www.spinics.net/lists/kvm-ppc/msg12637.html (e20bbd3d and related commits) Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com> Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Message-Id: <20200130184423.20519-4-ganeshgr@linux.ibm.com> [dwg: #ifdefs to fix compile for 32-bit target] Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* ppc: spapr: Introduce FWNMI capabilityAravinda Prasad2020-02-031-0/+2
| | | | | | | | | | | | | | | Introduce fwnmi an spapr capability and add a helper function which tries to enable it, which would be used by following patch of the series. This patch by itself does not change the existing behavior. Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com> [eliminate cap_ppc_fwnmi, add fwnmi cap to migration state and reprhase the commit message] Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-Id: <20200130184423.20519-3-ganeshgr@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* spapr: Enable DD2.3 accelerated count cache flush in pseries-5.0 machineDavid Gibson2020-02-031-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For POWER9 DD2.2 cpus, the best current Spectre v2 indirect branch mitigation is "count cache disabled", which is configured with: -machine cap-ibs=fixed-ccd However, this option isn't available on DD2.3 CPUs with KVM, because they don't have the count cache disabled. For POWER9 DD2.3 cpus, it is "count cache flush with assist", configured with: -machine cap-ibs=workaround,cap-ccf-assist=on However this option isn't available on DD2.2 CPUs with KVM, because they don't have the special CCF assist instruction this relies on. On current machine types, we default to "count cache flush w/o assist", that is: -machine cap-ibs=workaround,cap-ccf-assist=off This runs, with mitigation on both DD2.2 and DD2.3 host cpus, but has a fairly significant performance impact. It turns out we can do better. The special instruction that CCF assist uses to trigger a count cache flush is a no-op on earlier CPUs, rather than trapping or causing other badness. It doesn't, of itself, implement the mitigation, but *if* we have count-cache-disabled, then the count cache flush is unnecessary, and so using the count cache flush mitigation is harmless. Therefore for the new pseries-5.0 machine type, enable cap-ccf-assist by default. Along with that, suppress throwing an error if cap-ccf-assist is selected but KVM doesn't support it, as long as KVM *is* giving us count-cache-disabled. To allow TCG to work out of the box, even though it doesn't implement the ccf flush assist, downgrade the error in that case to a warning. This matches several Spectre mitigations where we allow TCG to operate for debugging, since we don't really make guarantees about TCG security properties anyway. While we're there, make the TCG warning for this case match that for other mitigations. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: Michael Ellerman <mpe@ellerman.id.au>
* hw/core/loader: Let load_elf() populate a field with CPU-specific flagsAleksandar Markovic2020-01-291-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While loading the executable, some platforms (like AVR) need to detect CPU type that executable is built for - and, with this patch, this is enabled by reading the field 'e_flags' of the ELF header of the executable in question. The change expands functionality of the following functions: - load_elf() - load_elf_as() - load_elf_ram() - load_elf_ram_sym() The argument added to these functions is called 'pflags' and is of type 'uint32_t*' (that matches 'pointer to 'elf_word'', 'elf_word' being the type of the field 'e_flags', in both 32-bit and 64-bit variants of ELF header). Callers are allowed to pass NULL as that argument, and in such case no lookup to the field 'e_flags' will happen, and no information will be returned, of course. CC: Richard Henderson <rth@twiddle.net> CC: Peter Maydell <peter.maydell@linaro.org> CC: Edgar E. Iglesias <edgar.iglesias@gmail.com> CC: Michael Walle <michael@walle.cc> CC: Thomas Huth <huth@tuxfamily.org> CC: Laurent Vivier <laurent@vivier.eu> CC: Philippe Mathieu-Daudé <f4bug@amsat.org> CC: Aleksandar Rikalo <aleksandar.rikalo@rt-rk.com> CC: Aurelien Jarno <aurelien@aurel32.net> CC: Jia Liu <proljc@gmail.com> CC: David Gibson <david@gibson.dropbear.id.au> CC: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> CC: BALATON Zoltan <balaton@eik.bme.hu> CC: Christian Borntraeger <borntraeger@de.ibm.com> CC: Thomas Huth <thuth@redhat.com> CC: Artyom Tarasenko <atar4qemu@gmail.com> CC: Fabien Chouteau <chouteau@adacore.com> CC: KONRAD Frederic <frederic.konrad@adacore.com> CC: Max Filippov <jcmvbkbc@gmail.com> Reviewed-by: Aleksandar Rikalo <aleksandar.rikalo@rt-rk.com> Signed-off-by: Michael Rolnik <mrolnik@gmail.com> Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Signed-off-by: Aleksandar Markovic <amarkovic@wavecomp.com> Message-Id: <1580079311-20447-24-git-send-email-aleksandar.markovic@rt-rk.com>
* migration: Define VMSTATE_INSTANCE_ID_ANYPeter Xu2020-01-201-1/+1
| | | | | | | | | | | Define the new macro VMSTATE_INSTANCE_ID_ANY for callers who wants to auto-generate the vmstate instance ID. Previously it was hard coded as -1 instead of this macro. It helps to change this default value in the follow up patches. No functional change. Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
* spapr/xive: remove redundant check in spapr_match_nvt()Cédric Le Goater2020-01-081-4/+4
| | | | | | | | | | | | | | | spapr_match_nvt() is a XIVE operation and is used by the machine to look for a matching target when an event notification is being delivered. An assert checks that spapr_match_nvt() is called only when the machine has selected the XIVE interrupt mode but it is redundant with the XIVE_PRESENTER() dynamic cast. Apply the cast to spapr->active_intc and remove the assert. Signed-off-by: Cédric Le Goater <clg@kaod.org> Message-Id: <20200106163207.4608-1-clg@kaod.org> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* spapr.c: remove 'out' label in spapr_dt_cas_updates()Daniel Henrique Barboza2020-01-081-6/+3Star
| | | | | | | | | | | 'out' can be replaced by 'return' with the appropriate return value. CC: David Gibson <david@gibson.dropbear.id.au> CC: qemu-ppc@nongnu.org Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20200106182425.20312-2-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* ppc/spapr: Support reboot of secure pseries guestBharata B Rao2020-01-081-0/+1
| | | | | | | | | | | | | | | | | | | A pseries guest can be run as a secure guest on Ultravisor-enabled POWER platforms. When such a secure guest is reset, we need to release/reset a few resources both on ultravisor and hypervisor side. This is achieved by invoking this new ioctl KVM_PPC_SVM_OFF from the machine reset path. As part of this ioctl, the secure guest is essentially transitioned back to normal mode so that it can reboot like a regular guest and become secure again. This ioctl has no effect when invoked for a normal guest. If this ioctl fails for a secure guest, the guest is terminated. Signed-off-by: Bharata B Rao <bharata@linux.ibm.com> Message-Id: <20191219031445.8949-3-bharata@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* spapr: Simplify ovec diffDavid Gibson2019-12-171-11/+3Star
| | | | | | | | | | | | | | | | spapr_ovec_diff(ov, old, new) has somewhat complex semantics. ov is set to those bits which are in new but not old, and it returns as a boolean whether or not there are any bits in old but not new. It turns out that both callers only care about the second, not the first. This is basically equivalent to a bitmap subset operation, which is easier to understand and implement. So replace spapr_ovec_diff() with spapr_ovec_subset(). Cc: Mike Roth <mdroth@linux.vnet.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Cedric Le Goater <clg@fr.ibm.com>
* spapr: Fold h_cas_compose_response() into h_client_architecture_support()David Gibson2019-12-171-60/+1Star
| | | | | | | | | | | | | | | | | | spapr_h_cas_compose_response() handles the last piece of the PAPR feature negotiation process invoked via the ibm,client-architecture-support OF call. Its only caller is h_client_architecture_support() which handles most of the rest of that process. I believe it was placed in a separate file originally to handle some fiddly dependencies between functions, but mostly it's just confusing to have the CAS process split into two pieces like this. Now that compose response is simplified (by just generating the whole device tree anew), it's cleaner to just fold it into h_client_architecture_support(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Cedric Le Goater <clg@fr.ibm.com> Reviewed-by: Greg Kurz <groug@kaod.org>
* spapr: Improve handling of fdt buffer sizeDavid Gibson2019-12-171-22/+11Star
| | | | | | | | | | | | | | | | | Previously, spapr_build_fdt() constructed the device tree in a fixed buffer of size FDT_MAX_SIZE. This is a bit inflexible, but more importantly it's awkward for the case where we use it during CAS. In that case the guest firmware supplies a buffer and we have to awkwardly check that what we generated fits into it afterwards, after doing a lot of size checks during spapr_build_fdt(). Simplify this by having spapr_build_fdt() take a 'space' parameter. For the CAS case, we pass in the buffer size provided by SLOF, for the machine init case, we continue to pass FDT_MAX_SIZE. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Cedric Le Goater <clg@fr.ibm.com> Reviewed-by: Greg Kurz <groug@kaod.org>
* ppc: well form kvmppc_hint_smt_possible error hint helperVladimir Sementsov-Ogievskiy2019-12-171-1/+1
| | | | | | | | | | | | Make kvmppc_hint_smt_possible hint append helper well formed: rename errp to errp_in, as it is IN-parameter here (which is unusual for errp), rename function to be kvmppc_error_append_*_hint. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com> Message-Id: <20191127191434.20945-1-vsementsov@virtuozzo.com> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* ppc/spapr: Implement the XiveFabric interfaceCédric Le Goater2019-12-171-0/+39
| | | | | | | | | | | | The CAM line matching sequence in the pseries machine does not change much apart from the use of the new QOM interfaces. There is an extra indirection because of the sPAPR IRQ backend of the machine. Only the XIVE backend implements the new 'match_nvt' handler. Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Cédric Le Goater <clg@kaod.org> Message-Id: <20191125065820.927-11-clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* hw: add compat machines for 5.0Cornelia Huck2019-12-141-1/+12
| | | | | | | | | | | | | Add 5.0 machine types for arm/i440fx/q35/s390x/spapr. For i440fx and q35, unversioned cpu models are still translated to -v1; I'll leave changing this (if desired) to the respective maintainers. Signed-off-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20191112104811.30323-1-cohuck@redhat.com> Acked-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
* virtio-blk: advertise F_WCE (F_FLUSH) if F_CONFIG_WCE is advertisedEvgeny Yakovlev2019-12-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Virtio spec 1.1 (and earlier), 5.2.5.2 Driver Requirements: Device Initialization: "Devices SHOULD always offer VIRTIO_BLK_F_FLUSH, and MUST offer it if they offer VIRTIO_BLK_F_CONFIG_WCE" Currently F_CONFIG_WCE and F_WCE are not connected to each other. Qemu will advertise F_CONFIG_WCE if config-wce argument is set for virtio-blk device. And F_WCE is advertised only if underlying block backend actually has it's caching enabled. Fix this by advertising F_WCE if F_CONFIG_WCE is also advertised. To preserve backwards compatibility with newer machine types make this behaviour governed by "x-enable-wce-if-config-wce" virtio-blk-device property and introduce hw_compat_4_2 with new property being off by default for all machine types <= 4.2 (but don't introduce 4.3 machine type itself yet). Signed-off-by: Evgeny Yakovlev <wrfsh@yandex-team.ru> Message-Id: <1572978137-189218-1-git-send-email-wrfsh@yandex-team.ru> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* spapr: Add /chosen to FDT only at reset time to preserve kernel and initramdiskAlexey Kardashevskiy2019-11-181-10/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | Since "spapr: Render full FDT on ibm,client-architecture-support" we build the entire flatten device tree (FDT) twice - at the reset time and when "ibm,client-architecture-support" (CAS) is called. The full FDT from CAS is then applied on top of the SLOF internal device tree. This is mostly ok, however there is a case when the QEMU is started with -initrd and for some reason the guest decided to move/unpack the init RAM disk image - the guest correctly notifies SLOF about the change but at CAS it is overridden with the QEMU initial location addresses and the guest may fail to boot if the original initrd memory was changed. This fixes the problem by only adding the /chosen node at the reset time to prevent the original QEMU's linux,initrd-start/linux,initrd-end to override the updated addresses. This only treats /chosen differently as we know there is a special case already and it is unlikely anything else will need to change /chosen at CAS we are better off not touching /chosen after we handed it over to SLOF. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20191024041308.5673-1-aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
* spapr: Don't request to unplug the same core twiceGreg Kurz2019-10-241-3/+4
| | | | | | | | | | We must not call spapr_drc_detach() on a detached DRC otherwise bad things can happen, ie. QEMU hangs or crashes. This is easily demonstrated with a CPU hotplug/unplug loop using QMP. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <157185826035.3073024.1664101000438499392.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* spapr: Move SpaprIrq::nr_xirqs to SpaprMachineClassDavid Gibson2019-10-241-0/+2
| | | | | | | | | | | | | | | | | For the benefit of peripheral device allocation, the number of available irqs really wants to be the same on a given machine type version, regardless of what irq backends we are using. That's the case now, but only because we make sure the different SpaprIrq instances have the same value except for the special legacy one. Since this really only depends on machine type version, move the value to SpaprMachineClass instead of SpaprIrq. This also puts the code to set it to the lower value on old machine types right next to setting legacy_irq_allocation, which needs to go hand in hand. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Cédric Le Goater <clg@kaod.org>
* spapr: Remove SpaprIrq::nr_msisDavid Gibson2019-10-241-3/+2Star
| | | | | | | | | | | The nr_msis value we use here has to line up with whether we're using legacy or modern irq allocation. Therefore it's safer to derive it based on legacy_irq_allocation rather than having SpaprIrq contain a canned value. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Cédric Le Goater <clg@kaod.org>
* spapr, xics, xive: Move dt_populate from SpaprIrq to SpaprInterruptControllerDavid Gibson2019-10-241-2/+1Star
| | | | | | | | | | | This method depends only on the active irq controller. Now that we've formalized the notion of active controller we can dispatch directly through that, rather than dispatching via SpaprIrq with the dual version having to do a second conditional dispatch. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Cédric Le Goater <clg@kaod.org>
* spapr, xics, xive: Move print_info from SpaprIrq to SpaprInterruptControllerDavid Gibson2019-10-241-1/+1
| | | | | | | | | | | This method depends only on the active irq controller. Now that we've formalized the notion of active controller we can dispatch directly through that, rather than dispatching via SpaprIrq with the dual version having to do a second conditional dispatch. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Cédric Le Goater <clg@kaod.org>
* spapr: Set VSMT to smp_threads by defaultGreg Kurz2019-10-241-1/+6
| | | | | | | | | | | | | | Support for setting VSMT is available in KVM since linux-4.13. Most distros that support KVM on POWER already have it. It thus seem reasonable enough to have the default machine to set VSMT to smp_threads. This brings contiguous VCPU ids and thus brings their upper bound down to the machine's max_cpus. This is especially useful for XIVE KVM devices, which may thus allocate only one VP descriptor per VCPU. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <157010411885.246126.12610015369068227139.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* numa: Introduce MachineClass::auto_enable_numa for implicit NUMA nodeTao Xu2019-10-151-8/+1Star
| | | | | | | | | | | | | Add MachineClass::auto_enable_numa field. When it is true, a NUMA node is expected to be created implicitly. Acked-by: David Gibson <david@gibson.dropbear.id.au> Suggested-by: Igor Mammedov <imammedo@redhat.com> Suggested-by: Eduardo Habkost <ehabkost@redhat.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Tao Xu <tao3.xu@intel.com> Message-Id: <20190905083238.1799-1-tao3.xu@intel.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
* spapr: Use less cryptic representation of which irq backends are supportedDavid Gibson2019-10-041-3/+12
| | | | | | | | | | | | | | | | SpaprIrq::ov5 stores the value for a particular byte in PAPR option vector 5 which indicates whether XICS, XIVE or both interrupt controllers are available. As usual for PAPR, the encoding is kind of overly complicated and confusing (though to be fair there are some backwards compat things it has to handle). But to make our internal code clearer, have SpaprIrq encode more directly which backends are available as two booleans, and derive the OV5 value from that at the point we need it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Greg Kurz <groug@kaod.org>
* spapr: Render full FDT on ibm,client-architecture-supportAlexey Kardashevskiy2019-10-041-79/+9Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ibm,client-architecture-support call is a way for the guest to negotiate capabilities with a hypervisor. It is implemented as: - the guest calls SLOF via client interface; - SLOF calls QEMU (H_CAS hypercall) with an options vector from the guest; - QEMU returns a device tree diff (which uses FDT format with an additional header before it); - SLOF walks through the partial diff tree and updates its internal tree with the values from the diff. This changes QEMU to simply re-render the entire tree and send it as an update. SLOF can handle this already mostly, [1] is needed before this can be applied. This stores the resulting tree in the spapr machine to have the latest valid FDT copy possible (this should not matter much as H_UPDATE_DT happens right after that but nevertheless). The benefit is reduced code size as there is no need for another set of DT rendering helpers such as spapr_fixup_cpu_dt(). The downside is that the updates are bigger now (as they include all nodes and properties) but the difference on a '-smp 256,threads=1' system before/after is 2.35s vs. 2.5s. [1] https://patchwork.ozlabs.org/patch/1152915/ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* spapr: Stop providing RTAS blobAlexey Kardashevskiy2019-10-041-30/+2Star
| | | | | | | | | | SLOF implements one itself so let's remove it from QEMU. It is one less image and simpler setup as the RTAS blob never stays in its initial place anyway as the guest OS always decides where to put it. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* spapr: Do not put empty properties for -kernel/-initrd/-appendAlexey Kardashevskiy2019-10-041-5/+10
| | | | | | | | | | | | | | | | | | We are going to use spapr_build_fdt() for the boot time FDT and as an update for SLOF during handling of H_CAS. SLOF will apply all properties from the QEMU's FDT which is usually ok unless there are properties changed by grub or guest kernel. The properties are: bootargs, linux,initrd-start, linux,initrd-end, linux,stdout-path, linux,rtas-base, linux,rtas-entry. Resetting those during CAS will most likely cause grub failure. Don't create such properties if we're booting without "-kernel" and "-initrd" so they won't get included into the DT update blob and therefore the guest is more likely to boot successfully. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [dwg: Tweaked commit message based on Greg Kurz's input] Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* spapr: Skip leading zeroes from memory@ DT node namesAlexey Kardashevskiy2019-10-041-1/+1
| | | | | | | | | | | | | | | The device tree build by QEMU at the machine reset time is used by SLOF to build its internal device tree but the node names are not preserved exactly so when QEMU provides a device tree update in response to H_CAS, it might become tricky to match a node from the update blob to the actual node in SLOF. This removed leading zeroes from "memory@" nodes and makes the DTC checker happy. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org>
* spapr: Fixes a leak in CASAlexey Kardashevskiy2019-10-041-0/+1
| | | | | | | | | | Add a missing g_free(fdt) if the resulting tree is bigger than the space allocated by SLOF. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Cédric Le Goater <clg@kaod.org>
* spapr: Move handling of special NVLink numa node from reset to initDavid Gibson2019-10-041-10/+11
| | | | | | | | | | | | | | | | The number of NUMA nodes in the system is fixed from the command line. Therefore, there's no need to recalculate it at reset time, and we can determine the special gpu_numa_id value used for NVLink2 devices at init time. This simplifies the reset path a bit which will make further improvements easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Greg Kurz <groug@kaod.org> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
* spapr: Simplify handling of pre ISA 3.0 guest workaround handlingDavid Gibson2019-10-041-6/+4Star
| | | | | | | | | | | | | | | | | | | | | | Certain old guest versions don't understand the radix MMU introduced with POWER ISA 3.0, but incorrectly select it if presented with the option at CAS time. We workaround this in qemu by explicitly excluding the radix (and other ISA 3.0 linked) options if the guest doesn't explicitly note support for ISA 3.0. This is handled by the 'cas_legacy_guest_workaround' flag, which is pretty vague. Rename it to 'cas_pre_isa3_guest' to be clearer about what it's for. In addition, we unnecessarily call spapr_populate_pa_features() with different options when initially constructing the device tree and when adjusting it at CAS time. At the initial construct time cas_pre_isa3_guest is already false, so we can still use the flag, rather than explicitly overriding it to be false at the callsite. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
* spapr: Report kvm_irqchip_in_kernel() in 'info pic'Greg Kurz2019-10-041-0/+4
| | | | | | | | | | | | Unless the machine was started with kernel-irqchip=on, we cannot easily tell if we're actually using an in-kernel or an emulated irqchip. This information is important enough that it is worth printing it in 'info pic'. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <156829860985.2073005.5893493824873412773.stgit@bahia.tls.ibm.com> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* pseries: do not allow memory-less/cpu-less NUMA nodeLaurent Vivier2019-10-041-0/+33
| | | | | | | | | | | | | | | | | | | | When we hotplug a CPU on memory-less/cpu-less node, the linux kernel crashes. This happens because linux kernel needs to know the NUMA topology at start to be able to initialize the distance lookup table. On pseries, the topology is provided by the firmware via the existing CPUs and memory information. Thus a node without memory and CPU cannot be discovered by the kernel. To avoid the kernel crash, do not allow to start pseries with empty nodes. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Message-Id: <20190830161345.22436-1-lvivier@redhat.com> [dwg: Rework to cope with movement of numa state from globals to MachineState] Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* migration: register_savevm_live doesn't need devDr. David Alan Gilbert2019-09-121-1/+1
| | | | | | | | | | | | Commit 78dd48df3 removed the last caller of register_savevm_live for an instantiable device (rather than a single system wide device); so trim out the parameter. Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Message-Id: <20190822115433.12070-1-dgilbert@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
* Merge remote-tracking branch ↵Peter Maydell2019-09-041-14/+15
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'remotes/ehabkost/tags/machine-next-pull-request' into staging Machine + x86 queue, 2019-09-03 Bug fixes: * Fix die-id validation regression (Eduardo Habkost) * vmmouse: Properly reset state (Jan Kiszka) * hostmem-file: fix pmem file size check (Stefan Hajnoczi) * Keep query-hotpluggable-cpus output compatible with older QEMU if '-smp dies' is not set (Igor Mammedov) * migration: Do not re-read the clock on pre_save in case of paused guest (Maxiwell S. Garcia) Cleanups: * NUMA code cleanups (Tao Xu) * Remove stale externs from includes (Alex Bennée) Features: * qapi: report the default CPU type for each machine (Daniel P. Berrangé) # gpg: Signature made Tue 03 Sep 2019 21:57:37 BST # gpg: using RSA key 5A322FD5ABC4D3DBACCFD1AA2807936F984DC5A6 # gpg: issuer "ehabkost@redhat.com" # gpg: Good signature from "Eduardo Habkost <ehabkost@redhat.com>" [full] # Primary key fingerprint: 5A32 2FD5 ABC4 D3DB ACCF D1AA 2807 936F 984D C5A6 * remotes/ehabkost/tags/machine-next-pull-request: migration: Do not re-read the clock on pre_save in case of paused guest x86: do not advertise die-id in query-hotpluggbale-cpus if '-smp dies' is not set i386/vmmouse: Properly reset state hostmem-file: fix pmem file size check qapi: report the default CPU type for each machine pc: Don't make die-id mandatory unless necessary pc: Improve error message when die-id is omitted pc: Fix error message on die-id validation numa: move numa global variable numa_info into MachineState numa: move numa global variable have_numa_distance into MachineState numa: move numa global variable nb_numa_nodes into MachineState hw/arm: simplify arm_load_dtb includes: remove stale [smp|max]_cpus externs Signed-off-by: Peter Maydell <peter.maydell@linaro.org>