summaryrefslogtreecommitdiffstats
path: root/hw/vfio/pci-quirks.c
Commit message (Collapse)AuthorAgeFilesLines
* hw/vfio/pci-quirks: Replace the word 'blacklist'Philippe Mathieu-Daudé2021-03-161-7/+7
| | | | | | | | | | | | | | | Follow the inclusive terminology from the "Conscious Language in your Open Source Projects" guidelines [*] and replace the word "blacklist" appropriately. [*] https://github.com/conscious-lang/conscious-lang-docs/blob/main/faq.md Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20210205171817.2108907-9-philmd@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio: add quirk device write methodPrasad J Pandit2021-02-081-0/+8
| | | | | | | | | | | | | Add vfio quirk device mmio write method to avoid NULL pointer dereference issue. Reported-by: Lei Sun <slei.casper@gmail.com> Reviewed-by: Li Qiang <liq3ea@gmail.com> Reviewed-by: Peter Maydell <peter.maydell@linaro.org> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Prasad J Pandit <pjp@fedoraproject.org> Message-Id: <20200811114133.672647-4-ppandit@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* qdev: Rename qdev_get_prop_ptr() to object_field_prop_ptr()Eduardo Habkost2020-12-181-2/+2
| | | | | | | | | | | | | The function will be moved to common QOM code, as it is not specific to TYPE_DEVICE anymore. Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Acked-by: Paul Durrant <paul@xen.org> Message-Id: <20201211220529.2290218-31-ehabkost@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
* qdev: Move dev->realized check to qdev_property_set()Eduardo Habkost2020-12-181-6/+0Star
| | | | | | | | | | | | | | Every single qdev property setter function manually checks dev->realized. We can just check dev->realized inside qdev_property_set() instead. Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Acked-by: Paul Durrant <paul@xen.org> Message-Id: <20201211220529.2290218-24-ehabkost@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
* qdev: Make qdev_get_prop_ptr() get Object* argEduardo Habkost2020-12-151-3/+2Star
| | | | | | | | | | | Make the code more generic and not specific to TYPE_DEVICE. Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> #s390 parts Acked-by: Paul Durrant <paul@xen.org> Message-Id: <20201211220529.2290218-10-ehabkost@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
* meson: infrastructure for building emulatorsPaolo Bonzini2020-08-211-1/+1
| | | | | Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* error: Eliminate error_propagate() with Coccinelle, part 1Markus Armbruster2020-07-101-3/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When all we do with an Error we receive into a local variable is propagating to somewhere else, we can just as well receive it there right away. Convert if (!foo(..., &err)) { ... error_propagate(errp, err); ... return ... } to if (!foo(..., errp)) { ... ... return ... } where nothing else needs @err. Coccinelle script: @rule1 forall@ identifier fun, err, errp, lbl; expression list args, args2; binary operator op; constant c1, c2; symbol false; @@ if ( ( - fun(args, &err, args2) + fun(args, errp, args2) | - !fun(args, &err, args2) + !fun(args, errp, args2) | - fun(args, &err, args2) op c1 + fun(args, errp, args2) op c1 ) ) { ... when != err when != lbl: when strict - error_propagate(errp, err); ... when != err ( return; | return c2; | return false; ) } @rule2 forall@ identifier fun, err, errp, lbl; expression list args, args2; expression var; binary operator op; constant c1, c2; symbol false; @@ - var = fun(args, &err, args2); + var = fun(args, errp, args2); ... when != err if ( ( var | !var | var op c1 ) ) { ... when != err when != lbl: when strict - error_propagate(errp, err); ... when != err ( return; | return c2; | return false; | return var; ) } @depends on rule1 || rule2@ identifier err; @@ - Error *err = NULL; ... when != err Not exactly elegant, I'm afraid. The "when != lbl:" is necessary to avoid transforming if (fun(args, &err)) { goto out } ... out: error_propagate(errp, err); even though other paths to label out still need the error_propagate(). For an actual example, see sclp_realize(). Without the "when strict", Coccinelle transforms vfio_msix_setup(), incorrectly. I don't know what exactly "when strict" does, only that it helps here. The match of return is narrower than what I want, but I can't figure out how to express "return where the operand doesn't use @err". For an example where it's too narrow, see vfio_intx_enable(). Silently fails to convert hw/arm/armsse.c, because Coccinelle gets confused by ARMSSE being used both as typedef and function-like macro there. Converted manually. Line breaks tidied up manually. One nested declaration of @local_err deleted manually. Preexisting unwanted blank line dropped in hw/riscv/sifive_e.c. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-Id: <20200707160613.848843-35-armbru@redhat.com>
* qapi: Use returned bool to check for failure, Coccinelle partMarkus Armbruster2020-07-101-2/+1Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous commit enables conversion of visit_foo(..., &err); if (err) { ... } to if (!visit_foo(..., errp)) { ... } for visitor functions that now return true / false on success / error. Coccinelle script: @@ identifier fun =~ "check_list|input_type_enum|lv_start_struct|lv_type_bool|lv_type_int64|lv_type_str|lv_type_uint64|output_type_enum|parse_type_bool|parse_type_int64|parse_type_null|parse_type_number|parse_type_size|parse_type_str|parse_type_uint64|print_type_bool|print_type_int64|print_type_null|print_type_number|print_type_size|print_type_str|print_type_uint64|qapi_clone_start_alternate|qapi_clone_start_list|qapi_clone_start_struct|qapi_clone_type_bool|qapi_clone_type_int64|qapi_clone_type_null|qapi_clone_type_number|qapi_clone_type_str|qapi_clone_type_uint64|qapi_dealloc_start_list|qapi_dealloc_start_struct|qapi_dealloc_type_anything|qapi_dealloc_type_bool|qapi_dealloc_type_int64|qapi_dealloc_type_null|qapi_dealloc_type_number|qapi_dealloc_type_str|qapi_dealloc_type_uint64|qobject_input_check_list|qobject_input_check_struct|qobject_input_start_alternate|qobject_input_start_list|qobject_input_start_struct|qobject_input_type_any|qobject_input_type_bool|qobject_input_type_bool_keyval|qobject_input_type_int64|qobject_input_type_int64_keyval|qobject_input_type_null|qobject_input_type_number|qobject_input_type_number_keyval|qobject_input_type_size_keyval|qobject_input_type_str|qobject_input_type_str_keyval|qobject_input_type_uint64|qobject_input_type_uint64_keyval|qobject_output_start_list|qobject_output_start_struct|qobject_output_type_any|qobject_output_type_bool|qobject_output_type_int64|qobject_output_type_null|qobject_output_type_number|qobject_output_type_str|qobject_output_type_uint64|start_list|visit_check_list|visit_check_struct|visit_start_alternate|visit_start_list|visit_start_struct|visit_type_.*"; expression list args; typedef Error; Error *err; @@ - fun(args, &err); - if (err) + if (!fun(args, &err)) { ... } A few line breaks tidied up manually. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-Id: <20200707160613.848843-19-armbru@redhat.com>
* hw/vfio/pci-quirks: Fix broken legacy IGD passthroughThomas Huth2020-06-111-0/+1
| | | | | | | | | | | | | The #ifdef CONFIG_VFIO_IGD in pci-quirks.c is not working since the required header config-devices.h is not included, so that the legacy IGD passthrough is currently broken. Let's include the right header to fix this issue. Buglink: https://bugs.launchpad.net/qemu/+bug/1882784 Fixes: 29d62771c81d ("hw/vfio: Move the IGD quirk code to a separate file") Signed-off-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* hw/vfio: Add VMD Passthrough QuirkJon Derrick2020-06-111-12/+72
| | | | | | | | | | | | | | | | | | | | | | | | | | The VMD endpoint provides a real PCIe domain to the guest, including bridges and endpoints. Because the VMD domain is enumerated by the guest kernel, the guest kernel will assign Guest Physical Addresses to the downstream endpoint BARs and bridge windows. When the guest kernel performs MMIO to VMD sub-devices, MMU will translate from the guest address space to the physical address space. Because the bridges have been programmed with guest addresses, the bridges will reject the transaction containing physical addresses. VMD device 28C0 natively assists passthrough by providing the Host Physical Address in shadow registers accessible to the guest for bridge window assignment. The shadow registers are valid if bit 1 is set in VMD VMLOCK config register 0x70. In order to support existing VMDs, this quirk provides the shadow registers in a vendor-specific PCI capability to the vfio-passthrough device for all VMD device ids which don't natively assist with passthrough. The Linux VMD driver is updated to check for this new vendor-specific capability. Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/nvlink: Remove exec permission to avoid SELinux AVCsLeonardo Bras2020-05-271-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | If SELinux is setup without 'execmem' permission for qemu, all mmap with (PROT_WRITE | PROT_EXEC) will fail and print a warning in SELinux log. If "nvlink2-mr" memory allocation fails (fist diff), it will cause guest NUMA nodes to not be correctly configured (V100 memory will not be visible for guest, nor its NUMA nodes). Not having 'execmem' permission is intesting for virtual machines to avoid buffer-overflow based attacks, and it's adopted in distros like RHEL. So, removing the PROT_EXEC flag seems the right thing to do. Browsing some other code that mmaps memory for usage with memory_region_init_ram_device_ptr, I could notice it's usual to not have PROT_EXEC (only PROT_READ | PROT_WRITE), so it should be no problem around this. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Message-Id: <20200501055448.286518-1-leobras.c@gmail.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* qom: Drop parameter @errp of object_property_add() & friendsMarkus Armbruster2020-05-151-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only way object_property_add() can fail is when a property with the same name already exists. Since our property names are all hardcoded, failure is a programming error, and the appropriate way to handle it is passing &error_abort. Same for its variants, except for object_property_add_child(), which additionally fails when the child already has a parent. Parentage is also under program control, so this is a programming error, too. We have a bit over 500 callers. Almost half of them pass &error_abort, slightly fewer ignore errors, one test case handles errors, and the remaining few callers pass them to their own callers. The previous few commits demonstrated once again that ignoring programming errors is a bad idea. Of the few ones that pass on errors, several violate the Error API. The Error ** argument must be NULL, &error_abort, &error_fatal, or a pointer to a variable containing NULL. Passing an argument of the latter kind twice without clearing it in between is wrong: if the first call sets an error, it no longer points to NULL for the second call. ich9_pm_add_properties(), sparc32_ledma_realize(), sparc32_dma_realize(), xilinx_axidma_realize(), xilinx_enet_realize() are wrong that way. When the one appropriate choice of argument is &error_abort, letting users pick the argument is a bad idea. Drop parameter @errp and assert the preconditions instead. There's one exception to "duplicate property name is a programming error": the way object_property_add() implements the magic (and undocumented) "automatic arrayification". Don't drop @errp there. Instead, rename object_property_add() to object_property_try_add(), and add the obvious wrapper object_property_add(). Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200505152926.18877-15-armbru@redhat.com> [Two semantic rebase conflicts resolved]
* hw/vfio: Move the IGD quirk code to a separate fileThomas Huth2020-02-061-611/+3Star
| | | | | | | | | | | | | | | | | | | The IGD quirk code defines a separate device, the so-called "vfio-pci-igd-lpc-bridge" which shows up as a user-creatable device in all QEMU binaries that include the vfio code. This is a little bit unfortunate for two reasons: First, this device is completely useless in binaries like qemu-system-s390x. Second we also would like to disable it in downstream RHEL which currently requires some extra patches there since the device does not have a proper Kconfig-style switch yet. So it would be good if the device could be disabled more easily, thus let's move the code to a separate file instead and introduce a proper Kconfig switch for it which gets only enabled by default if we also have CONFIG_PC_PCI enabled. Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* memory: Access MemoryRegion with endiannessTony Nguyen2019-09-031-2/+3
| | | | | | | | | | | | | | | | | | | Preparation for collapsing the two byte swaps adjust_endianness and handle_bswap into the former. Call memory_region_dispatch_{read|write} with endianness encoded into the "MemOp op" operand. This patch does not change any behaviour as memory_region_dispatch_{read|write} is yet to handle the endianness. Once it does handle endianness, callers with byte swaps can collapse them into adjust_endianness. Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Tony Nguyen <tony.nguyen@bt.com> Message-Id: <8066ab3eb037c0388dfadfe53c5118429dd1de3a.1566466906.git.tony.nguyen@bt.com> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
* hw/vfio: Access MemoryRegion with MemOpTony Nguyen2019-09-031-2/+4
| | | | | | | | | | | | | | | | | | | The memory_region_dispatch_{read|write} operand "unsigned size" is being converted into a "MemOp op". Convert interfaces by using no-op size_memop. After all interfaces are converted, size_memop will be implemented and the memory_region_dispatch_{read|write} operand "unsigned size" will be converted into a "MemOp op". As size_memop is a no-op, this patch does not change any behaviour. Signed-off-by: Tony Nguyen <tony.nguyen@bt.com> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <e70ff5814ac3656974180db6375397c43b0bc8b8.1566466906.git.tony.nguyen@bt.com> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
* Include hw/qdev-properties.h lessMarkus Armbruster2019-08-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | In my "build everything" tree, changing hw/qdev-properties.h triggers a recompile of some 2700 out of 6600 objects (not counting tests and objects that don't depend on qemu/osdep.h). Many places including hw/qdev-properties.h (directly or via hw/qdev.h) actually need only hw/qdev-core.h. Include hw/qdev-core.h there instead. hw/qdev.h is actually pointless: all it does is include hw/qdev-core.h and hw/qdev-properties.h, which in turn includes hw/qdev-core.h. Replace the remaining uses of hw/qdev.h by hw/qdev-properties.h. While there, delete a few superfluous inclusions of hw/qdev-core.h. Touching hw/qdev-properties.h now recompiles some 1200 objects. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Daniel P. Berrangé" <berrange@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com> Message-Id: <20190812052359.30071-22-armbru@redhat.com>
* Include hw/hw.h exactly where neededMarkus Armbruster2019-08-161-0/+1
| | | | | | | | | | | | | | | | In my "build everything" tree, changing hw/hw.h triggers a recompile of some 2600 out of 6600 objects (not counting tests and objects that don't depend on qemu/osdep.h). The previous commits have left only the declaration of hw_error() in hw/hw.h. This permits dropping most of its inclusions. Touching it now recompiles less than 200 objects. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Alistair Francis <alistair.francis@wdc.com> Message-Id: <20190812052359.30071-19-armbru@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
* Include qemu/module.h where needed, drop it from qemu-common.hMarkus Armbruster2019-06-121-0/+1
| | | | | | | | | Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190523143508.25387-4-armbru@redhat.com> [Rebased with conflicts resolved automatically, except for hw/usb/dev-hub.c hw/misc/exynos4210_rng.c hw/misc/bcm2835_rng.c hw/misc/aspeed_scu.c hw/display/virtio-vga.c hw/arm/stm32f205_soc.c; ui/cocoa.m fixed up]
* spapr: Support NVIDIA V100 GPU with NVLink2Alexey Kardashevskiy2019-04-261-0/+131
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver implements special regions for such GPUs and emulates an NVLink bridge. NVLink2-enabled POWER9 CPUs also provide address translation services which includes an ATS shootdown (ATSD) register exported via the NVLink bridge device. This adds a quirk to VFIO to map the GPU memory and create an MR; the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses this to get the MR and map it to the system address space. Another quirk does the same for ATSD. This adds additional steps to sPAPR PHB setup: 1. Search for specific GPUs and NPUs, collect findings in sPAPRPHBState::nvgpus, manage system address space mappings; 2. Add device-specific properties such as "ibm,npu", "ibm,gpu", "memory-block", "link-speed" to advertise the NVLink2 function to the guest; 3. Add "mmio-atsd" to vPHB to advertise the ATSD capability; 4. Add new memory blocks (with extra "linux,memory-usable" to prevent the guest OS from accessing the new memory until it is onlined) and npuphb# nodes representing an NPU unit for every vPHB as the GPU driver uses it for link discovery. This allocates space for GPU RAM and ATSD like we do for MMIOs by adding 2 new parameters to the phb_placement() hook. Older machine types set these to zero. This puts new memory nodes in a separate NUMA node to as the GPU RAM needs to be configured equally distant from any other node in the system. Unlike the host setup which assigns numa ids from 255 downwards, this adds new NUMA nodes after the user configures nodes or from 1 if none were configured. This adds requirement similar to EEH - one IOMMU group per vPHB. The reason for this is that ATSD registers belong to a physical NPU so they cannot invalidate translations on GPUs attached to another NPU. It is guaranteed by the host platform as it does not mix NVLink bridges or GPUs from different NPU in the same IOMMU group. If more than one IOMMU group is detected on a vPHB, this disables ATSD support for that vPHB and prints a warning. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for vfio portions] Acked-by: Alex Williamson <alex.williamson@redhat.com> Message-Id: <20190312082103.130561-1-aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
* pci: Move NVIDIA vendor id to the rest of idsAlexey Kardashevskiy2019-02-221-2/+0Star
| | | | | | | | | | | | sPAPR code will use it too so move it from VFIO to the common code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Alistair Francis <alistair.francis@wdc.com> Message-Id: <20190214051440.59167-1-aik@ozlabs.ru> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* vfio: Clean up error reporting after previous commitMarkus Armbruster2018-10-191-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | The previous commit changed vfio's warning messages from vfio warning: DEV-NAME: Could not frobnicate to warning: vfio DEV-NAME: Could not frobnicate To match this change, change error messages from vfio error: DEV-NAME: On fire to vfio DEV-NAME: On fire Note the loss of "error". If we think marking error messages that way is a good idea, we should mark *all* error messages, i.e. make error_report() print it. Cc: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Message-Id: <20181017082702.5581-7-armbru@redhat.com>
* hw/vfio: Use the IEC binary prefix definitionsPhilippe Mathieu-Daudé2018-07-021-4/+5
| | | | | | | | | | | | | | It eases code review, unit is explicit. Patch generated using: $ git grep -E '(1024|2048|4096|8192|(<<|>>).?(10|20|30))' hw/ include/hw/ and modified manually. Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Message-Id: <20180625124238.25339-38-f4bug@amsat.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* vfio/quirks: Enable ioeventfd quirks to be handled by vfio directlyAlex Williamson2018-06-051-7/+46
| | | | | | | | | | | | | With vfio ioeventfd support, we can program vfio-pci to perform a specified BAR write when an eventfd is triggered. This allows the KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding userspace handling for these events. On the same micro-benchmark where the ioeventfd got us to almost 90% of performance versus disabling the GeForce quirks, this gets us to within 95%. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/quirks: ioeventfd quirk accelerationAlex Williamson2018-06-051-2/+164
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The NVIDIA BAR0 quirks virtualize the PCI config space mirrors found in device MMIO space. Normally PCI config space is considered a slow path and further optimization is unnecessary, however NVIDIA uses a register here to enable the MSI interrupt to re-trigger. Exiting to QEMU for this MSI-ACK handling can therefore rate limit our interrupt handling. Fortunately the MSI-ACK write is easily detected since the quirk MemoryRegion otherwise has very few accesses, so simply looking for consecutive writes with the same data is sufficient, in this case 10 consecutive writes with the same data and size is arbitrarily chosen. We configure the KVM ioeventfd with data match, so there's no risk of triggering for the wrong data or size, but we do risk that pathological driver behavior might consume all of QEMU's file descriptors, so we cap ourselves to 10 ioeventfds for this purpose. In support of the above, generic ioeventfd infrastructure is added for vfio quirks. This automatically initializes an ioeventfd list per quirk, disables and frees ioeventfds on exit, and allows ioeventfds marked as dynamic to be dropped on device reset. The rationale for this latter feature is that useful ioeventfds may depend on specific driver behavior and since we necessarily place a cap on our use of ioeventfds, a machine reset is a reasonable point at which to assume a new driver and re-profile. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/quirks: Add quirk reset callbackAlex Williamson2018-06-051-0/+15
| | | | | | | | | Quirks can be self modifying, provide a hook to allow them to cleanup on device reset if desired. Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/quirks: Add common quirk alloc helperAlex Williamson2018-06-051-27/+21Star
| | | | | | | | This will later be used to include list initialization. Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/pci: Add option to disable GeForce quirksAlex Williamson2018-02-061-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | These quirks are necessary for GeForce, but not for Quadro/GRID/Tesla assignment. Leaving them enabled is fully functional and provides the most compatibility, but due to the unique NVIDIA MSI ACK behavior[1], it also introduces latency in re-triggering the MSI interrupt. This overhead is typically negligible, but has been shown to adversely affect some (very) high interrupt rate applications. This adds the vfio-pci device option "x-no-geforce-quirks=" which can be set to "on" to disable this additional overhead. A follow-on optimization for GeForce might be to make use of an ioeventfd to allow KVM to trigger an irqfd in the kernel vfio-pci driver, avoiding the bounce through userspace to handle this device write. [1] Background: the NVIDIA driver has been observed to issue a write to the MMIO mirror of PCI config space in BAR0 in order to allow the MSI interrupt for the device to retrigger. Older reports indicated a write of 0xff to the (read-only) MSI capability ID register, while more recently a write of 0x0 is observed at config space offset 0x704, non-architected, extended config space of the device (BAR0 offset 0x88704). Virtualization of this range is only required for GeForce. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* pci: Add INTERFACE_CONVENTIONAL_PCI_DEVICE to Conventional PCI devicesEduardo Habkost2017-10-151-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add INTERFACE_CONVENTIONAL_PCI_DEVICE to all direct subtypes of TYPE_PCI_DEVICE, except: 1) The ones that already have INTERFACE_PCIE_DEVICE set: * base-xhci * e1000e * nvme * pvscsi * vfio-pci * virtio-pci * vmxnet3 2) base-pci-bridge Not all PCI bridges are Conventional PCI devices, so INTERFACE_CONVENTIONAL_PCI_DEVICE is added only to the subtypes that are actually Conventional PCI: * dec-21154-p2p-bridge * i82801b11-bridge * pbm-bridge * pci-bridge The direct subtypes of base-pci-bridge not touched by this patch are: * xilinx-pcie-root: Already marked as PCIe-only. * pcie-pci-bridge: Already marked as PCIe-only. * pcie-port: all non-abstract subtypes of pcie-port are already marked as PCIe-only devices. 3) megasas-base Not all megasas devices are Conventional PCI devices, so the interface names are added to the subclasses registered by megasas_register_types(), according to information in the megasas_devices[] array. "megasas-gen2" already implements INTERFACE_PCIE_DEVICE, so add INTERFACE_CONVENTIONAL_PCI_DEVICE only to "megasas". Acked-by: Alberto Garcia <berto@igalia.com> Acked-by: John Snow <jsnow@redhat.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Marcel Apfelbaum <marcel@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* vfio/pci: Add NVIDIA GPUDirect Cliques supportAlex Williamson2017-10-031-0/+110
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NVIDIA has defined a specification for creating GPUDirect "cliques", where devices with the same clique ID support direct peer-to-peer DMA. When running on bare-metal, tools like NVIDIA's p2pBandwidthLatencyTest (part of cuda-samples) determine which GPUs can support peer-to-peer based on chipset and topology. When running in a VM, these tools have no visibility to the physical hardware support or topology. This option allows the user to specify hints via a vendor defined capability. For instance: <qemu:commandline> <qemu:arg value='-set'/> <qemu:arg value='device.hostdev0.x-nv-gpudirect-clique=0'/> <qemu:arg value='-set'/> <qemu:arg value='device.hostdev1.x-nv-gpudirect-clique=1'/> <qemu:arg value='-set'/> <qemu:arg value='device.hostdev2.x-nv-gpudirect-clique=1'/> </qemu:commandline> This enables two cliques. The first is a singleton clique with ID 0, for the first hostdev defined in the XML (note that since cliques define peer-to-peer sets, singleton clique offer no benefit). The subsequent two hostdevs are both added to clique ID 1, indicating peer-to-peer is possible between these devices. QEMU only provides validation that the clique ID is valid and applied to an NVIDIA graphics device, any validation that the resulting cliques are functional and valid is the user's responsibility. The NVIDIA specification allows a 4-bit clique ID, thus valid values are 0-15. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/pci: Add virtual capabilities quirk infrastructureAlex Williamson2017-10-031-0/+4
| | | | | | | | | | If the hypervisor needs to add purely virtual capabilties, give us a hook through quirks to do that. Note that we determine the maximum size for a capability based on the physical device, if we insert a virtual capability, that can change. Therefore if maximum size is smaller after added virt capabilities, use that. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/pci-quirks: Exclude non-ioport BAR from NVIDIA quirkAlex Williamson2017-04-071-1/+1
| | | | | | | | | The NVIDIA BAR5 quirk is targeting an ioport BAR. Some older devices have a BAR5 which is not ioport and can induce a segfault here. Test the BAR type to skip these devices. Link: https://bugs.launchpad.net/qemu/+bug/1678466 Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* Revert "vfio/pci-quirks.c: Disable stolen memory for igd VFIO"Xiong Zhang2017-03-311-38/+27Star
| | | | | | | | | | | | | | | This reverts commit c2b2e158cc7b1cb431bd6039824ec13c3184a775. The original patch intend to prevent linux i915 driver from using stolen meory. But this patch breaks windows IGD driver loading on Gen9+, as IGD HW will use stolen memory on Gen9+, once windows IGD driver see zero size stolen memory, it will unload. Meanwhile stolen memory will be disabled in 915 when i915 run as a guest. Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com> [aw: Gen9+ is SkyLake and newer] Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/pci-quirks.c: Disable stolen memory for igd VFIOXiongZhang2017-02-221-27/+38
| | | | | | | | | | | | | | | | | | Regardless of running in UPT or legacy mode, the guest igd drivers may attempt to use stolen memory, however only legacy mode has BIOS support for reserving stolen memmory in the guest VM. We zero out the stolen memory size in all cases, then guest igd driver won't use stolen memory. In legacy mode, user could use x-igd-gms option to specify the amount of stolen memory which will be pre-allocated and reserved by bios for igd use. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99028 https://bugs.freedesktop.org/show_bug.cgi?id=99025 Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* hw/vfio/pci-quirks: Set category of the "vfio-pci-igd-lpc-bridge" deviceThomas Huth2017-02-101-0/+1
| | | | | | | | The device has "bridge" in its name, so it should obviously be in the category DEVICE_CATEGORY_BRIDGE. Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio-pci: Fix GTT wrap-around for Skylake+ IGDAlex Williamson2017-02-101-1/+4
| | | | | | | | | | | | | | | | | | Previous IGD, up through Broadwell, only seem to write GTT values into the first 1MB of space allocated for the BDSM, but clearly the GTT can be multiple MB in size. Our test in vfio_igd_quirk_data_write() correctly filters out indexes beyond 1MB, but given the 1MB mask we're using, we re-apply writes only to the first 1MB of the guest allocated BDSM. We can't assume either the host or guest BDSM is naturally aligned, so we can't simply apply a different mask. Instead, save the host BDSM and do the arithmetic to subtract the host value to get the BDSM offset and add it to the guest allocated BDSM. Reported-by: Alexander Indenbaum <alexander.indenbaum@gmail.com> Tested-by: Alexander Indenbaum <alexander.indenbaum@gmail.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* hw: Fix typos found by codespellStefan Weil2017-01-241-1/+1
| | | | | | Signed-off-by: Stefan Weil <sw@weilnetz.de> Acked-by: Alistair Francis <alistair.francis@xilinx.com> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
* vfio/pci: Fix vfio_rtl8168_quirk_data_read address offsetThorsten Kohfeldt2016-10-171-1/+1
| | | | | | | | | | | | | Introductory comment for rtl8168 VFIO MSI-X quirk states: At BAR2 offset 0x70 there is a dword data register, offset 0x74 is a dword address register. vfio: vfio_bar_read(0000:05:00.0:BAR2+0x70, 4) = 0xfee00398 // read data Thus, correct offset for data read is 0x70, but function vfio_rtl8168_quirk_data_read() wrongfully uses offset 0x74. Signed-off-by: Thorsten Kohfeldt <thorsten.kohfeldt@gmx.de> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/pci: Pass an error object to vfio_pci_igd_opregion_initEric Auger2016-10-171-5/+5
| | | | | | | | | | Pass an error object to prepare for migration to VFIO-PCI realize. In vfio_probe_igd_bar4_quirk, simply report the error. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/pci: Pass an error object to vfio_populate_vgaEric Auger2016-10-171-1/+3
| | | | | | | | | | | | | | | Pass an error object to prepare for the same operation in vfio_populate_device. Eventually this contributes to the migration to VFIO-PCI realize. We now report an error on vfio_get_region_info failure. vfio_probe_igd_bar4_quirk is not involved in the migration to realize and simply calls error_reportf_err. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/pci: Fix VGA quirksAlex Williamson2016-06-301-4/+4
| | | | | | | | | | | | | | | Commit 2d82f8a3cdb2 ("vfio/pci: Convert all MemoryRegion to dynamic alloc and consistent functions") converted VFIOPCIDevice.vga to be dynamically allocted, negating the need for VFIOPCIDevice.has_vga. Unfortunately not all of the has_vga users were converted, nor was the field removed from the structure. Correct these oversights. Reported-by: Peter Maloney <peter.maloney@brockmann-consult.de> Tested-by: Peter Maloney <peter.maloney@brockmann-consult.de> Fixes: 2d82f8a3cdb2 ("vfio/pci: Convert all MemoryRegion to dynamic alloc and consistent functions") Fixes: https://bugs.launchpad.net/qemu/+bug/1591628 Cc: qemu-stable@nongnu.org Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/pci: Add a separate option for IGD OpRegion supportAlex Williamson2016-05-261-2/+2
| | | | | | | | | | | | | | The IGD OpRegion is enabled automatically when running in legacy mode, but it can sometimes be useful in universal passthrough mode as well. Without an OpRegion, output spigots don't work, and even though Intel doesn't officially support physical outputs in UPT mode, it's a useful feature. Note that if an OpRegion is enabled but a monitor is not connected, some graphics features will be disabled in the guest versus a headless system without an OpRegion, where they would work. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Gerd Hoffmann <kraxel@redhat.com> Tested-by: Gerd Hoffmann <kraxel@redhat.com>
* vfio/pci: Intel graphics legacy mode assignmentAlex Williamson2016-05-261-1/+642
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable quirks to support SandyBridge and newer IGD devices as primary VM graphics. This requires new vfio-pci device specific regions added in kernel v4.6 to expose the IGD OpRegion, the shadow ROM, and config space access to the PCI host bridge and LPC/ISA bridge. VM firmware support, SeaBIOS only so far, is also required for reserving memory regions for IGD specific use. In order to enable this mode, IGD must be assigned to the VM at PCI bus address 00:02.0, it must have a ROM, it must be able to enable VGA, it must have or be able to create on its own an LPC/ISA bridge of the proper type at PCI bus address 00:1f.0 (sorry, not compatible with Q35 yet), and it must have the above noted vfio-pci kernel features and BIOS. The intention is that to enable this mode, a user simply needs to assign 00:02.0 from the host to 00:02.0 in the VM: -device vfio-pci,host=0000:00:02.0,bus=pci.0,addr=02.0 and everything either happens automatically or it doesn't. In the case that it doesn't, we leave error reports, but assume the device will operate in universal passthrough mode (UPT), which doesn't require any of this, but has a much more narrow window of supported devices, supported use cases, and supported guest drivers. When using IGD in this mode, the VM firmware is required to reserve some VM RAM for the OpRegion (on the order or several 4k pages) and stolen memory for the GTT (up to 8MB for the latest GPUs). An additional option, x-igd-gms allows the user to specify some amount of additional memory (value is number of 32MB chunks up to 512MB) that is pre-allocated for graphics use. TBH, I don't know of anything that requires this or makes use of this memory, which is why we don't allocate any by default, but the specification suggests this is not actually a valid combination, so the option exists as a workaround. Please report if it's actually necessary in some environment. See code comments for further discussion about the actual operation of the quirks necessary to assign these devices. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Gerd Hoffmann <kraxel@redhat.com> Tested-by: Gerd Hoffmann <kraxel@redhat.com>
* vfio/pci: Convert all MemoryRegion to dynamic alloc and consistent functionsAlex Williamson2016-03-111-19/+19
| | | | | | | | Match common vfio code with setup, exit, and finalize functions for BAR, quirk, and VGA management. VGA is also changed to dynamic allocation to match the other MemoryRegions. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio: Generalize region supportAlex Williamson2016-03-111-12/+12
| | | | | | | | | | | | | | | | | Both platform and PCI vfio drivers create a "slow", I/O memory region with one or more mmap memory regions overlayed when supported by the device. Generalize this to a set of common helpers in the core that pulls the region info from vfio, fills the region data, configures slow mapping, and adds helpers for comleting the mmap, enable/disable, and teardown. This can be immediately used by the PCI MSI-X code, which needs to mmap around the MSI-X vector table. This also changes VFIORegion.mem to be dynamically allocated because otherwise we don't know how the caller has allocated VFIORegion and therefore don't know whether to unreference it to destroy the MemoryRegion or not. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* hw/vfio: Clean up includesPeter Maydell2016-01-291-0/+1
| | | | | | | | | | Clean up includes so that osdep.h is included first and headers which it implies are not included manually. This commit was created with scripts/clean-includes. Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Message-id: 1453832250-766-22-git-send-email-peter.maydell@linaro.org
* vfio/pci-quirks: Only quirk to size of PCI config spaceAlex Williamson2016-01-191-3/+3
| | | | | | | | | | | | | | | | | | | | | For quirks that support the full PCIe extended config space, limit the quirk to only the size of config space available through vfio. This allows host systems with broken MMCONFIG regions to still make use of these quirks without generating bad address faults trying to access beyond the end of config space exposed through vfio. This may expose direct access to the mirror of extended config space, only trapping the sub-range of standard config space, but allowing this makes the quirk, and thus the device, functional. We expect that only device specific accesses make use of the mirror, not general extended PCI capability accesses, so any virtualization in this space is likely unnecessary anyway, and the device is still IOMMU isolated, so it should only be able to hurt itself through any bogus configurations enabled by this space. Link: https://www.redhat.com/archives/vfio-users/2015-November/msg00192.html Reported-by: Ronnie Swanink <ronnie@ronnieswanink.nl> Reviewed-by: Laszlo Ersek <lersek@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio: Use g_new() & friends where that makes obvious senseMarkus Armbruster2015-11-101-8/+8
| | | | | | | | | | | | | g_new(T, n) is neater than g_malloc(sizeof(T) * n). It's also safer, for two reasons. One, it catches multiplication overflowing size_t. Two, it returns T * rather than void *, which lets the compiler catch more type errors. This commit only touches allocations with size arguments of the form sizeof(T). Same Coccinelle semantic patch as in commit b45c03f. Signed-off-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/pci: Remove use of g_malloc0_n() from quirksAlex Williamson2015-09-241-8/+8
| | | | | | | For compatibility with glib 2.22. Reported-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/pci: Add emulated PCI IDsAlex Williamson2015-09-231-2/+0Star
| | | | | | | | | | | | | | | | | | | | | | | Specifying an emulated PCI vendor/device ID can be useful for testing various quirk paths, even though the behavior and functionality of the device with bogus IDs is fully unsupportable. We need to use a uint32_t for the vendor/device IDs, even though the registers themselves are only 16-bit in order to be able to determine whether the value is valid and user set. The same support is added for subsystem vendor/device ID, though these have the possibility of being useful and supported for more than a testing tool. An emulated platform might want to impose their own subsystem IDs or at least hide the physical subsystem ID. Windows guests will often reinstall drivers due to a change in subsystem IDs, something that VM users may want to avoid. Of course careful attention would be required to ensure that guest drivers do not rely on the subsystem ID as a basis for device driver quirks. All of these options are added using the standard experimental option prefix and should not be considered stable. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/pci: Cache vendor and device IDAlex Williamson2015-09-231-14/+4Star
| | | | | | | Simplify access to commonly referenced PCI vendor and device ID by caching it on the VFIOPCIDevice struct. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>