path: root/block
Commit message | Author | Age | Files | Lines
...
* block-backend: allow blk_prw from coroutine context | Paolo Bonzini | 2017-02-21 | 1 | -4/+8
    qcow2_create2 calls this. Do not run a nested event loop, as that breaks when aio_co_wake tries to queue the coroutine on the co_queue_wakeup list of the currently running one.

    Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Reviewed-by: Fam Zheng <famz@redhat.com>
    Message-id: 20170213135235.12274-4-pbonzini@redhat.com
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* block: move AioContext, QEMUTimer, main-loop to libqemuutil | Paolo Bonzini | 2017-02-21 | 1 | -29/+0
    AioContext is fairly self contained, the only dependency is QEMUTimer but that in turn doesn't need anything else. So move them out of block-obj-y to avoid introducing a dependency from io/ to block-obj-y.

    main-loop and its dependency iohandler also need to be moved, because later in this series io/ will call iohandler_get_aio_context.

    [Changed copyright "the QEMU team" to "other QEMU contributors" as suggested by Daniel Berrange and agreed by Paolo. --Stefan]

    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Reviewed-by: Fam Zheng <famz@redhat.com>
    Message-id: 20170213135235.12274-2-pbonzini@redhat.com
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* qcow2: Optimize the refcount-block overlap check | Alberto Garcia | 2017-02-12 | 3 | -1/+25
    The metadata overlap checks introduced in a40f1c2add help detect corruption in the qcow2 image by verifying that data writes don't overlap with existing metadata sections.

    The 'refcount-block' check in particular iterates over the refcount table in order to get the addresses of all refcount blocks and check that none of them overlap with the region where we want to write.

    The problem with the refcount table is that since it always occupies complete clusters its size is usually very big. With the default values of cluster_size=64KB and refcount_bits=16 this table holds 8192 entries, each one of them enough to map 2GB worth of host clusters.

    So unless we're using images with several TB of allocated data this table is going to be mostly empty, and iterating over it is a waste of CPU. If the storage backend is fast enough this can have an effect on I/O performance.

    This patch keeps the index of the last used (i.e. non-zero) entry in the refcount table and updates it every time the table changes. The refcount-block overlap check then uses that index instead of reading the whole table.

    In my tests with a 4GB qcow2 file stored in RAM this doubles the amount of write IOPS.

    Signed-off-by: Alberto Garcia <berto@igalia.com>
    Message-id: 20170201123828.4815-1-berto@igalia.com
    Reviewed-by: Max Reitz <mreitz@redhat.com>
    Signed-off-by: Max Reitz <mreitz@redhat.com>
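The figures quoted in the message above (8192 entries, 2GB per entry) can be checked with a few lines of arithmetic. The sketch below is an illustrative model in Python, not QEMU code: the function names are hypothetical, and it assumes the 8-byte refcount table entries of the qcow2 on-disk format.

```python
# Illustrative arithmetic only; the names here are hypothetical, not QEMU's.
CLUSTER_SIZE = 64 * 1024     # default cluster_size, in bytes
REFCOUNT_BITS = 16           # default refcount_bits
REFTABLE_ENTRY_SIZE = 8      # each refcount table entry is a 64-bit offset


def reftable_entries_per_cluster(cluster_size=CLUSTER_SIZE):
    """Number of refcount table entries that fit in one cluster."""
    return cluster_size // REFTABLE_ENTRY_SIZE


def bytes_mapped_per_entry(cluster_size=CLUSTER_SIZE, refcount_bits=REFCOUNT_BITS):
    """Host bytes covered by the refcount block behind one table entry."""
    refcounts_per_block = cluster_size * 8 // refcount_bits
    return refcounts_per_block * cluster_size


def last_used_index(reftable):
    """Index of the last non-zero entry, or -1 if the table is empty."""
    for i in range(len(reftable) - 1, -1, -1):
        if reftable[i] != 0:
            return i
    return -1


# A 4GB image needs only two refcount blocks, so a scan bounded by the
# last used index visits 2 entries instead of 8192.
table = [0x10000, 0x20000] + [0] * (reftable_entries_per_cluster() - 2)
entries_to_scan = last_used_index(table) + 1
```

With these defaults the overlap check for a 4GB image would look at two entries rather than the whole table, which is the saving the commit describes.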
* block/nfs: fix naming of runtime opts | Peter Lieven | 2017-02-12 | 1 | -23/+23
    commit 94d6a7a accidentally left the naming of runtime opts and QAPI schema inconsistent. As one consequence passing of parameters in the URI is broken. Sync the naming of the runtime opts to the QAPI schema.

    Please note that this is technically backwards incompatible with the 2.8 release, but the 2.8 release is the only version that had the wrong naming. Furthermore release 2.8 suffered from a NULL pointer dereference during URI parsing.

    Fixes: 94d6a7a76e9df9919629428f6c598e2b97d9426c
    Cc: qemu-stable@nongnu.org
    Signed-off-by: Peter Lieven <pl@kamp.de>
    Message-id: 1485942829-10756-3-git-send-email-pl@kamp.de
    [mreitz: Fixed commit message]
    Reviewed-by: Eric Blake <eblake@redhat.com>
    Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/nfs: fix NULL pointer dereference in URI parsing | Peter Lieven | 2017-02-12 | 1 | -1/+2
    parse_uint_full wants to put the parsed value into the variable passed via its second argument which is NULL.

    Fixes: 94d6a7a76e9df9919629428f6c598e2b97d9426c
    Cc: qemu-stable@nongnu.org
    Signed-off-by: Peter Lieven <pl@kamp.de>
    Reviewed-by: Eric Blake <eblake@redhat.com>
    Message-id: 1485942829-10756-2-git-send-email-pl@kamp.de
    Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/qapi: reduce the execution time of qmp_query_blockstats | Dou Liyang | 2017-02-12 | 1 | -44/+29
    In order to reduce the execution time, this patch optimizes qmp_query_blockstats():
    Remove the next_query_bds function.
    Remove the bdrv_query_stats function.
    Remove some conditional checks.

    The original qmp_query_blockstats calls next_query_bds to get the next object in each loop iteration. next_query_bds checks query_nodes and blk. It also calls bdrv_query_stats to get the stats, and bdrv_query_stats checks blk and bs each time. This wastes time and may stall the main loop a bit. If there are many disks and the dataplane feature is not used, this may affect the performance of the main loop thread.

    This patch removes these two functions and makes the structure clearer.

    Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
    Message-id: 1484467275-27919-3-git-send-email-douly.fnst@cn.fujitsu.com
    Reviewed-by: Markus Armbruster <armbru@redhat.com>
    [mreitz: Removed duplicate info->value assignment]
    Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/qapi: reduce the coupling between the bdrv_query_stats and bdrv_query_bds_stats | Dou Liyang | 2017-02-12 | 1 | -12/+14
    The bdrv_query_stats and bdrv_query_bds_stats functions need to call each other, which increases the coupling. It also makes the program complicated and performs some unnecessary tests.

    Remove the call from bdrv_query_bds_stats to bdrv_query_stats and use some recursion instead to make it clearer.

    Avoid testing whether the blk is NULL while querying the bds stats. It is unnecessary.

    Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
    Message-id: 1484467275-27919-2-git-send-email-douly.fnst@cn.fujitsu.com
    Reviewed-by: Markus Armbruster <armbru@redhat.com>
    Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/vmdk: Fix the endian problem of buf_len and lba | QingFeng Hao | 2017-02-12 | 1 | -2/+2
    The problem was triggered by qemu-iotests case 055. It failed when it was comparing the compressed vmdk image with original test.img.

    The cause is that buf_len in vmdk_write_extent wasn't converted to little-endian before it was stored to disk. But later vmdk_read_extent read it and converted it from little-endian to cpu endian. If the cpu is big-endian like s390, the problem will happen and the data length read by vmdk_read_extent will become invalid!

    The fix is to add the conversion in vmdk_write_extent, meanwhile, repair the endianness problem of lba field which shall also be converted to little-endian before storing to disk.

    Cc: qemu-stable@nongnu.org
    Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>
    Signed-off-by: Jing Liu <liujbjl@linux.vnet.ibm.com>
    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    Reviewed-by: Fam Zheng <famz@redhat.com>
    Message-id: 20161216052040.53067-2-haoqf@linux.vnet.ibm.com
    Signed-off-by: Max Reitz <mreitz@redhat.com>
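The mismatch described above can be reproduced in miniature. The following Python sketch simulates a big-endian host with the struct module; the helper names are hypothetical and the 32-bit field width is an assumption made for illustration, so this is a model of the bug class rather than the vmdk code itself.

```python
import struct


def write_native_big_endian(length):
    """Simulate a big-endian host storing a 32-bit length without conversion."""
    return struct.pack(">I", length)


def write_little_endian(length):
    """Store the length in the on-disk (little-endian) byte order."""
    return struct.pack("<I", length)


def read_little_endian(raw):
    """Reader side: always interprets the on-disk field as little-endian."""
    return struct.unpack("<I", raw)[0]


# Without the conversion, a length of 0x1234 comes back byte-swapped.
broken = read_little_endian(write_native_big_endian(0x1234))
# With the conversion on the write side, the round trip is exact.
fixed = read_little_endian(write_little_endian(0x1234))
```

On a little-endian host both writers produce the same bytes, which is why the problem only showed up on machines such as s390.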
* qapi: Tweak error message of bdrv_query_image_info | Fam Zheng | 2017-02-12 | 1 | -2/+2
    @bs doesn't always have a device name, such as when it comes from "qemu-img info". Report file name instead.

    Signed-off-by: Fam Zheng <famz@redhat.com>
    Message-id: 20170119130759.28319-2-famz@redhat.com
    Reviewed-by: Eric Blake <eblake@redhat.com>
    Signed-off-by: Max Reitz <mreitz@redhat.com>
* Merge remote-tracking branch 'remotes/stefanha/tags/tracing-pull-request' into staging | Peter Maydell | 2017-02-02 | 1 | -2/+0
|\
    # gpg: Signature made Wed 01 Feb 2017 13:44:32 GMT
    # gpg: using RSA key 0x9CA4ABB381AB73C8
    # gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>"
    # gpg: aka "Stefan Hajnoczi <stefanha@gmail.com>"
    # Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35 775A 9CA4 ABB3 81AB 73C8

    * remotes/stefanha/tags/tracing-pull-request:
      trace: clean up trace-events files
      qapi: add missing trace_visit_type_enum() call
      trace: improve error reporting when parsing simpletrace header
      trace: update docs to reflect new code generation approach
      trace: switch to modular code generation for sub-directories
      trace: move setting of group name into Makefiles
      trace: move hw/i386/xen events to correct subdir
      trace: move hw/xen events to correct subdir
      trace: move hw/block/dataplane events to correct subdir
      make: move top level dir to end of include search path

    # Conflicts:
    # Makefile

    Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
| * trace: clean up trace-events files | Stefan Hajnoczi | 2017-01-31 | 1 | -2/+0
      There are a number of unused trace events that scripts/cleanup-trace-events.pl finds.

      The "hw/vfio/pci-quirks.c" filename was typoed and "qapi/qapi-visit-core.c" was missing the qapi/ directory prefix.

      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Reviewed-by: Eric Blake <eblake@redhat.com>
      Message-id: 20170126171613.1399-3-stefanha@redhat.com
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* | sheepdog: reorganize check for overlapping requests | Paolo Bonzini | 2017-02-01 | 1 | -36/+30
      Wrap the code that was copied repeatedly in the two functions, sd_aio_setup and sd_aio_complete.

      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-id: 20161129113245.32724-6-pbonzini@redhat.com
      Signed-off-by: Jeff Cody <jcody@redhat.com>
* | sheepdog: simplify inflight_aio_head management | Paolo Bonzini | 2017-02-01 | 1 | -17/+6
      Add to the list in add_aio_request and, indirectly, resend_aioreq. Inline free_aio_req in the caller, it does not simply undo alloc_aio_req's job.

      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-id: 20161129113245.32724-5-pbonzini@redhat.com
      Signed-off-by: Jeff Cody <jcody@redhat.com>
* | sheepdog: do not use BlockAIOCB | Paolo Bonzini | 2017-02-01 | 1 | -60/+39
      Sheepdog's AIOCB are completely internal entities for a group of requests and do not need dynamic allocation.

      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-id: 20161129113245.32724-4-pbonzini@redhat.com
      Signed-off-by: Jeff Cody <jcody@redhat.com>
* | sheepdog: reorganize coroutine flow | Paolo Bonzini | 2017-02-01 | 1 | -73/+42
      Delimit co_recv's lifetime clearly in aio_read_response.

      Do a simple qemu_coroutine_enter in aio_read_response, letting sd_co_writev call sd_write_done.

      Handle nr_pending in the same way in sd_co_rw_vector, sd_write_done and sd_co_flush_to_disk.

      Remove sd_co_rw_vector's return value; just leave with no pending requests.

      [Jeff: added missing 'return' back, spotted by Paolo after series was applied.]

      Signed-off-by: Jeff Cody <jcody@redhat.com>
* | sheepdog: remove unused cancellation support | Paolo Bonzini | 2017-02-01 | 1 | -52/+0
|/
      SheepdogAIOCB is internal to sheepdog.c, hence it is never canceled.

      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-id: 20161129113245.32724-2-pbonzini@redhat.com
      Signed-off-by: Jeff Cody <jcody@redhat.com>
* block/iscsi: statically link qemu_iscsi_opts | Peter Lieven | 2017-01-27 | 2 | -0/+70
    commit f57b4b5f moved qemu_iscsi_opts into vl.c. This made them invisible for qemu-img, qemu-nbd etc.

    Fixes: f57b4b5fb127b60e1aade2684a8b16bc4f630b29
    Cc: qemu-stable@nongnu.org
    Signed-off-by: Peter Lieven <pl@kamp.de>
    Message-Id: <1485262161-18543-1-git-send-email-pl@kamp.de>
    [Drop useless #ifdef. - Paolo]
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* block: get max_transfer limit for char (scsi-generic) devices | Eric Farman | 2017-01-27 | 1 | -1/+1
    We can get the maximum number of bytes for a single I/O transfer from the BLKSECTGET ioctl, but we only perform this for block devices. scsi-generic devices are represented as character devices, and so do not issue this today. Update this, so that virtio-scsi devices using the scsi-generic interface can return the same data.

    Signed-off-by: Eric Farman <farman@linux.vnet.ibm.com>
    Message-Id: <20170120162527.66075-4-farman@linux.vnet.ibm.com>
    Reviewed-by: Fam Zheng <famz@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* block: Fix target variable of BLKSECTGET ioctl | Eric Farman | 2017-01-27 | 1 | -7/+10
    Commit 6f6071745bd0 ("raw-posix: Fetch max sectors for host block device") introduced a routine to call the kernel BLKSECTGET ioctl, which stores the result back to user space. However, the size of the data returned depends on the routine handling the ioctl. The (compat_)blkdev_ioctl returns a short, while sg_ioctl returns an int. Thus, on big-endian systems, we can find ourselves accidentally shifting the result to a much larger value. (On s390x, a short is 16 bits while an int is 32 bits.)

    Also, the two ioctl handlers return values in different scales (block returns sectors, while sg returns bytes), so some tweaking of the outputs is required such that hdev_get_max_transfer_length returns a value in a consistent set of units.

    Signed-off-by: Eric Farman <farman@linux.vnet.ibm.com>
    Message-Id: <20170120162527.66075-3-farman@linux.vnet.ibm.com>
    Reviewed-by: Fam Zheng <famz@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
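The "accidental shift" mentioned above follows from a 16-bit store landing in the most significant half of a 32-bit variable on a big-endian machine. The sketch below models that in Python with the struct module; the function name is hypothetical, and the zero-initialised 4-byte buffer stands in for the caller's int variable.

```python
import struct


def read_back_as_int(value, byteorder):
    """Model a handler that stores a 16-bit value into the start of a
    zeroed 32-bit variable, which the caller then reads as a full int."""
    prefix = "<" if byteorder == "little" else ">"
    buf = bytearray(4)                            # caller's int, initially 0
    buf[0:2] = struct.pack(prefix + "H", value)   # handler writes 2 bytes only
    return struct.unpack(prefix + "I", bytes(buf))[0]


max_sectors = 2560
on_little_endian = read_back_as_int(max_sectors, "little")
on_big_endian = read_back_as_int(max_sectors, "big")
```

On the little-endian model the value survives unchanged, while the big-endian model yields the value shifted left by 16 bits, matching the symptom the commit describes for s390x.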
* block/iscsi: avoid data corruption with cache=writeback | Peter Lieven | 2017-01-27 | 1 | -2/+6
    nb_cls_shrunk in iscsi_allocmap_update can become -1 if the request starts and ends within the same cluster. This results in passing -1 to bitmap_set and bitmap_clear and they don't handle negative values properly. In the end this leads to data corruption.

    Fixes: e1123a3b40a1a9a625a29c8ed4debb7e206ea690
    Cc: qemu-stable@nongnu.org
    Signed-off-by: Peter Lieven <pl@kamp.de>
    Message-Id: <1484579832-18589-1-git-send-email-pl@kamp.de>
    Reviewed-by: Fam Zheng <famz@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
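A small model shows how a "shrunk" cluster count can go negative. This Python sketch assumes the count is computed as (end rounded down) minus (start rounded up), which is one natural way to find the clusters fully covered by a request; the function names and the guard are illustrative, not the actual iscsi.c change.

```python
def shrunk_cluster_range(offset, length, cluster_size):
    """Clusters fully covered by [offset, offset + length).

    Returns (first_cluster, count); count can be -1 when the request
    starts and ends inside the same cluster."""
    first = -(-offset // cluster_size)                  # start rounded up
    count = (offset + length) // cluster_size - first   # end rounded down
    return first, count


def clear_fully_covered(bitmap, offset, length, cluster_size):
    """Clear bits only when at least one cluster is fully covered."""
    first, count = shrunk_cluster_range(offset, length, cluster_size)
    if count > 0:                                       # guard against -1
        for i in range(first, first + count):
            bitmap[i] = False
    return bitmap


inside_one_cluster = shrunk_cluster_range(1, 2, 8)
spanning = shrunk_cluster_range(4, 20, 8)
```

A request covering bytes 1 to 2 of an 8-byte cluster gives a start cluster of 1 and an end cluster of 0, hence the count of -1 that must never reach a bitmap helper.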
* migration: disallow migrate_add_blocker during migration | Ashijeet Acharya | 2017-01-24 | 6 | -19/+53
    If a migration is already in progress and somebody attempts to add a migration blocker, this should rightly fail.

    Add an errp parameter and a retcode return value to migrate_add_blocker.

    Signed-off-by: John Snow <jsnow@redhat.com>
    Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
    Message-Id: <1484566314-3987-5-git-send-email-ashijeetacharya@gmail.com>
    Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
    Acked-by: Greg Kurz <groug@kaod.org>
    Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
    Merged with recent 'Allow invtsc migration' change
* block/vvfat: Remove the undesirable comment | Ashijeet Acharya | 2017-01-24 | 1 | -1/+0
    Remove the "// assert(is_consistent(s))" comment in block/vvfat.c

    Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
    Message-Id: <1484566314-3987-2-git-send-email-ashijeetacharya@gmail.com>
    Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
    Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
* block: get rid of bdrv_io_unplugged_begin/end | Paolo Bonzini | 2017-01-16 | 1 | -39/+2
    bdrv_io_plug and bdrv_io_unplug are only called (via their BlockBackend equivalents) after starting asynchronous I/O. bdrv_drain is not going to be called while they are running, because, even if a coroutine runs for some reason, it will only drain in the next iteration of the event loop through bdrv_co_yield_to_drain.

    So this mechanism is unnecessary, get rid of it.

    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    Message-id: 20161129113334.605-1-pbonzini@redhat.com
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* block: Rename raw-{posix,win32} to file-*.c | Eric Blake | 2017-01-09 | 5 | -6/+6
    These files deal with the file protocol, not the raw format (the file protocol is often used with other formats, and the raw format is not forced to use the file protocol). Rename things to make it a bit easier to follow.

    Suggested-by: Daniel P. Berrange <berrange@redhat.com>
    Signed-off-by: Eric Blake <eblake@redhat.com>
    Reviewed-by: John Snow <jsnow@redhat.com>
    Reviewed-by: Laszlo Ersek <lersek@redhat.com>
    Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: Rename raw_bsd to raw-format.c | Eric Blake | 2017-01-09 | 2 | -2/+2
    Given that we have raw-win32.c and raw-posix.c, my initial guess at raw_bsd.c was that it was for dealing with raw files using code specific to the BSD operating system (beyond what raw-posix could do). Not so - this name was chosen back in commit e1c66c6 to distinguish that it was a BSD licensed file, in contrast to the then-existing raw.c with an unclear and potentially unusable license. But since it has been more than three years since the rewrite, it's time to pick a more useful name for this file to avoid this type of confusion to future contributors that don't know the backstory, as none of our other files are named solely by the license they use.

    In reality, this file deals with the raw format, which is useful with any number of protocols, while raw-{win32,posix} deal with the file protocol (and in turn, that protocol is not limited to use with the raw format). So rename raw_bsd to raw-format.c. We could have also used the shorter name raw.c, except that collides with the earlier use of that filename for a different license, and it's better to be safe than risk license pollution.

    The next patch will also rename raw-win32.c and raw-posix.c to further distinguish the difference in roles.

    It doesn't hurt that this gets rid of an underscore in the filename, thereby making tab-completion on 'ra<TAB>' easier (now I don't have to type the shift key, which slows things down :)

    Suggested-by: Daniel P. Berrange <berrange@redhat.com>
    Signed-off-by: Eric Blake <eblake@redhat.com>
    Reviewed-by: Laszlo Ersek <lersek@redhat.com>
    Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* blkverify: Implement bdrv_co_preadv/pwritev/flush | Kevin Wolf | 2017-01-09 | 1 | -105/+96
    This enables byte granularity requests for blkverify, and at the same time gets us rid of another user of the BDS-level AIO emulation.

    The reference output of a test case must be changed because the verification failure message reports byte offsets instead of sectors now.

    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    Reviewed-by: Eric Blake <eblake@redhat.com>
* blkdebug: Implement bdrv_co_preadv/pwritev/flush | Kevin Wolf | 2017-01-09 | 1 | -46/+40
    This enables byte granularity requests for blkdebug, and at the same time gets us rid of another user of the BDS-level AIO emulation.

    Note that unless align=512 is specified, this can behave subtly differently from the old behaviour because bdrv_co_preadv/pwritev don't have to perform alignment adjustments any more.

    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    Reviewed-by: Eric Blake <eblake@redhat.com>
* quorum: Clean up quorum_aio_get() | Kevin Wolf | 2017-01-09 | 1 | -13/+10
    Make sure that all fields of the new QuorumAIOCB are zeroed when the function returns even without explicitly setting them. This will protect us when new fields are added, removes some explicit zero assignment and makes the code a little nicer to read.

    Suggested-by: Eric Blake <eblake@redhat.com>
    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    Reviewed-by: Eric Blake <eblake@redhat.com>
    Reviewed-by: Alberto Garcia <berto@igalia.com>
* quorum: Inline quorum_fifo_aio_cb() | Kevin Wolf | 2017-01-09 | 1 | -29/+13
    Inlining the function removes some boilerplate code and replaces recursion by a simple loop, so the code becomes somewhat easier to understand.

    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    Reviewed-by: Alberto Garcia <berto@igalia.com>
    Reviewed-by: Eric Blake <eblake@redhat.com>
* quorum: Implement .bdrv_co_preadv/pwritev() | Kevin Wolf | 2017-01-09 | 1 | -43/+38
    This enables byte granularity requests on quorum nodes.

    Note that the QMP events emitted by the driver are an external API that we were careless enough to define as sector based. The offset and length of requests reported in events are rounded therefore.

    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    Reviewed-by: Eric Blake <eblake@redhat.com>
    Reviewed-by: Alberto Garcia <berto@igalia.com>
* quorum: Avoid bdrv_aio_writev() for rewrites | Kevin Wolf | 2017-01-09 | 1 | -15/+31
    Replacing it with bdrv_co_pwritev() prepares us for byte granularity requests and gets us rid of the last bdrv_aio_*() user in quorum.

    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    Reviewed-by: Alberto Garcia <berto@igalia.com>
    Reviewed-by: Eric Blake <eblake@redhat.com>
* quorum: Inline quorum_aio_cb() | Kevin Wolf | 2017-01-09 | 1 | -69/+59
    This is a conversion to a more natural coroutine style and improves the readability of the driver.

    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    Reviewed-by: Alberto Garcia <berto@igalia.com>
    Reviewed-by: Eric Blake <eblake@redhat.com>
* quorum: Do cleanup in caller coroutine | Kevin Wolf | 2017-01-09 | 1 | -6/+9
    Instead of calling quorum_aio_finalize() deeply nested in what used to be an AIO callback, do it in the same functions that allocated the AIOCB.

    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    Reviewed-by: Alberto Garcia <berto@igalia.com>
    Reviewed-by: Eric Blake <eblake@redhat.com>
* quorum: Implement .bdrv_co_readv/writev | Kevin Wolf | 2017-01-09 | 1 | -77/+115
    This converts the quorum block driver from implementing callback-based interfaces for read/write to coroutine-based ones. This is the first step that will allow us further simplification of the code.

    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    Reviewed-by: Eric Blake <eblake@redhat.com>
    Reviewed-by: Alberto Garcia <berto@igalia.com>
* quorum: Remove s from quorum_aio_get() arguments | Kevin Wolf | 2017-01-09 | 1 | -5/+4
    There is no point in passing the value of bs->opaque in order to overwrite it with itself.

    Signed-off-by: Kevin Wolf <kwolf@redhat.com>
    Reviewed-by: Eric Blake <eblake@redhat.com>
    Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
    Reviewed-by: Alberto Garcia <berto@igalia.com>
* linux-aio: poll ring for completions | Stefan Hajnoczi | 2017-01-03 | 1 | -1/+16
    The Linux AIO userspace ABI includes a ring that is shared with the kernel. This allows userspace programs to process completions without system calls.

    Add an AioContext poll handler to check for completions in the ring.

    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
    Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
    Message-id: 20161201192652.9509-6-stefanha@redhat.com
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* aio: add AioPollFn and io_poll() interface | Stefan Hajnoczi | 2017-01-03 | 8 | -31/+33
    The new AioPollFn io_poll() argument to aio_set_fd_handler() and aio_set_event_handler() is used in the next patch.

    Keep this code change separate due to the number of files it touches.

    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
    Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
    Message-id: 20161201192652.9509-3-stefanha@redhat.com
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* Merge remote-tracking branch 'kwolf/tags/for-upstream' into staging | Stefan Hajnoczi | 2016-12-06 | 1 | -1/+2
|\
    Block layer patches for 2.8.0-rc3

    # gpg: Signature made Tue 06 Dec 2016 02:44:39 PM GMT
    # gpg: using RSA key 0x7F09B272C88F2FD6
    # gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
    # Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6

    * kwolf/tags/for-upstream:
      qcow2: Don't strand clusters near 2G intervals during commit

    Message-id: 1481037418-10239-1-git-send-email-kwolf@redhat.com
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
| * qcow2: Don't strand clusters near 2G intervals during commit | Eric Blake | 2016-12-06 | 1 | -1/+2
      The qcow2_make_empty() function is reached during 'qemu-img commit', in order to clear out ALL clusters of an image. However, if the image cannot use the fast code path (true if the image is format 0.10, or if the image contains a snapshot), the cluster size is larger than 512, and the image is larger than 2G in size, then our choice of sector_step causes problems. Since it is not cluster aligned, but qcow2_discard_clusters() silently ignores an unaligned head or tail, we are leaving clusters allocated.

      Enhance the testsuite to expose the flaw, and patch the problem by ensuring our step size is aligned.

      Signed-off-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: John Snow <jsnow@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
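The alignment problem can be seen with plain integer arithmetic. The Python sketch below assumes the step was derived from INT_MAX in 512-byte sectors, which is consistent with the "2G intervals" in the subject line; the function names are hypothetical and the aligned variant is one possible way to round the step down, not necessarily the exact expression used in the patch.

```python
INT_MAX = 2 ** 31 - 1
BDRV_SECTOR_SIZE = 512


def unaligned_step_sectors():
    """Largest step below 2G, in sectors, with no regard for clusters."""
    return INT_MAX // BDRV_SECTOR_SIZE


def aligned_step_sectors(cluster_size):
    """Same limit, rounded down to a whole number of clusters first."""
    return (INT_MAX // cluster_size * cluster_size) // BDRV_SECTOR_SIZE


cluster_size = 64 * 1024
cluster_sectors = cluster_size // BDRV_SECTOR_SIZE

# Sectors left over past the last cluster boundary in each step.
unaligned_remainder = unaligned_step_sectors() % cluster_sectors
aligned_remainder = aligned_step_sectors(cluster_size) % cluster_sectors
```

With 64k clusters the unaligned step ends 127 sectors into a cluster, so every step boundary leaves a partially covered cluster that a discard ignoring unaligned heads and tails would skip.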
* | block/nfs: fix QMP to match debug option | Prasanna Kumar Kalever | 2016-12-05 | 1 | -2/+2
      The QMP definition of BlockdevOptionsNfs:

      { 'struct': 'BlockdevOptionsNfs',
        'data': { 'server': 'NFSServer',
                  'path': 'str',
                  '*user': 'int',
                  '*group': 'int',
                  '*tcp-syn-count': 'int',
                  '*readahead-size': 'int',
                  '*page-cache-size': 'int',
                  '*debug-level': 'int' } }

      To make this consistent with other block protocols like gluster, let's change s/debug-level/debug/

      Suggested-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
      Reviewed-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Jeff Cody <jcody@redhat.com>
* | block/gluster: fix QMP to match debug option | Prasanna Kumar Kalever | 2016-12-05 | 1 | -20/+20
|/
      The QMP definition of BlockdevOptionsGluster:

      { 'struct': 'BlockdevOptionsGluster',
        'data': { 'volume': 'str',
                  'path': 'str',
                  'server': ['GlusterServer'],
                  '*debug-level': 'int',
                  '*logfile': 'str' } }

      But instead of 'debug-level' we have exported 'debug' as the option for choosing the debug level of the gluster protocol driver.

      This patch fixes the QMP definition of BlockdevOptionsGluster: s/debug-level/debug/

      Suggested-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Prasanna Kumar Kalever <prasanna.kalever@redhat.com>
      Reviewed-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Jeff Cody <jcody@redhat.com>
* Merge remote-tracking branch 'kwolf/tags/for-upstream' into staging | Stefan Hajnoczi | 2016-11-29 | 2 | -3/+11
|\
    Block layer patches for 2.8.0-rc2

    # gpg: Signature made Tue 29 Nov 2016 03:16:10 PM GMT
    # gpg: using RSA key 0x7F09B272C88F2FD6
    # gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
    # Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6

    * kwolf/tags/for-upstream:
      docs: Specify that cache-clean-interval is only supported in Linux
      qcow2: Remove stale comment
      qcow2: Allow 'cache-clean-interval' in Linux only
      qcow2: Make qcow2_cache_table_release() work only in Linux

    Message-id: 1480436227-2211-1-git-send-email-kwolf@redhat.com
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
| * qcow2: Remove stale comment | Alberto Garcia | 2016-11-25 | 1 | -1/+0
      We haven't been using CONFIG_MADVISE since 02d0e095031b7fda77de8b

      Signed-off-by: Alberto Garcia <berto@igalia.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * qcow2: Allow 'cache-clean-interval' in Linux only | Alberto Garcia | 2016-11-25 | 1 | -0/+8
      The cache-clean-interval option of qcow2 only works on Linux. However we allow setting it in other systems regardless of whether it works or not.

      In those systems this option is not simply a no-op: it actually invalidates perfectly valid cache tables for no good reason without freeing their memory.

      This patch forbids using that option in non-Linux systems.

      Signed-off-by: Alberto Garcia <berto@igalia.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * qcow2: Make qcow2_cache_table_release() work only in Linux | Alberto Garcia | 2016-11-25 | 1 | -2/+3
      We are using QEMU_MADV_DONTNEED to discard the memory of individual L2 cache tables. The problem with this is that those semantics are specific to the Linux madvise() system call. Other implementations of madvise() (including the very Linux implementation of posix_madvise()) don't do that, so we cannot use them for the same purpose.

      This patch makes the code Linux-specific and uses madvise() directly since there's no point in going through qemu_madvise() for this.

      Signed-off-by: Alberto Garcia <berto@igalia.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* | Merge remote-tracking branch 'bonzini/tags/for-upstream' into staging | Stefan Hajnoczi | 2016-11-23 | 1 | -0/+4
|\ \
| |/
|/|
      Small fixes for rc1.

      # gpg: Signature made Tue 22 Nov 2016 10:26:56 PM GMT
      # gpg: using RSA key 0xBFFBD25F78C7AE83
      # gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>"
      # gpg: aka "Paolo Bonzini <pbonzini@redhat.com>"
      # Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4 E2F7 7E15 100C CD36 69B1
      # Subkey fingerprint: F133 3857 4B66 2389 866C 7682 BFFB D25F 78C7 AE83

      * bonzini/tags/for-upstream:
        scsi/esp: do not raise an interrupt when reading the FIFO register
        nbd: Allow unmap and fua during write zeroes
        cpu_ldst.h: use correct guest address parameter

      Message-id: 1479853676-35995-1-git-send-email-pbonzini@redhat.com
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
| * nbd: Allow unmap and fua during write zeroes | Eric Blake | 2016-11-22 | 1 | -0/+4
      Commit fa778fff wired up support to send the NBD_CMD_WRITE_ZEROES, but forgot to inform the block layer that FUA unmapping of zeroes is supported. Without BDRV_REQ_MAY_UNMAP listed as a supported flag, the block layer will always insist on the NBD layer passing NBD_CMD_FLAG_NO_HOLE, resulting in the server always allocating things even when it was desired to let the server punch holes. Similarly, failing to set BDRV_REQ_FUA means that the client may send unnecessary NBD_CMD_FLUSH when it could have instead used the NBD_CMD_FLAG_FUA bit.

      CC: qemu-stable@nongnu.org
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Message-Id: <1479413642-22463-2-git-send-email-eblake@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | block: Pass unaligned discard requests to driversEric Blake2016-11-221-13/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Discard is advisory, so rounding the requests to alignment boundaries is never semantically wrong from the data that the guest sees. But at least the Dell Equallogic iSCSI SANs has an interesting property that its advertised discard alignment is 15M, yet documents that discarding a sequence of 1M slices will eventually result in the 15M page being marked as discarded, and it is possible to observe which pages have been discarded. Between commits 9f1963b and b8d0a980, we converted the block layer to a byte-based interface that ultimately ignores any unaligned head or tail based on the driver's advertised discard granularity, which means that qemu 2.7 refuses to pass any discard request smaller than 15M down to the Dell Equallogic hardware. This is a slight regression in behavior compared to earlier qemu, where a guest executing discards in power-of-2 chunks used to be able to get every page discarded, but is now left with various pages still allocated because the guest requests did not align with the hardware's 15M pages. Since the SCSI specification says nothing about a minimum discard granularity, and only documents the preferred alignment, it is best if the block layer gives the driver every bit of information about discard requests, rather than rounding it to alignment boundaries early. Rework the block layer discard algorithm to mirror the write zero algorithm: always peel off any unaligned head or tail and manage that in isolation, then do the bulk of the request on an aligned boundary. The fallback when the driver returns -ENOTSUP for an unaligned request is to silently ignore that portion of the discard request; but for devices that can pass the partial request all the way down to hardware, this can result in the hardware coalescing requests and discarding aligned pages after all. 
Reported by: Peter Lieven <pl@kamp.de>
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
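The head/core/tail fragmentation described in the commit message above can be illustrated with a minimal sketch. This is not the code from the commit: the `DiscardSplit` type and the `split_discard` function are hypothetical names introduced only for illustration, and the sketch assumes a single alignment value with no other request limits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: byte lengths of the three fragments. */
typedef struct {
    uint64_t head;  /* unaligned bytes before the first alignment boundary */
    uint64_t core;  /* aligned middle part, always a multiple of align */
    uint64_t tail;  /* unaligned bytes after the last alignment boundary */
} DiscardSplit;

/* Illustrative helper (not a QEMU function): split a discard request
 * of `bytes` bytes at `offset` into head, aligned core, and tail. */
static DiscardSplit split_discard(uint64_t offset, uint64_t bytes,
                                  uint64_t align)
{
    DiscardSplit s = {0, 0, 0};
    uint64_t head;

    assert(align > 0);

    /* Distance from offset to the next alignment boundary, if any. */
    head = offset % align;
    if (head != 0) {
        head = align - head;
    }
    /* A request that ends before the boundary is all head. */
    if (head > bytes) {
        head = bytes;
    }
    s.head = head;
    bytes -= head;

    /* Whatever does not fill a whole page at the end is the tail. */
    s.tail = bytes % align;
    s.core = bytes - s.tail;
    return s;
}
```

With a 15M alignment, a 1M discard at offset 1M comes out entirely as head, so under this scheme the driver still sees the request and a device that coalesces small discards can act on it.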
* | block: Return -ENOTSUP rather than assert on unaligned discards  Eric Blake  2016-11-22  3  -3/+11
Right now, the block layer rounds discard requests, so that individual drivers are able to assert that discard requests will never be unaligned. But there are some iSCSI devices that track and coalesce multiple unaligned requests, turning them into an actual discard if the requests eventually cover an entire page, which implies that it is better to always pass discard requests as low down the stack as possible.

In isolation, this patch has no semantic effect, since the block layer currently never passes an unaligned request through. The block layer also already has code that silently ignores drivers that return -ENOTSUP for a discard request that cannot be honored (as well as drivers that return 0 even when nothing was done). The next patch, however, will update the block layer to fragment discard requests, so that clients are guaranteed that they are either dealing with an unaligned head or tail, or an aligned core, making it similar to the block layer semantics of write zero fragmentation.

CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
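A rough sketch of the contract this commit message describes: a driver reports -ENOTSUP for an unaligned discard instead of asserting, and the caller treats that result as a successful no-op because discard is advisory. Both function names below are hypothetical and do not correspond to actual QEMU driver callbacks.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical driver callback: this imaginary device can only
 * discard whole pages of `align` bytes. */
static int demo_driver_pdiscard(uint64_t offset, uint64_t bytes,
                                uint64_t align)
{
    if (align == 0 || offset % align != 0 || bytes % align != 0) {
        /* Report the limitation to the caller instead of asserting. */
        return -ENOTSUP;
    }
    return 0;  /* pretend the aligned range was discarded */
}

/* Hypothetical caller: discard is advisory, so a driver that cannot
 * honor the request is silently ignored rather than failing the I/O. */
static int demo_block_pdiscard(uint64_t offset, uint64_t bytes,
                               uint64_t align)
{
    int ret = demo_driver_pdiscard(offset, bytes, align);

    return ret == -ENOTSUP ? 0 : ret;
}
```

A driver for hardware that coalesces partial discards would instead pass the unaligned range down and return whatever the device reports.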
* | block: Let write zeroes fallback work even with small max_transfer  Eric Blake  2016-11-22  1  -5/+8
Commit 443668ca rewrote the write_zeroes logic to guarantee that an unaligned request never crosses a cluster boundary. But in the rewrite, the new code assumed that at most one iteration would be needed to get to an alignment boundary.

However, it is easy to trigger an assertion failure: the Linux kernel limits loopback devices to advertising a max_transfer of only 64k. Any operation that requires falling back to writes rather than more efficient zeroing must obey max_transfer during that fallback, which means an unaligned head may require multiple iterations of the write fallbacks before reaching the aligned boundaries, when layering a format with clusters larger than 64k atop the protocol of file access to a loopback device.

Test case:

$ qemu-img create -f qcow2 -o cluster_size=1M file 10M
$ losetup /dev/loop2 /path/to/file
$ qemu-io -f qcow2 /dev/loop2
qemu-io> w 7m 1k
qemu-io> w -z 8003584 2093056

In fairness to Denis (as the original listed author of the culprit commit), the faulty logic for at most one iteration is probably all my fault in reworking his idea. But the solution is to restore what was in place prior to that commit: when dealing with an unaligned head or tail, iterate as many times as necessary while fragmenting the operation at max_transfer boundaries.

Reported-by: Ed Swierk <eswierk@skyportsystems.com>
CC: qemu-stable@nongnu.org
CC: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
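To see why a single iteration is not enough for the test case above, the following sketch counts how many write fallbacks the unaligned head needs when every write is capped at max_transfer. It is only an approximation of the idea, not the patched QEMU code: `head_fallback_iterations` is a hypothetical helper, and it simply chunks the head into pieces of at most `max_transfer` bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper (not a QEMU function): number of write
 * fallbacks needed to cover the unaligned head of a request when
 * each write may carry at most `max_transfer` bytes. */
static unsigned head_fallback_iterations(uint64_t offset, uint64_t bytes,
                                         uint64_t cluster,
                                         uint64_t max_transfer)
{
    uint64_t head, remaining;
    unsigned iterations = 0;

    assert(cluster > 0 && max_transfer > 0);

    head = offset % cluster;
    if (head == 0) {
        return 0;  /* request already starts on a cluster boundary */
    }

    /* Bytes between offset and the next cluster boundary,
     * limited to the length of the request itself. */
    remaining = cluster - head;
    if (remaining > bytes) {
        remaining = bytes;
    }

    /* Loop until the head is consumed; one pass is not assumed. */
    while (remaining > 0) {
        uint64_t num = remaining < max_transfer ? remaining : max_transfer;

        remaining -= num;
        iterations++;
    }
    return iterations;
}
```

For `w -z 8003584 2093056` with 1M clusters, the head runs from byte 8003584 to the 8M boundary, which is 385024 bytes (376k). With a 64k max_transfer that takes six writes, so code assuming at most one iteration cannot reach the aligned boundary.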