path: root/block
* file-posix: Mitigate file fragmentation with extent size hints (Kevin Wolf, 2020-07-14; 1 file changed, +44/-0)

  Especially when O_DIRECT is used with image files so that the page cache
  indirection can't cause a merge of allocating requests, the file will
  fragment on the file system layer, with a potentially very small fragment
  size (this depends on the requests the guest sent).

  On Linux, fragmentation can be reduced by setting an extent size hint when
  creating the file (at least on XFS, it can't be set any more after the
  first extent has been allocated), basically giving raw files a "cluster
  size" for allocation.

  This adds a create option to set the extent size hint, and changes the
  default from not setting a hint to setting it to 1 MB. The main reason why
  qcow2 defaults to smaller cluster sizes is that COW becomes more expensive,
  which is not an issue with raw files, so we can choose a larger size. The
  tradeoff here is only potentially wasted disk space.

  For qcow2 (or other image formats) over file-posix, the advantage should
  even be greater because they grow sequentially without leaving holes, so
  there won't be wasted space. Setting even larger extent size hints for such
  images may make sense. This can be done with the new option, but let's keep
  the default conservative for now.

  The effect is very visible with a test that intentionally creates a badly
  fragmented file with qemu-img bench (the time difference while creating the
  file is already remarkable) and then looks at the number of extents and the
  time a simple "qemu-img map" takes.

  Without an extent size hint:

      $ ./qemu-img create -f raw -o extent_size_hint=0 ~/tmp/test.raw 10G
      Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=0
      $ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
      Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
      Run completed in 25.848 seconds.
      $ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
      Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
      Run completed in 19.616 seconds.
      $ filefrag ~/tmp/test.raw
      /home/kwolf/tmp/test.raw: 2000000 extents found
      $ time ./qemu-img map ~/tmp/test.raw
      Offset          Length          Mapped to       File
      0               0x1e8480000     0               /home/kwolf/tmp/test.raw
      real    0m1,279s
      user    0m0,043s
      sys     0m1,226s

  With the new default extent size hint of 1 MB:

      $ ./qemu-img create -f raw -o extent_size_hint=1M ~/tmp/test.raw 10G
      Formatting '/home/kwolf/tmp/test.raw', fmt=raw size=10737418240 extent_size_hint=1048576
      $ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 0
      Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 8192)
      Run completed in 11.833 seconds.
      $ ./qemu-img bench -f raw -t none -n -w ~/tmp/test.raw -c 1000000 -S 8192 -o 4096
      Sending 1000000 write requests, 4096 bytes each, 64 in parallel (starting at offset 4096, step size 8192)
      Run completed in 10.155 seconds.
      $ filefrag ~/tmp/test.raw
      /home/kwolf/tmp/test.raw: 178 extents found
      $ time ./qemu-img map ~/tmp/test.raw
      Offset          Length          Mapped to       File
      0               0x1e8480000     0               /home/kwolf/tmp/test.raw
      real    0m0,061s
      user    0m0,040s
      sys     0m0,014s

  Signed-off-by: Kevin Wolf <kwolf@redhat.com>
  Message-Id: <20200707142329.48303-1-kwolf@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Signed-off-by: Kevin Wolf <kwolf@redhat.com>
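  For illustration, on Linux/XFS an extent size hint like the one this option
  sets is applied with the FS_IOC_FSSETXATTR ioctl while the file is still
  empty. A minimal sketch of the mechanism (not the actual file-posix.c code;
  the helper name is illustrative):

      #include <errno.h>
      #include <stdint.h>
      #include <sys/ioctl.h>
      #include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR/FSSETXATTR */

      /* Give a freshly created, still empty file a per-file extent size hint. */
      static int set_extent_size_hint(int fd, uint32_t hint_bytes)
      {
          struct fsxattr attr;

          if (ioctl(fd, FS_IOC_FSGETXATTR, &attr) < 0) {
              return -errno;
          }
          attr.fsx_xflags |= FS_XFLAG_EXTSIZE;   /* enable the per-file hint */
          attr.fsx_extsize = hint_bytes;         /* hint in bytes, e.g. 1 MB */
          if (ioctl(fd, FS_IOC_FSSETXATTR, &attr) < 0) {
              return -errno;
          }
          return 0;
      }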
* iscsi: return -EIO when sense fields are meaningless (Xie Yongji, 2020-07-11; 1 file changed, +2/-0)

  When an I/O request fails, we currently only return a meaningful value for
  the SCSI check condition case. We should also have a default errno such as
  -EIO for the other cases.

  Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
  Message-Id: <20200701105444.3226-2-xieyongji@bytedance.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* iscsi: handle check condition status in retry loop (Xie Yongji, 2020-07-11; 1 file changed, +10/-10)

  The handling of check condition status was incorrect because we would only
  do it after the retries exceeded the maximum.

  Fixes: 8c460269aa ("iscsi: base all handling of check condition on scsi_sense_to_errno")
  Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
  Message-Id: <20200701105444.3226-1-xieyongji@bytedance.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* nbd: Use ERRP_GUARD() (Vladimir Sementsov-Ogievskiy, 2020-07-10; 1 file changed, +3/-4)

  If we want to check error after errp-function call, we need to introduce
  local_err and then propagate it to errp. Instead, use the ERRP_GUARD()
  macro, benefits are:

  1. No need of explicit error_propagate call
  2. No need of explicit local_err variable: use errp directly
  3. ERRP_GUARD() leaves errp as is if it's not NULL or &error_fatal, this
     means that we don't break error_abort (we'll abort on error_set, not on
     error_propagate)

  If we want to add some info to errp (by error_prepend() or
  error_append_hint()), we must use the ERRP_GUARD() macro. Otherwise, this
  info will not be added when errp == &error_fatal (the program will exit
  prior to the error_append_hint() or error_prepend() call). Fix several
  such cases, e.g. in nbd_read().

  This commit is generated by command

      sed -n '/^Network Block Device (NBD)$/,/^$/{s/^F: //p}' \
          MAINTAINERS | \
      xargs git ls-files | grep '\.[hc]$' | \
      xargs spatch \
          --sp-file scripts/coccinelle/errp-guard.cocci \
          --macro-file scripts/cocci-macro-file.h \
          --in-place --no-show-diff --max-width 80

  Reported-by: Kevin Wolf <kwolf@redhat.com>
  Reported-by: Greg Kurz <groug@kaod.org>
  Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Reviewed-by: Markus Armbruster <armbru@redhat.com>
  [Commit message tweaked]
  Signed-off-by: Markus Armbruster <armbru@redhat.com>
  Message-Id: <20200707165037.1026246-8-armbru@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  [ERRP_AUTO_PROPAGATE() renamed to ERRP_GUARD(), and
   auto-propagated-errp.cocci to errp-guard.cocci. Commit message tweaked
   again.]
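  As an illustration of the converted idiom (a sketch of the pattern with
  hypothetical function and type names, not the exact nbd.c hunk):

      /* Before: explicit local_err plus error_propagate() */
      static int do_request_old(Thing *t, Error **errp)
      {
          Error *local_err = NULL;

          nbd_read_step(t, &local_err);            /* hypothetical callee */
          if (local_err) {
              error_prepend(&local_err, "request failed: ");
              error_propagate(errp, local_err);
              return -EIO;
          }
          return 0;
      }

      /* After: ERRP_GUARD() makes it safe to test and prepend to *errp */
      static int do_request_new(Thing *t, Error **errp)
      {
          ERRP_GUARD();

          nbd_read_step(t, errp);
          if (*errp) {
              error_prepend(errp, "request failed: ");
              return -EIO;
          }
          return 0;
      }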
* error: Avoid error_propagate() after migrate_add_blocker() (Markus Armbruster, 2020-07-10; 7 files changed, +14/-25)

  When migrate_add_blocker(blocker, &errp) is followed by
  error_propagate(errp, err), we can often just as well do
  migrate_add_blocker(..., errp).

  Do that with this Coccinelle script:

      @@
      expression blocker, err, errp;
      expression ret;
      @@
      -    ret = migrate_add_blocker(blocker, &err);
      -    if (err) {
      +    ret = migrate_add_blocker(blocker, errp);
      +    if (ret < 0) {
               ... when != err;
      -        error_propagate(errp, err);
               ...
           }

      @@
      expression blocker, err, errp;
      @@
      -    migrate_add_blocker(blocker, &err);
      -    if (err) {
      +    if (migrate_add_blocker(blocker, errp) < 0) {
               ... when != err;
      -        error_propagate(errp, err);
               ...
           }

  Double-check @err is not used afterwards. Dereferencing it would be use
  after free, but checking whether it's null would be legitimate.

  Signed-off-by: Markus Armbruster <armbru@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Message-Id: <20200707160613.848843-43-armbru@redhat.com>
* qapi: Smooth another visitor error checking pattern (Markus Armbruster, 2020-07-10; 11 files changed, +22/-54)

  Convert

      visit_type_FOO(v, ..., &ptr, &err);
      ...
      if (err) {
          ...
      }

  to

      visit_type_FOO(v, ..., &ptr, errp);
      ...
      if (!ptr) {
          ...
      }

  for functions that set @ptr to non-null / null on success / error.

  Eliminate error_propagate() that are now unnecessary. Delete @err that are
  now unused.

  Signed-off-by: Markus Armbruster <armbru@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200707160613.848843-40-armbru@redhat.com>
* block/parallels: Simplify parallels_open() after previous commit (Markus Armbruster, 2020-07-10; 1 file changed, +2/-5)

  Signed-off-by: Markus Armbruster <armbru@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Message-Id: <20200707160613.848843-39-armbru@redhat.com>
* error: Reduce unnecessary error propagation (Markus Armbruster, 2020-07-10; 5 files changed, +9/-10)

  When all we do with an Error we receive into a local variable is
  propagating to somewhere else, we can just as well receive it there right
  away, even when we need to keep error_propagate() for other error paths.

  Signed-off-by: Markus Armbruster <armbru@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200707160613.848843-38-armbru@redhat.com>
* error: Eliminate error_propagate() manually (Markus Armbruster, 2020-07-10; 1 file changed, +1/-3)

  When all we do with an Error we receive into a local variable is
  propagating to somewhere else, we can just as well receive it there right
  away. The previous two commits did that for sufficiently simple cases with
  Coccinelle. Do it for several more manually.

  Signed-off-by: Markus Armbruster <armbru@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200707160613.848843-37-armbru@redhat.com>
* error: Eliminate error_propagate() with Coccinelle, part 2 (Markus Armbruster, 2020-07-10; 17 files changed, +25/-61)

  When all we do with an Error we receive into a local variable is
  propagating to somewhere else, we can just as well receive it there right
  away. The previous commit did that with a Coccinelle script I consider
  fairly trustworthy.

  This commit uses the same script with the matching of return taken out,
  i.e. we convert

      if (!foo(..., &err)) {
          ...
          error_propagate(errp, err);
          ...
      }

  to

      if (!foo(..., errp)) {
          ...
          ...
      }

  This is unsound: @err could still be read afterwards. I don't know how to
  express "no read of @err without an intervening write" in Coccinelle.
  Instead, I manually double-checked for uses of @err.

  Suboptimal line breaks tweaked manually. qdev_realize() simplified further
  to placate scripts/checkpatch.pl.

  Signed-off-by: Markus Armbruster <armbru@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200707160613.848843-36-armbru@redhat.com>
* error: Eliminate error_propagate() with Coccinelle, part 1 (Markus Armbruster, 2020-07-10; 9 files changed, +10/-24)

  When all we do with an Error we receive into a local variable is
  propagating to somewhere else, we can just as well receive it there right
  away. Convert

      if (!foo(..., &err)) {
          ...
          error_propagate(errp, err);
          ...
          return ...
      }

  to

      if (!foo(..., errp)) {
          ...
          ...
          return ...
      }

  where nothing else needs @err. Coccinelle script:

      @rule1 forall@
      identifier fun, err, errp, lbl;
      expression list args, args2;
      binary operator op;
      constant c1, c2;
      symbol false;
      @@
       if (
      (
      -    fun(args, &err, args2)
      +    fun(args, errp, args2)
      |
      -    !fun(args, &err, args2)
      +    !fun(args, errp, args2)
      |
      -    fun(args, &err, args2) op c1
      +    fun(args, errp, args2) op c1
      )
          )
       {
           ... when != err
               when != lbl:
               when strict
      -    error_propagate(errp, err);
           ... when != err
       (
           return;
       |
           return c2;
       |
           return false;
       )
       }

      @rule2 forall@
      identifier fun, err, errp, lbl;
      expression list args, args2;
      expression var;
      binary operator op;
      constant c1, c2;
      symbol false;
      @@
      -    var = fun(args, &err, args2);
      +    var = fun(args, errp, args2);
           ... when != err
           if (
      (
          var
      |
          !var
      |
          var op c1
      )
          )
       {
           ... when != err
               when != lbl:
               when strict
      -    error_propagate(errp, err);
           ... when != err
       (
           return;
       |
           return c2;
       |
           return false;
       |
           return var;
       )
       }

      @depends on rule1 || rule2@
      identifier err;
      @@
      -    Error *err = NULL;
           ... when != err

  Not exactly elegant, I'm afraid.

  The "when != lbl:" is necessary to avoid transforming

      if (fun(args, &err)) {
          goto out
      }
      ...
      out:
          error_propagate(errp, err);

  even though other paths to label out still need the error_propagate(). For
  an actual example, see sclp_realize().

  Without the "when strict", Coccinelle transforms vfio_msix_setup(),
  incorrectly. I don't know what exactly "when strict" does, only that it
  helps here.

  The match of return is narrower than what I want, but I can't figure out
  how to express "return where the operand doesn't use @err". For an example
  where it's too narrow, see vfio_intx_enable().

  Silently fails to convert hw/arm/armsse.c, because Coccinelle gets
  confused by ARMSSE being used both as typedef and function-like macro
  there. Converted manually.

  Line breaks tidied up manually. One nested declaration of @local_err
  deleted manually. Preexisting unwanted blank line dropped in
  hw/riscv/sifive_e.c.

  Signed-off-by: Markus Armbruster <armbru@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200707160613.848843-35-armbru@redhat.com>
* error: Avoid unnecessary error_propagate() after error_setg() (Markus Armbruster, 2020-07-10; 4 files changed, +25/-33)

  Replace

      error_setg(&err, ...);
      error_propagate(errp, err);

  by

      error_setg(errp, ...);

  Related pattern:

      if (...) {
          error_setg(&err, ...);
          goto out;
      }
      ...
      out:
          error_propagate(errp, err);
          return;

  When all paths to label out are that way, replace by

      if (...) {
          error_setg(errp, ...);
          return;
      }

  and delete the label along with the error_propagate().

  When we have at most one other path that actually needs to propagate, and
  maybe one at the end where propagation is unnecessary, e.g.

      foo(..., &err);
      if (err) {
          goto out;
      }
      ...
      bar(..., &err);
      out:
          error_propagate(errp, err);
          return;

  move the error_propagate() to where it's needed, like

      if (...) {
          foo(..., &err);
          error_propagate(errp, err);
          return;
      }
      ...
      bar(..., errp);
      return;

  and transform the error_setg() as above.

  In some places, the transformation results in obviously unnecessary
  error_propagate(). The next few commits will eliminate them.

  Bonus: the elimination of gotos will make later patches in this series
  easier to review.

  Candidates for conversion tracked down with this Coccinelle script:

      @@
      identifier err, errp;
      expression list args;
      @@
      -    error_setg(&err, args);
      +    error_setg(errp, args);
           ... when != err
           error_propagate(errp, err);

  Signed-off-by: Markus Armbruster <armbru@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200707160613.848843-34-armbru@redhat.com>
* qapi: Use returned bool to check for failure, manual part (Markus Armbruster, 2020-07-10; 1 file changed, +2/-3)

  The previous commit used Coccinelle to convert from checking the Error
  object to checking the return value. Convert a few more manually.

  Also tweak control flow in places to conform to the conventional "if error
  bail out" pattern.

  Signed-off-by: Markus Armbruster <armbru@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Message-Id: <20200707160613.848843-20-armbru@redhat.com>
* qapi: Use returned bool to check for failure, Coccinelle part (Markus Armbruster, 2020-07-10; 4 files changed, +5/-10)

  The previous commit enables conversion of

      visit_foo(..., &err);
      if (err) {
          ...
      }

  to

      if (!visit_foo(..., errp)) {
          ...
      }

  for visitor functions that now return true / false on success / error.

  Coccinelle script:

      @@
      identifier fun =~ "check_list|input_type_enum|lv_start_struct|lv_type_bool|lv_type_int64|lv_type_str|lv_type_uint64|output_type_enum|parse_type_bool|parse_type_int64|parse_type_null|parse_type_number|parse_type_size|parse_type_str|parse_type_uint64|print_type_bool|print_type_int64|print_type_null|print_type_number|print_type_size|print_type_str|print_type_uint64|qapi_clone_start_alternate|qapi_clone_start_list|qapi_clone_start_struct|qapi_clone_type_bool|qapi_clone_type_int64|qapi_clone_type_null|qapi_clone_type_number|qapi_clone_type_str|qapi_clone_type_uint64|qapi_dealloc_start_list|qapi_dealloc_start_struct|qapi_dealloc_type_anything|qapi_dealloc_type_bool|qapi_dealloc_type_int64|qapi_dealloc_type_null|qapi_dealloc_type_number|qapi_dealloc_type_str|qapi_dealloc_type_uint64|qobject_input_check_list|qobject_input_check_struct|qobject_input_start_alternate|qobject_input_start_list|qobject_input_start_struct|qobject_input_type_any|qobject_input_type_bool|qobject_input_type_bool_keyval|qobject_input_type_int64|qobject_input_type_int64_keyval|qobject_input_type_null|qobject_input_type_number|qobject_input_type_number_keyval|qobject_input_type_size_keyval|qobject_input_type_str|qobject_input_type_str_keyval|qobject_input_type_uint64|qobject_input_type_uint64_keyval|qobject_output_start_list|qobject_output_start_struct|qobject_output_type_any|qobject_output_type_bool|qobject_output_type_int64|qobject_output_type_null|qobject_output_type_number|qobject_output_type_str|qobject_output_type_uint64|start_list|visit_check_list|visit_check_struct|visit_start_alternate|visit_start_list|visit_start_struct|visit_type_.*";
      expression list args;
      typedef Error;
      Error *err;
      @@
      -    fun(args, &err);
      -    if (err)
      +    if (!fun(args, &err))
           {
               ...
           }

  A few line breaks tidied up manually.

  Signed-off-by: Markus Armbruster <armbru@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Message-Id: <20200707160613.848843-19-armbru@redhat.com>
* qemu-option: Use returned bool to check for failure (Markus Armbruster, 2020-07-10; 21 files changed, +28/-56)

  The previous commit enables conversion of

      foo(..., &err);
      if (err) {
          ...
      }

  to

      if (!foo(..., &err)) {
          ...
      }

  for QemuOpts functions that now return true / false on success / error.

  Coccinelle script:

      @@
      identifier fun = {
          opts_do_parse, parse_option_bool, parse_option_number,
          parse_option_size, qemu_opt_parse, qemu_opt_rename, qemu_opt_set,
          qemu_opt_set_bool, qemu_opt_set_number, qemu_opts_absorb_qdict,
          qemu_opts_do_parse, qemu_opts_from_qdict_entry, qemu_opts_set,
          qemu_opts_validate
      };
      expression list args, args2;
      typedef Error;
      Error *err;
      @@
      -    fun(args, &err, args2);
      -    if (err)
      +    if (!fun(args, &err, args2))
           {
               ...
           }

  A few line breaks tidied up manually.

  Signed-off-by: Markus Armbruster <armbru@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Message-Id: <20200707160613.848843-15-armbru@redhat.com>
  [Conflict with commit 0b6786a9c1 "block/amend: refactor qcow2 amend
   options" resolved by rerunning Coccinelle on master's version]
* qemu-option: Check return value instead of @err where convenient (Markus Armbruster, 2020-07-10; 1 file changed, +2/-2)

  Convert uses like

      opts = qemu_opts_create(..., &err);
      if (err) {
          ...
      }

  to

      opts = qemu_opts_create(..., errp);
      if (!opts) {
          ...
      }

  Eliminate error_propagate() that are now unnecessary. Delete @err that are
  now unused.

  Note that we can't drop parallels_open()'s error_propagate() here. We
  continue to execute it even in the converted case. It's a no-op then:
  local_err is null.

  Signed-off-by: Markus Armbruster <armbru@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Reviewed-by: Greg Kurz <groug@kaod.org>
  Message-Id: <20200707160613.848843-8-armbru@redhat.com>
* qed: Simplify backing reads (Eric Blake, 2020-07-06; 2 files changed, +6/-59)

  The other four drivers that support backing files (qcow, qcow2, parallels,
  vmdk) all rely on the block layer to populate zeroes when reading beyond
  EOF of a short backing file. We can simplify the qed code by doing
  likewise.

  Signed-off-by: Eric Blake <eblake@redhat.com>
  Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Message-Id: <20200528094405.145708-11-vsementsov@virtuozzo.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* block: drop unallocated_blocks_are_zero (Vladimir Sementsov-Ogievskiy, 2020-07-06; 3 files changed, +2/-9)

  Currently this field is only set by qed and qcow2. But in fact, all
  backing-supporting formats (parallels, qcow, qcow2, qed, vmdk) share these
  semantics: on unallocated blocks, if there is no backing file they just
  memset the buffer with zeroes.

  So, document this behavior for .supports_backing and drop
  .unallocated_blocks_are_zero.

  Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200528094405.145708-10-vsementsov@virtuozzo.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/vhdx: drop unallocated_blocks_are_zero (Vladimir Sementsov-Ogievskiy, 2020-07-06; 1 file changed, +0/-3)

  vhdx doesn't have a .bdrv_co_block_status handler, so DATA|ALLOCATED is
  always assumed for it in bdrv_co_block_status().
  unallocated_blocks_are_zero is useless (it doesn't affect the only user of
  the field: bdrv_co_block_status()), drop it.

  Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200528094405.145708-9-vsementsov@virtuozzo.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/file-posix: drop unallocated_blocks_are_zero (Vladimir Sementsov-Ogievskiy, 2020-07-06; 1 file changed, +0/-3)

  raw_co_block_status() in block/file-posix.c never returns 0, so
  unallocated_blocks_are_zero is useless (it doesn't affect the only user of
  the field: bdrv_co_block_status()). Drop it.

  Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200528094405.145708-8-vsementsov@virtuozzo.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/iscsi: drop unallocated_blocks_are_zero (Vladimir Sementsov-Ogievskiy, 2020-07-06; 1 file changed, +0/-1)

  We set bdi->unallocated_blocks_are_zero = iscsilun->lbprz, but
  iscsi_co_block_status doesn't return 0 in case of iscsilun->lbprz, it
  returns ZERO when appropriate. So actually unallocated_blocks_are_zero is
  useless (it doesn't affect the only user of the field:
  bdrv_co_block_status()). Drop it now.

  Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200528094405.145708-7-vsementsov@virtuozzo.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/crypto: drop unallocated_blocks_are_zero (Vladimir Sementsov-Ogievskiy, 2020-07-06; 1 file changed, +0/-1)

  It's false by default, so there is no need to set it. We are going to drop
  this variable entirely, so drop it here now; it doesn't hurt.

  Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200528094405.145708-6-vsementsov@virtuozzo.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/vpc: return ZERO block-status when appropriate (Vladimir Sementsov-Ogievskiy, 2020-07-06; 1 file changed, +1/-2)

  In case get_image_offset() returns -1, we do zero out the corresponding
  chunk of qiov. So, this should be reported as ZERO.

  Note that this changes the visible output of the "qemu-img map
  --output=json" and "qemu-io -c map" commands. For qemu-img map, the change
  is obvious: we just mark as zero what is really zero. For qemu-io it's
  less obvious: what was unallocated is now allocated.

  There is an inconsistency in the understanding of unallocated regions in
  QEMU: backing-supporting format drivers return a 0 block-status to report
  go-to-backing logic for the area. Some protocol drivers (iscsi) return 0
  to report fs-unallocated-non-zero status (i.e., the blocks don't occupy
  space on disk and the read result is undefined).

  BDRV_BLOCK_ALLOCATED is defined as something closer to the go-to-backing
  logic. Still, it is calculated as ZERO | DATA, so 0 from iscsi is treated
  as unallocated. This doesn't influence backing-chain behavior, as iscsi
  can't have a backing file, but it does influence "qemu-io -c map".

  We should solve this inconsistency at some future point. For now, let's
  just make backing-not-supporting format drivers (vdi in the previous patch
  and vpc now) behave more like backing-supporting drivers and not report a
  0 block-status. Moreover, returning ZERO status is an absolutely valid
  thing to do, and again, it corresponds to how the other (backing-
  supporting) format drivers work.

  After the block-status update, the driver never reports 0, so setting
  unallocated_blocks_are_zero doesn't make sense (the only user of it is
  bdrv_co_block_status and it checks unallocated_blocks_are_zero only for
  unallocated areas). Drop it.

  Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200528094405.145708-5-vsementsov@virtuozzo.com>
  [mreitz: qemu-io -c map as used by iotest 146 now reports everything as
   allocated; in order to make the test do something useful, we use
   qemu-img map --output=json now]
  Signed-off-by: Max Reitz <mreitz@redhat.com>
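  The overall shape of a .bdrv_co_block_status handler that reports such
  zero-on-read chunks is roughly the following (a sketch with an
  illustrative helper, not the actual vpc.c change):

      static int coroutine_fn example_co_block_status(BlockDriverState *bs,
                                                      bool want_zero,
                                                      int64_t offset,
                                                      int64_t bytes,
                                                      int64_t *pnum,
                                                      int64_t *map,
                                                      BlockDriverState **file)
      {
          /* hypothetical helper, stands in for get_image_offset() */
          int64_t image_offset = example_get_image_offset(bs, offset);

          *pnum = bytes;
          if (image_offset < 0) {
              /* The read path memsets this chunk, so report ZERO, not 0. */
              return BDRV_BLOCK_ZERO;
          }
          *map = image_offset;
          *file = bs->file->bs;
          return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
      }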
* block/vdi: return ZERO block-status when appropriate (Vladimir Sementsov-Ogievskiy, 2020-07-06; 1 file changed, +1/-2)

  In case of !VDI_IS_ALLOCATED[], we do zero out the corresponding chunk of
  qiov. So, this should be reported as ZERO.

  Note that this changes the visible output of the "qemu-img map
  --output=json" and "qemu-io -c map" commands. For qemu-img map, the change
  is obvious: we just mark as zero what is really zero. For qemu-io it's
  less obvious: what was unallocated is now allocated.

  There is an inconsistency in the understanding of unallocated regions in
  QEMU: backing-supporting format drivers return a 0 block-status to report
  go-to-backing logic for the area. Some protocol drivers (iscsi) return 0
  to report fs-unallocated-non-zero status (i.e., the blocks don't occupy
  space on disk and the read result is undefined).

  BDRV_BLOCK_ALLOCATED is defined as something closer to the go-to-backing
  logic. Still, it is calculated as ZERO | DATA, so 0 from iscsi is treated
  as unallocated. This doesn't influence backing-chain behavior, as iscsi
  can't have a backing file, but it does influence "qemu-io -c map".

  We should solve this inconsistency at some future point. For now, let's
  just make backing-not-supporting format drivers (vdi in this patch and vpc
  in the following one) behave more like backing-supporting drivers and not
  report a 0 block-status. Moreover, returning ZERO status is an absolutely
  valid thing to do, and again, it corresponds to how the other (backing-
  supporting) format drivers work.

  After the block-status update, the driver never reports 0, so setting
  unallocated_blocks_are_zero doesn't make sense (the only user of it is
  bdrv_co_block_status and it checks unallocated_blocks_are_zero only for
  unallocated areas). Drop it.

  Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200528094405.145708-4-vsementsov@virtuozzo.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* block: inline bdrv_unallocated_blocks_are_zero() (Vladimir Sementsov-Ogievskiy, 2020-07-06; 1 file changed, +8/-3)

  The function has only one user: bdrv_co_block_status(). Inline it to
  simplify reviewing of the following patches, which will finally drop the
  unallocated_blocks_are_zero field too.

  Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200528094405.145708-3-vsementsov@virtuozzo.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/qcow2: implement blockdev-amend (Maxim Levitsky, 2020-07-06; 1 file changed, +39/-0)

  Currently the implementation only supports amending the encryption
  options, unlike the qemu-img version.

  Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
  Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
  Reviewed-by: Max Reitz <mreitz@redhat.com>
  Message-Id: <20200608094030.670121-14-mlevitsk@redhat.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/crypto: implement blockdev-amend (Maxim Levitsky, 2020-07-06; 1 file changed, +53/-19)

  Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
  Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
  Reviewed-by: Max Reitz <mreitz@redhat.com>
  Message-Id: <20200608094030.670121-13-mlevitsk@redhat.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/core: add generic infrastructure for x-blockdev-amend qmp command (Maxim Levitsky, 2020-07-06; 2 files changed, +114/-1)

  blockdev-amend will be used similarly to blockdev-create to allow
  on-the-fly changes to the structure of format-based block devices.

  The current plan is to first support encryption keyslot management for
  luks-based formats (raw and embedded in qcow2).

  Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
  Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
  Message-Id: <20200608094030.670121-12-mlevitsk@redhat.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
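  As a rough sketch of how a format driver plugs into this infrastructure
  (the callback signature is my reading of this series and should be treated
  as an assumption; the driver name and comments are illustrative):

      /* A driver advertises amend support by implementing .bdrv_co_amend. */
      static int coroutine_fn example_co_amend(BlockDriverState *bs,
                                               BlockdevAmendOptions *opts,
                                               bool force, Error **errp)
      {
          /* ... validate @opts, then apply the structural change,
           *     e.g. add or erase an encryption keyslot ... */
          return 0;
      }

      static BlockDriver bdrv_example = {
          .format_name   = "example",
          .bdrv_co_amend = example_co_amend,
          /* ... */
      };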
* block/qcow2: extend qemu-img amend interface with crypto options (Maxim Levitsky, 2020-07-06; 1 file changed, +62/-9)

  Now that we have all the infrastructure in place, wire it in the qcow2
  driver and expose this to the user.

  Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
  Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
  Reviewed-by: Max Reitz <mreitz@redhat.com>
  Message-Id: <20200608094030.670121-9-mlevitsk@redhat.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/crypto: implement the encryption key management (Maxim Levitsky, 2020-07-06; 2 files changed, +161/-3)

  This implements the encryption key management using the generic code in
  the qcrypto layer and exposes it to the user via qemu-img.

  This code adds another 'write_func' because the initialization write_func
  works directly on the underlying file, while amend works on an instance of
  the luks device.

  This commit also adds a 'hack/workaround' that Kevin Wolf (thanks) and I
  made so that the driver both supports write sharing (to avoid breaking
  existing users) and is safe against concurrent metadata updates (the
  keyslots). Eventually write sharing for the luks driver will be deprecated
  and removed together with this hack.

  The hack is that we ask (as a format driver) for BLK_PERM_CONSISTENT_READ,
  and then, when we want to update the keys, we unshare that permission. So
  if someone else has the image open, even read-only, the encryption key
  update will fail gracefully.

  Also thanks to Daniel Berrange for the idea of unsharing the read
  permission rather than the write permission, which avoids failures when
  the other user has the image open read-only.

  Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
  Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
  Reviewed-by: Max Reitz <mreitz@redhat.com>
  Message-Id: <20200608094030.670121-8-mlevitsk@redhat.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/crypto: rename two functions (Maxim Levitsky, 2020-07-06; 1 file changed, +12/-13)

  Rename write_func to create_write_func, and init_func to create_init_func.
  This is preparation for another write_func that will be used to update the
  encryption keys.

  No functional changes.

  Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
  Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
  Message-Id: <20200608094030.670121-7-mlevitsk@redhat.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/amend: refactor qcow2 amend options (Maxim Levitsky, 2020-07-06; 1 file changed, +38/-100)

  Some qcow2 create options can't be used for amend. Remove them from the
  qcow2 create options and add generic logic to detect such options in
  qemu-img.

  Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
  Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
  [mreitz: Dropped some iotests reference output hunks that became
   unnecessary thanks to "iotests: Make _filter_img_create more active"]
  Signed-off-by: Max Reitz <mreitz@redhat.com>
  Message-Id: <20200625125548.870061-12-mreitz@redhat.com>
* block/amend: separate amend and create options for qemu-img (Maxim Levitsky, 2020-07-06; 1 file changed, +94/-79)

  Some options are only useful for creation (or hard to amend, like the
  cluster size for qcow2), while other options are only useful for amend,
  like the upcoming keyslot management options for luks.

  Since currently only qcow2 supports amend, move all its options to a
  common macro and then include it in each action's option list.

  In the future it might be useful to remove from the amend list some
  options that are not supported for amend anyway and currently only cause
  an error message when amended.

  Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
  Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
  Reviewed-by: Max Reitz <mreitz@redhat.com>
  Message-Id: <20200608094030.670121-5-mlevitsk@redhat.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* block/amend: add 'force' option (Maxim Levitsky, 2020-07-06; 1 file changed, +1/-0)

  The 'force' option will be used for some unsafe amend operations. This
  includes things like erasing the last keyslot in luks-based formats (which
  destroys the data, unless the master key is backed up by external means),
  but that _might_ be the desired result.

  Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
  Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
  Reviewed-by: Max Reitz <mreitz@redhat.com>
  Message-Id: <20200608094030.670121-4-mlevitsk@redhat.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* qcrypto/core: add generic infrastructure for crypto options amendment (Maxim Levitsky, 2020-07-06; 2 files changed, +20/-0)

  This will be used first to implement luks keyslot management.

  block_crypto_amend_opts_init will be used to convert the qemu-img cmdline
  to QCryptoBlockAmendOptions.

  Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
  Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
  Message-Id: <20200608094030.670121-2-mlevitsk@redhat.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* qcow2: Fix preallocation on images with unaligned sizes (Alberto Garcia, 2020-07-06; 1 file changed, +14/-3)

  When resizing an image with qcow2_co_truncate() using the falloc or full
  preallocation modes, the code assumes that both the old and new sizes are
  cluster-aligned. There are two problems with this:

  1) The calculation of how many clusters are involved does not always get
     the right result. Example: creating a 60KB image and resizing it (with
     preallocation=full) to 80KB won't allocate the second cluster.

  2) No copy-on-write is performed, so in the previous example if there is
     a backing file then the first 60KB of the first cluster won't be filled
     with data from the backing file.

  This patch fixes both issues.

  Signed-off-by: Alberto Garcia <berto@igalia.com>
  Message-Id: <20200617140036.20311-1-berto@igalia.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
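  The cluster-count problem in 1) can be seen with a bit of arithmetic on
  the 60KB to 80KB example, assuming 64KB clusters (a sketch, not the actual
  qcow2_co_truncate() code; DIV_ROUND_UP is QEMU's rounding macro, the
  variable names are illustrative):

      uint64_t cluster_size = 64 * 1024;
      uint64_t old_length   = 60 * 1024;
      uint64_t new_length   = 80 * 1024;

      /* Counting only the size difference misses a cluster: */
      uint64_t naive = DIV_ROUND_UP(new_length - old_length, cluster_size);
      /* naive == 1, but the resize touches two clusters: the old, partially
       * filled last cluster (0..64K, whose tail also needs COW from a
       * backing file) and the brand-new cluster (64K..128K). */

      /* Counting via the cluster indexes of the old and new image end: */
      uint64_t involved = DIV_ROUND_UP(new_length, cluster_size)
                        - old_length / cluster_size;
      /* involved == 2 */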
* block/block-copy: block_copy_dirty_clusters: fix failure check (Vladimir Sementsov-Ogievskiy, 2020-07-06; 1 file changed, +3/-1)

  ret may be > 0 on the success path at this point. Fix the assertion, which
  may crash currently.

  Fixes: 4ce5dd3e9b5ee0fac18625860eb3727399ee965e
  Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Message-Id: <20200526181347.489557-1-vsementsov@virtuozzo.com>
  Signed-off-by: Max Reitz <mreitz@redhat.com>
* vvfat: Fix array_remove_slice() (Kevin Wolf, 2020-07-03; 1 file changed, +5/-37)

  array_remove_slice() calls array_roll() with array->next - 1 as the
  destination index. This is only correct for count == 1, otherwise we're
  writing past the end of the array. array->next - count would be correct.

  However, this is the only place ever calling array_roll(), so this rather
  complicated operation isn't even necessary.

  Fix the problem and simplify the code by replacing it with a single
  memmove() call. array_roll() can now be removed.

  Reported-by: Nathan Huckleberry <nhuck15@gmail.com>
  Signed-off-by: Kevin Wolf <kwolf@redhat.com>
  Message-Id: <20200623175534.38286-3-kwolf@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Signed-off-by: Kevin Wolf <kwolf@redhat.com>
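  Removing a slice with a single memmove() amounts to sliding the array tail
  down over the removed items; a sketch of the idea (field names follow
  vvfat's array_t as I recall it, treat the exact code as illustrative):

      /* Remove @count items starting at @index. */
      static void array_remove_slice_sketch(array_t *array, int index, int count)
      {
          assert(index >= 0 && count > 0 && index + count <= array->next);

          memmove(array->pointer + index * array->item_size,
                  array->pointer + (index + count) * array->item_size,
                  (array->next - index - count) * array->item_size);
          array->next -= count;
      }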
* vvfat: Check that updated filenames are valid (Kevin Wolf, 2020-07-03; 1 file changed, +24/-1)

  FAT allows only a restricted set of characters in file names, and for some
  of the illegal characters, it's actually important that we catch them: If
  filenames can contain '/', the guest can construct filenames containing
  "../" and escape from the assigned vvfat directory. The same problem could
  arise if ".." was ever accepted as a literal filename.

  Fix this by adding a check that all filenames are valid in
  check_directory_consistency().

  Reported-by: Nathan Huckleberry <nhuck15@gmail.com>
  Signed-off-by: Kevin Wolf <kwolf@redhat.com>
  Message-Id: <20200623175534.38286-2-kwolf@redhat.com>
  Reviewed-by: Eric Blake <eblake@redhat.com>
  Signed-off-by: Kevin Wolf <kwolf@redhat.com>
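  The added check boils down to rejecting names that FAT could never contain
  and that would let a guest escape the exported directory. A sketch of such
  a validity test (the exact character set and helper used by the patch may
  differ):

      #include <stdbool.h>
      #include <string.h>

      static bool fat_filename_is_valid(const char *name)
      {
          /* "." and ".." must never be accepted as ordinary file names. */
          if (name[0] == '\0' || !strcmp(name, ".") || !strcmp(name, "..")) {
              return false;
          }
          for (const char *p = name; *p; p++) {
              unsigned char c = (unsigned char)*p;
              /* Control characters, path separators and FAT's reserved set. */
              if (c < 0x20 || strchr("/\\:*?\"<>|", c)) {
                  return false;
              }
          }
          return true;
      }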
* block/nvme: support nested aio_poll() (Stefan Hajnoczi, 2020-06-23; 2 files changed, +58/-7)

  QEMU block drivers are supposed to support aio_poll() from I/O completion
  callback functions. This means completion processing must be re-entrant.

  The standard approach is to schedule a BH during completion processing and
  cancel it at the end of processing. If aio_poll() is invoked by a callback
  function then the BH will run. The BH continues the suspended completion
  processing.

  All of this means that request A's cb() can synchronously wait for request
  B to complete. Previously the nvme block driver would hang because it
  didn't process completions from nested aio_poll().

  Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
  Reviewed-by: Sergio Lopez <slp@redhat.com>
  Message-id: 20200617132201.1832152-8-stefanha@redhat.com
  Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
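  The schedule-then-cancel BH pattern described above looks roughly like
  this (a sketch of the idiom, not the exact nvme.c hunk; the completion_bh
  field is assumed to have been created with aio_bh_new() at queue setup):

      static void nvme_process_completion_sketch(NVMeQueuePair *q)
      {
          /* Re-arm first: if a request callback below calls aio_poll(), the
           * BH runs and resumes the completion processing we interrupted. */
          qemu_bh_schedule(q->completion_bh);

          /* ... walk the completion queue and invoke request callbacks ... */

          /* Finished without nesting: the BH is not needed after all. */
          qemu_bh_cancel(q->completion_bh);
      }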
* block/nvme: keep BDRVNVMeState pointer in NVMeQueuePair (Stefan Hajnoczi, 2020-06-23; 1 file changed, +38/-32)

  Passing around both BDRVNVMeState and NVMeQueuePair is unwieldy. Reduce
  the number of function arguments by keeping the BDRVNVMeState pointer in
  NVMeQueuePair. This will come in handy when a BH is introduced in a later
  patch and only one argument can be passed to it.

  Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
  Reviewed-by: Sergio Lopez <slp@redhat.com>
  Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
  Message-id: 20200617132201.1832152-7-stefanha@redhat.com
  Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* block/nvme: clarify that free_req_queue is protected by q->lock (Stefan Hajnoczi, 2020-06-23; 1 file changed, +1/-1)

  Existing users access free_req_queue under q->lock. Document this.

  Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
  Reviewed-by: Sergio Lopez <slp@redhat.com>
  Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
  Message-id: 20200617132201.1832152-6-stefanha@redhat.com
  Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* block/nvme: switch to a NVMeRequest freelist (Stefan Hajnoczi, 2020-06-23; 1 file changed, +54/-27)

  There are three issues with the current NVMeRequest->busy field:

  1. The busy field is accidentally accessed outside q->lock when request
     submission fails.
  2. Waiters on free_req_queue are not woken when a request is returned
     early due to submission failure.
  3. Finding a free request involves scanning all requests. This makes
     request submission O(n^2).

  Switch to an O(1) freelist that is always accessed under the lock.

  Also differentiate between NVME_QUEUE_SIZE, the actual SQ/CQ size, and
  NVME_NUM_REQS, the number of usable requests. This makes the code simpler
  than using NVME_QUEUE_SIZE everywhere and having to keep in mind that one
  slot is reserved.

  Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
  Reviewed-by: Sergio Lopez <slp@redhat.com>
  Message-id: 20200617132201.1832152-5-stefanha@redhat.com
  Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
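  A minimal sketch of such an O(1) freelist (illustrative types and field
  names; the real NVMeRequest layout differs):

      typedef struct Req {
          struct Req *free_next;   /* link, only touched under q->lock */
          /* ... per-request state ... */
      } Req;

      static Req *req_get(Req **free_head)            /* O(1) pop */
      {
          Req *r = *free_head;
          if (r) {
              *free_head = r->free_next;
          }
          return r;    /* NULL: all requests in flight, caller must wait */
      }

      static void req_put(Req **free_head, Req *r)    /* O(1) push */
      {
          r->free_next = *free_head;
          *free_head = r;          /* then wake any free_req_queue waiters */
      }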
* block/nvme: don't access CQE after moving cq.head (Stefan Hajnoczi, 2020-06-23; 1 file changed, +4/-1)

  Do not access a CQE after incrementing q->cq.head and releasing q->lock.
  It is unlikely that this causes problems in practice but it's a latent
  bug.

  The reason why it should be safe at the moment is that completion
  processing is not re-entrant and the CQ doorbell isn't written until the
  end of nvme_process_completion().

  Make this change now because QEMU expects completion processing to be
  re-entrant and later patches will do that.

  Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
  Reviewed-by: Sergio Lopez <slp@redhat.com>
  Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
  Message-id: 20200617132201.1832152-4-stefanha@redhat.com
  Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* block/nvme: drop tautologous assertion (Stefan Hajnoczi, 2020-06-23; 1 file changed, +0/-1)

  nvme_process_completion() explicitly checks cid so the assertion that
  follows is always true:

      if (cid == 0 || cid > NVME_QUEUE_SIZE) {
          ...
          continue;
      }
      assert(cid <= NVME_QUEUE_SIZE);

  Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
  Reviewed-by: Sergio Lopez <slp@redhat.com>
  Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
  Message-id: 20200617132201.1832152-3-stefanha@redhat.com
  Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* block/nvme: poll queues without q->lock (Stefan Hajnoczi, 2020-06-23; 1 file changed, +12/-0)

  A lot of CPU time is spent simply locking/unlocking q->lock during
  polling. Check for completion outside the lock to make q->lock disappear
  from the profile.

  Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
  Reviewed-by: Sergio Lopez <slp@redhat.com>
  Message-id: 20200617132201.1832152-2-stefanha@redhat.com
  Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
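  Conceptually, the fast path peeks at the completion queue before touching
  q->lock; a rough sketch with hypothetical helpers (the real code inspects
  the phase bit of the CQE at cq.head):

      static bool nvme_poll_queue_sketch(NVMeQueuePair *q)
      {
          /* Lock-free peek: if the entry at cq.head still carries the old
           * phase value, the device has posted nothing new and we can skip
           * the locked path.  A stale read only delays work to the next
           * poll; any hit is re-checked under q->lock before acting on it. */
          if (!nvme_cq_head_entry_is_new(q)) {      /* hypothetical helper */
              return false;
          }

          qemu_mutex_lock(&q->lock);
          nvme_process_completion(q);
          qemu_mutex_unlock(&q->lock);
          return true;
      }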
* qcow2: Tweak comments on qcow2_get_persistent_dirty_bitmap_size (Eric Blake, 2020-06-17; 1 file changed, +5/-4)

  For now, we don't have persistent bitmaps in any other formats, but that
  might not be true in the future. Make it obvious that our incoming
  parameter is not necessarily a qcow2 image, and therefore is limited to
  just the bdrv_dirty_bitmap_* API calls (rather than probing into qcow2
  internals).

  Suggested-by: Kevin Wolf <kwolf@redhat.com>
  Signed-off-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200608190821.3293867-1-eblake@redhat.com>
  Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: Refactor subdirectory recursion during make (Eric Blake, 2020-06-17; 1 file changed, +1/-0)

  Rather than listing block/monitor from the top-level Makefile.objs, we
  should instead list monitor from block/Makefile.objs.

  Suggested-by: Kevin Wolf <kwolf@redhat.com>
  Fixes: bb4e58c613
  Signed-off-by: Eric Blake <eblake@redhat.com>
  Message-Id: <20200608173339.3244211-1-eblake@redhat.com>
  Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: Call attention to truncation of long NBD exports (Eric Blake, 2020-06-10; 1 file changed, +13/-8)

  Commit 93676c88 relaxed our NBD client code to request export names up to
  the NBD protocol maximum of 4096 bytes without NUL terminator, even though
  the block layer can't store anything longer than 4096 bytes including NUL
  terminator for display to the user. Since this means there are some export
  names where we have to truncate things, we can at least try to make the
  truncation a bit more obvious for the user.

  Note that in spite of the truncated display name, we can still communicate
  with an NBD server using such a long export name; this was deemed nicer
  than refusing to even connect to such a server (since the server may not
  be under our control, and since determining our actual length limits gets
  tricky when nbd://host:port/export and nbd+unix:///export?socket=/path are
  themselves variable-length expansions beyond the export name but count
  towards the block layer name length).

  Reported-by: Xueqiang Wei <xuwei@redhat.com>
  Fixes: https://bugzilla.redhat.com/1843684
  Signed-off-by: Eric Blake <eblake@redhat.com>
  Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Message-Id: <20200610163741.3745251-3-eblake@redhat.com>
* block: Factor out bdrv_run_co() (Vladimir Sementsov-Ogievskiy, 2020-06-05; 1 file changed, +72/-121)

  We have a few bdrv_*() functions that can either spawn a new coroutine and
  wait for it with BDRV_POLL_WHILE() or use a fastpath if they are already
  running in a coroutine. All of them duplicate basically the same code.

  Factor the common code into a new function, bdrv_run_co().

  Signed-off-by: Kevin Wolf <kwolf@redhat.com>
  Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Message-id: 20200520144901.16589-1-vsementsov@virtuozzo.com
  [Factor out bdrv_run_co_entry too]
  Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
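  The factored-out helper has roughly the following shape (a sketch based on
  the description above; the sentinel value and exact signature are
  assumptions, not a copy of the patch):

      /* Run @entry in coroutine context and wait for it to finish, whether
       * or not we are already inside a coroutine.  @entry is expected to
       * store its result in *ret before returning. */
      static int bdrv_run_co_sketch(BlockDriverState *bs, CoroutineEntry *entry,
                                    void *opaque, int *ret)
      {
          if (qemu_in_coroutine()) {
              entry(opaque);                         /* fast path */
          } else {
              Coroutine *co = qemu_coroutine_create(entry, opaque);

              *ret = -EINPROGRESS;                   /* "not finished yet" */
              bdrv_coroutine_enter(bs, co);
              BDRV_POLL_WHILE(bs, *ret == -EINPROGRESS);
          }
          return *ret;
      }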